Getting similarities for CNV signatures

rajeha commented 2 years ago

Thank you for this great package! I have been following the documentation and performed CNV signature extraction using the Wang method. I am wondering if there is a method to make sense of extracted CNV signatures? I am able to visualize the CNV signatures but would like to get similarity estimates as is possible for single and dinucleotide polymorphisms (which works fine). I have tried getting similarity estimates like so:

sim <- get_sig_similarity(cn_sig2, sig_db = "CNS_USARC")

Error in get_sig_similarity(cn_sig2, sig_db = "CNS_USARC") : 
  The following components cannot be found in reference!
5+:het:0-10Kb 5+:het:10Kb-100Kb 5+:het:100Kb-1Mb 5+:het:1Mb-10Mb 5+:het:>10Mb 5+:LOH:0-10Kb 5+:LOH:10Kb-100Kb 5+:LOH:100Kb-1Mb 5+:LOH:1Mb-10Mb 5+:LOH:>10Mb 0-1:homdel:0-10Kb 0-1:homdel:10Kb-100Kb 0-1:homdel:100Kb-1Mb 0-1:homdel:1Mb-10Mb 0-1:LOH:10Kb-100Kb 0-1:LOH:100Kb-1Mb 0-1:LOH:1Mb-10Mb 0-1:LOH:>10Mb 3-4:het:10Kb-100Kb 3-4:het:100Kb-1Mb 3-4:het:1Mb-10Mb 3-4:het:>10Mb 3-4:LOH:0-10Kb 3-4:LOH:10Kb-100Kb 3-4:LOH:100Kb-1Mb 3-4:LOH:1Mb-10Mb 3-4:LOH:>10Mb 2:het:10Kb-100Kb 2:het:100Kb-1Mb 2:het:1Mb-10Mb 2:LOH:0-10Kb 2:LOH:10Kb-100Kb 2:LOH:100Kb-1Mb 2:LOH:1Mb-10Mb 2:LOH:>10Mb 0-1:LOH:0-10Kb 3-4:het:0-10Kb 2:het:>10Mb 2:het:0-10Kb 0-1:homdel:>10Mb NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Here's what some of my signatures look like:

$Raw$W
                   Sig1         Sig2         Sig3
BP10MB[0]  8.076369e+00 8.951173e+01 9.578712e+00
BP10MB[1]  7.565738e-01 2.859883e-01 8.763178e+00
BP10MB[2]  3.623968e+00 1.875211e+00 7.437020e-01
BP10MB[3]  3.376148e-01 1.033401e-01 6.091240e-01
BP10MB[4]  1.090426e+00 8.218691e-02 2.981371e-72
BP10MB[5]  9.665239e-02 0.000000e+00 6.634963e-02
BP10MB[>5] 5.467448e-01 0.000000e+00 2.765382e-02
BPArm[0]   1.015596e-12 1.137852e+01 4.762342e-52
BPArm[1]   1.482990e-01 1.569816e-01 6.689871e-01
BPArm[2]   3.846446e-69 1.383956e+00 9.023692e-01

Would appreciate any guidance. Thanks!

ShixiangWang commented 2 years ago

@rajeha Hi, for the Wang approach (published in PLOS Genetics), we only generated reference signatures for prostate cancer. If you are not studying prostate cancer, I think you will need to explore the meaning of the signatures on your own. You may want to refer to https://xsliulab.github.io/PC_CNA_signature/.

If you want to interpret copy number signatures against existing discoveries, you may want to use method "S" (the method described in Steele et al. 2019). This year, they published many reference signatures in Nature. After you extract signatures with method "S", you can obtain similarities by specifying sig_db = "CNS_TCGA", or you can obtain the reference signatures from COSMIC.

ShixiangWang commented 2 years ago

The prostate reference signature is attached here: Sig.CNV.seqz.W.RData.zip

rajeha commented 2 years ago

@ShixiangWang thank you for the quick response. Using method S allowed me to get similarities to the CNS_USARC set, but not the CNS_TCGA set (same error as above). Any suggestions?

> sim <- get_sig_similarity(cn_sig, sig_db = 'CNS_TCGA')
Error in get_sig_similarity(cn_sig, sig_db = "CNS_TCGA") : 
  The following components cannot be found in reference!
0:homdel:0-100Kb 0:homdel:100Kb-1Mb 0:homdel:>1Mb 1:LOH:0-100Kb 1:LOH:100Kb-1Mb 1:LOH:1Mb-10Mb 1:LOH:10Mb-40Mb 1:LOH:>40Mb 2:LOH:0-100Kb 2:LOH:100Kb-1Mb 2:LOH:1Mb-10Mb 3-4:LOH:0-100Kb 3-4:LOH:100Kb-1Mb 3-4:LOH:>40Mb 5-8:LOH:0-100Kb 5-8:LOH:100Kb-1Mb 5-8:LOH:1Mb-10Mb 9+:LOH:0-100Kb 9+:LOH:100Kb-1Mb 9+:LOH:1Mb-10Mb 2:het:0-100Kb 2:het:100Kb-1Mb 2:het:1Mb-10Mb 2:het:10Mb-40Mb 2:het:>40Mb 3-4:het:0-100Kb 3-4:het:100Kb-1Mb 3-4:het:1Mb-10Mb 3-4:het:10Mb-40Mb 3-4:het:>40Mb 5-8:het:0-100Kb 5-8:het:100Kb-1Mb 5-8:het:1Mb-10Mb 5-8:het:10Mb-40Mb 5-8:het:>40Mb 9+:het:0-100Kb 9+:het:100Kb-1Mb 9+:het:1Mb-10Mb 9+:het:10Mb-40Mb 9+:het:>40Mb

> sim <- get_sig_similarity(cn_sig, sig_db = 'CNS_USARC')
-Comparing against COSMIC signatures
------------------------------------
--Found Sig1 most similar to USARC_CNS4
   Aetiology: See https://doi.org/10.1016/j.ccell.2019.02.002 [similarity: 0.637]
--Found Sig2 most similar to USARC_CNS4
   Aetiology: See https://doi.org/10.1016/j.ccell.2019.02.002 [similarity: 0.79]
--Found Sig3 most similar to USARC_CNS4
   Aetiology: See https://doi.org/10.1016/j.ccell.2019.02.002 [similarity: 0.344]
------------------------------------
Return result invisiblely.

ShixiangWang commented 2 years ago

@rajeha The "S" approach generates two matrices with different catalogs: the "USARC" catalog has 40 components (designed in a Cancer Cell paper), while the "TCGA" catalog has 48 components (designed in a Nature paper). To get similarities to CNS_TCGA, you have to extract signatures from the 48-component matrix.

A full example is given below:

library(sigminer)

# Load the toy segment table shipped with sigminer
load(system.file("extdata", "toy_segTab.RData",
                 package = "sigminer", mustWork = TRUE
))
# Read copy number data without allele-specific information
cn <- read_copynumber(segTabs,
                      seg_cols = c("chromosome", "start", "end", "segVal"),
                      genome_build = "hg19", complement = FALSE
)
cn

# Add a simulated minor copy number column so that LOH status can be derived
set.seed(1234)
segTabs$minor_cn <- sample(c(0, 1), size = nrow(segTabs), replace = TRUE)
cn <- read_copynumber(segTabs,
                      seg_cols = c("chromosome", "start", "end", "segVal"),
                      genome_measure = "wg", complement = TRUE, add_loh = TRUE
)
# Use tally method "S" (Steele et al.)
tally_s <- sig_tally(cn, method = "S")

# Extract signatures from the 48-component matrix, which matches CNS_TCGA
cn_sig <- sig_extract(tally_s$all_matrices$CN_48, n_sig = 2)
get_sig_similarity(cn_sig, sig_db = "CNS_TCGA")
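
As a rough illustration of why the catalogs have to match, here is a base R sketch with toy values. The helpers component_mismatch() and cosine_similarity() are made up for this example and are not sigminer functions. I am also assuming cosine similarity as the measure, which as far as I know is the default in get_sig_similarity. The component names are borrowed from the error messages above, and the numbers are invented.

```r
# Hypothetical helper (not part of sigminer): list component names that
# appear in only one of the two matrices.
component_mismatch <- function(sig, ref) {
  list(
    only_in_sig = setdiff(rownames(sig), rownames(ref)),
    only_in_ref = setdiff(rownames(ref), rownames(sig))
  )
}

# Hypothetical helper: cosine similarity between each column of `sig` and
# each column of `ref`, after aligning rows by component name.
cosine_similarity <- function(sig, ref) {
  mm <- component_mismatch(sig, ref)
  if (length(mm$only_in_sig) > 0 || length(mm$only_in_ref) > 0) {
    stop("Signature and reference use different component catalogs.")
  }
  ref <- ref[rownames(sig), , drop = FALSE]
  # Scale every column to unit length, then take all pairwise dot products
  sig_unit <- sweep(sig, 2, sqrt(colSums(sig^2)), "/")
  ref_unit <- sweep(ref, 2, sqrt(colSums(ref^2)), "/")
  crossprod(sig_unit, ref_unit)
}

# Toy reference with three components (a real catalog has 40 or 48)
ref <- matrix(
  c(1, 0, 0,
    0, 1, 1),
  nrow = 3,
  dimnames = list(
    c("0:homdel:0-100Kb", "1:LOH:0-100Kb", "2:het:0-100Kb"),
    c("Ref1", "Ref2")
  )
)

# Same components as the reference, in a different row order
sig_ok <- matrix(
  c(3, 3, 0),
  nrow = 3,
  dimnames = list(
    c("1:LOH:0-100Kb", "2:het:0-100Kb", "0:homdel:0-100Kb"),
    "Sig1"
  )
)
sim <- cosine_similarity(sig_ok, ref)

# One component comes from a different catalog, so the names do not line up
sig_bad <- sig_ok
rownames(sig_bad)[2] <- "2:het:0-10Kb"
mm <- component_mismatch(sig_bad, ref)
```

Here sim holds one row per extracted signature and one column per reference signature, while mm shows which names fail to line up. A non-empty mm is the same kind of situation that triggers the "cannot be found in reference" error.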

ShixiangWang commented 2 years ago

I am closing this as it is completed. Feel free to reopen it if you have further questions.

rajeha commented 2 years ago

Yes that was super helpful, thanks!

ShixiangWang / sigminer

Getting similarities for CNV signatures #414