AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
BSD 2-Clause "Simplified" License

Extract COSMIC Signatures from TCGA #101

Closed: onebeingmay closed this issue 2 years ago

onebeingmay commented 2 years ago

Hello SigProfilerExtractor team, Thank you for developing such a nice tool! I am trying to extract COSMIC signatures from whole-exome sequencing data (VCFs downloaded from TCGA), but in the resulting activity table "COSMIC_SBS96_Activities_refit.txt" some signature activities are detected in only a few samples. Here are the first several rows and columns (each row is a sample):

SBS1    SBS2    SBS3    SBS4    SBS5    SBS8  SBS10a  SBS10b   SBS13   SBS15  SBS17a   SBS19
 193       0       0       0    1414       0       0       0       0       0       0       0
 286       0       0     651    2665       0       0       0       0       0       0       0
 146       0       0     988    2678       0       0       0       0     339       0       0
 701       0       0    1565    5768       0       0       0       0       0       0       0
 370       0       0     821    3910       0       0       0       0       0       0       0
2529       0       0       0   19502       0       0       0       0       0       0       0
 132       0       0       0    2370       0       0       0       0     641       0       0
 176       0       0     700    1089       0       0       0       0       0       0       0
 301       0       0       0    2589       0       0       0       0       0       0       0
 366       0       0       0    5519       0       0       0       0       0       0       0
1308       0       0       0    9134       0       0       0       0       0       0       0
 542       0       0     810    3852       0       0       0       0       0       0       0
 352       0       0       0    3102       0       0       0       0       0       0       0
1199       0       0       0   11443       0       0       0       0       0       0       0
 966       0       0       0    7113       0       0       0       0       0       0       0
 427       0       0    1462    2620       0       0       0       0       0       0       0
 210       0       0       0    2894       0       0       0       0       0       0       0
 556       0       0       0    3349       0       0       0       0       0       0       0
 527       0       0       0    5900       0       0       0       0       0       0       0
 364     670       0       0    5345       0       0       0     617       0       0       0
 230       0       0     444    1473       0       0       0       0       0       0       0
 230       0       0    1211    2225       0       0       0       0       0       0       0

Some of the samples show no activity at all for some signatures (the value is 0). While this may be real, I suspect I made a mistake. Here is my code for running the program:

from SigProfilerExtractor import sigpro as sig
if __name__ == '__main__':    
    sig.sigProfilerExtractor('vcf', 'result', 'data', reference_genome='GRCh38', cpu=24, minimum_signatures=1, maximum_signatures=30)

Any ideas would be greatly appreciated! Wenbin

mdbarnesUCSD commented 2 years ago

Hi @onebeingmay,

Each sample is a row in the activities matrix. Not all signatures will be present in each sample, so this is not unexpected. You can learn more about the output at the wiki page for SigProfilerExtractor.
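To see how sparse each signature is, one can count the samples with non-zero activity per column. The following is a minimal sketch that uses only four rows copied from the table quoted above. It assumes whitespace-separated columns with no sample-name column (as in the quoted snippet; the actual output file may differ), and `count_active_samples` is a hypothetical helper name, not part of SigProfilerExtractor.

```python
# Four rows copied from the activity table quoted in this thread.
# One sample per row, one signature per column, whitespace-separated.
TABLE = """\
SBS1 SBS2 SBS3 SBS4 SBS5 SBS8 SBS10a SBS10b SBS13 SBS15 SBS17a SBS19
193 0 0 0 1414 0 0 0 0 0 0 0
286 0 0 651 2665 0 0 0 0 0 0 0
146 0 0 988 2678 0 0 0 0 339 0 0
364 670 0 0 5345 0 0 0 617 0 0 0
"""


def count_active_samples(text):
    """Map each signature name to the number of samples with non-zero activity."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    counts = dict.fromkeys(header, 0)
    for row in rows:
        for name, value in zip(header, row):
            if float(value) > 0:
                counts[name] += 1
    return counts


counts = count_active_samples(TABLE)
n_samples = len(TABLE.strip().splitlines()) - 1  # minus the header line
for name, n in counts.items():
    print(f"{name}: active in {n} of {n_samples} samples")
```

In these four rows SBS1 and SBS5 are active in every sample, while SBS2 and SBS13 appear together in a single sample.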

Best, Mark

onebeingmay commented 2 years ago

Thanks for the information, Mark (@mdbarnesUCSD)! I reviewed my pipeline and still suspect I may have done something wrong:

  1. I am working with exome sequencing data, so I added exome=True and reran the program: sig.sigProfilerExtractor('vcf', 'result', 'data', reference_genome='GRCh38', cpu=24, minimum_signatures=5, maximum_signatures=20, exome=True). This time I got fewer zeros.
  2. My "SBS96 selection plot" looks quite different from the example plot in the wiki. The "Mean Sample Cosine Distance" keeps decreasing as k increases, but "Avg Stability" fluctuates.
  3. The results for different k correlate poorly. For example, here is the Pearson correlation matrix of SBS3 activity for k=10, 13, 15, and 17, all of which have decent "Mean Sample Cosine Distance" and "Avg Stability". The optimal k selected by the program is 15.
              k=10       k=13       k=15       k=17
      k=10  1.0000000  0.4861054  0.3489011  0.5399221
      k=13  0.4861054  1.0000000  0.3201084  0.3716092
      k=15  0.3489011  0.3201084  1.0000000  0.5588446
      k=17  0.5399221  0.3716092  0.5588446  1.0000000
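A matrix like the one above can be built by correlating, sample by sample, the activities assigned to one signature at each pair of k values. The following is a minimal sketch, assuming the per-k activity vectors are already loaded in the same sample order. `pearson` and `correlation_matrix` are hypothetical helper names, and the activity values are made-up placeholders, not data from this thread.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / sqrt(var_x * var_y)


def correlation_matrix(activities_by_k):
    """Return (sorted k values, matrix of pairwise Pearson correlations)."""
    ks = sorted(activities_by_k)
    matrix = [[pearson(activities_by_k[a], activities_by_k[b]) for b in ks]
              for a in ks]
    return ks, matrix


# Placeholder SBS3 activities for the same four samples at two k values.
# These numbers are invented for illustration only.
toy_activities = {
    10: [0, 120, 300, 50],
    13: [10, 100, 280, 90],
}

ks, matrix = correlation_matrix(toy_activities)
print("      " + "  ".join(f"k={k:<6}" for k in ks))
for k, row in zip(ks, matrix):
    print(f"k={k:<4}" + "  ".join(f"{value:8.4f}" for value in row))
```

The comparison is only meaningful if the samples are in the same order in every vector.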

My questions are:

  1. Is exome=True suitable for whole-exome sequencing?
  2. Why is the correlation between different k poor?

Thank you, Wenbin