AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
BSD 2-Clause "Simplified" License

Repeating SigProfilerExtractor on ID83 signatures for PCAWG Head-SCC tumors generates different results #94

Closed ipstone closed 3 years ago

ipstone commented 3 years ago

Hello,

Thanks for the nice package and great contributions to the understanding of mutation processes.

I am trying to repeat the ID83 signature result, in Head-SCC (n=57) tumors, with the PCAWG released mutation files. However, I could not find ID signature 6's contribution/activities in the ID83 signatures, though I consistently found ID1, ID2, ID8, etc. (although the activities are a little different from the publication figure 3).

I have attached my ID83 decomposition and COSMIC_ID83_TMB_plot_refit.pdf here. What might be the cause of the different results?

COSMIC_ID83_TMB_plot_refit.pdf

ID83_Decomposition_Plots.pdf

One small detail I noticed: for Head-SCC, the number of samples shown is n=54 (on top of the Figure 3 indel panel). Could it be that 3 samples were excluded from the signature extraction step? If so, what were the criteria for excluding these 3 samples?

Thanks a lot in advance, Isaac

marcos-diazg commented 3 years ago

Hi Isaac!

Thanks so much for your interest in our tool! Regarding your comments, it's important to mention that SigProfilerExtractor has undergone significant improvements since the results presented in Alexandrov et al. 2020 Nature, which were generated some years ago. You can find more details about the newest computational upgrades in our preprint, Islam et al. 2021 bioRxiv.

For your particular analysis, it seems that you are using a different number of samples from the original PCAWG Head-SCC cohort (77 according to your TMB plot). Also, in the Figure 3 you mentioned, n=54 is the number of samples that remain after keeping only those whose mutational signature reconstruction accuracy is greater than 0.9 (as described here and in the figure legend). You can find the complete dataset of signature attributions at https://www.synapse.org/#!Synapse:syn11738668.
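To make the filtering step concrete, here is a minimal sketch of an accuracy filter of this kind. It assumes that "accuracy" means the cosine similarity between a sample's observed mutation counts and the counts reconstructed from the assigned signatures; that interpretation, the function names, and the toy 4-channel profiles (instead of the real 83 indel channels) are illustrative assumptions, not the code used for the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equally long count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile cannot be matched
    return dot / (norm_a * norm_b)

def samples_passing(original, reconstructed, threshold=0.9):
    """Return sample IDs whose reconstruction similarity exceeds the threshold."""
    return [
        sample
        for sample in original
        if cosine_similarity(original[sample], reconstructed[sample]) > threshold
    ]

# Toy data (invented): observed vs. reconstructed counts per sample.
original = {"S1": [10, 0, 5, 1], "S2": [3, 3, 3, 3], "S3": [0, 8, 0, 2]}
reconstructed = {"S1": [9, 1, 5, 1], "S2": [3, 2, 4, 3], "S3": [6, 1, 5, 0]}

kept = samples_passing(original, reconstructed)
print(kept)  # -> ['S1', 'S2']
```

In this toy case S3 is dropped because its reconstruction is poor, which is how a cohort of 57 samples could be reported as n=54 in a figure.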

Hope this helps and please let us know if you have further questions.

Marcos

ipstone commented 3 years ago

Thank you Marcos for the nice update; indeed, in my previous run I included some additional tumors beyond the 57 HNSC cases in the paper.

I did run SigProfilerExtractor on the 57 HNSC samples alone, and the result is similar to what is shown above (I will post the plots here soon). We are a little puzzled that the ID6 signature 'consistently' disappears in our runs, whereas it shows up for HNSC in the paper.

When the COSMIC ID signature activity was decomposed in the Nature 2020 paper, was it decomposed with all the tumors' data (~2778 tumors) together? If so, that might explain why ID6 is gone when I try to discover and decompose into COSMIC signatures using the HNSC tumor data alone: perhaps a 'local' NMF factorization (among HNSC tumors alone) finds a different 'best' fit solution. Thoughts? Thanks again

Isaac

ipstone commented 3 years ago

Hi Marcos and everyone,

Here are the SigProfilerExtractor results using the 57 HNSC WGS tumors, with the default settings. ID8 consistently shows up along with the others, but we cannot identify the ID6 signature.

COSMIC_ID83_TMB_plot_refit.pdf ID_83_plots_COSMIC_ID83.pdf ID83_Decomposition_Plots.pdf

I am using the current version of SigProfilerExtractor (installed through pip). I also tried the COSMIC 3.0 and 3.2 signature versions and got the same results (my guess is that there is no change in the ID signatures between these versions).

As mentioned above, my current thought is that when only HNSC tumors are included, the NMF factorization finds a different optimal fit than when extracting with all tumor types. I would love to hear your thoughts. Thanks!
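For anyone making the same comparison, a small sketch of how the two sets of attributions could be checked against each other is below. It lists the signatures that carry activity in one activity table but not in the other. The table layout (sample, then signature, then mutation count), the function name, and all numbers are invented for illustration; they are not PCAWG values or SigProfilerExtractor output.

```python
def active_signatures(activities, min_total=1):
    """Return the set of signatures whose summed activity reaches min_total."""
    totals = {}
    for per_sample in activities.values():
        for signature, count in per_sample.items():
            totals[signature] = totals.get(signature, 0) + count
    return {sig for sig, total in totals.items() if total >= min_total}

# Invented toy activity tables: sample -> signature -> attributed mutations.
published = {
    "T1": {"ID1": 40, "ID2": 25, "ID6": 12, "ID8": 30},
    "T2": {"ID1": 55, "ID2": 10, "ID6": 0, "ID8": 21},
}
rerun = {
    "T1": {"ID1": 44, "ID2": 27, "ID6": 0, "ID8": 36},
    "T2": {"ID1": 57, "ID2": 9, "ID6": 0, "ID8": 20},
}

# Signatures active in the published table but absent from the rerun, and vice versa.
missing = sorted(active_signatures(published) - active_signatures(rerun))
gained = sorted(active_signatures(rerun) - active_signatures(published))
print(missing, gained)  # -> ['ID6'] []
```

Raising `min_total` would also flag signatures that are nominally present but carry very few mutations across the cohort.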

Isaac

marcos-diazg commented 3 years ago

Hi again Isaac,

Indeed, as I mentioned before, first of all, it's important to realize that the analyses displayed in the Nature 2020 paper were done some years ago, with a different tool. SigProfilerExtractor has undergone significant changes after that, so exact reproducibility is not expected (you have the details in the mentioned preprint).

Regarding your particular questions, the reference ID signatures in the paper were derived from the full PCAWG cohort following the methods described there, but you can find local extractions for every cancer type here: https://www.synapse.org/#!Synapse:syn11853328. COSMIC v3.1 and v3.2 differ from v3.0 in terms of ID signatures, since they include the novel ID18 signature related to colibactin exposure. I would suggest having a look at the COSMIC mutational signatures website (https://cancer.sanger.ac.uk/signatures/), where you can find lots of useful information about every signature and the different versions.

Hope this helps. Please let us know if you have additional technical issues and happy to continue the research discussion over email when you want. You can find my contact information on my GitHub profile.

Thanks again for your interest!

Marcos

ipstone commented 3 years ago

Thanks Marcos! Your reply is very helpful.

I was curious what the results would look like when using the latest version of SigProfilerExtractor on the published version of the PCAWG data (I am not sure whether the COSMIC website is re-running the analysis with the latest version of the software).

I will close this ticket and will probably reach out to you via email about some finer details. Thanks again for your feedback!

Isaac