AlexandrovLab / SigProfilerExtractor

SigProfilerExtractor allows de novo extraction of mutational signatures from data generated in a matrix format. The tool identifies the number of operative mutational signatures, their activities in each sample, and the probability for each signature to cause a specific mutation type in a cancer sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.
BSD 2-Clause "Simplified" License

Extraction of SBS10a/SBS10b signatures and row order in the input matrix #63

Closed rjaksik closed 3 years ago

rjaksik commented 3 years ago

Hi, I am using SigProfilerExtractor to determine the mutational signatures of a set of 43 WGS samples with polymerase epsilon exonuclease domain mutations. Some of the samples originate from TCGA and are known POLE mutants; however, in none of these cases does SigProfilerExtractor choose one of the POLE-specific SBS signatures (10a, 10b, or 14). The problem is most evident for sample TCGA-AA-3510, which shows a signature very similar to SBS10a combined with SBS10b, yet it is matched to SBS5 (95.16%) and SBS1 (4.84%):

Wrong_Order_TCGA-AA-3510_SBS96_Decomposition_Plots.pdf
Input matrix: TCGA-AA-3510_SBS96.txt

The figure was obtained with: sig.sigProfilerExtractor("matrix", "results_TCGA-AA-3510", "ORI_SigProfilerMatrix_TCGA-AA-3510_SBS96.txt", reference_genome="GRCh37", minimum_signatures=1, maximum_signatures=10, nmf_replicates=100, cpu=40)

The SigProfilerMatrix (SBS96) was created using my R script based on filtered MuTect (v1) results, since I don't have the VCFs. I initially thought that the problem might be related to a different order of mutation counts in the input matrix, which affected the plot above (the Original panel has a different mutation order compared to the Reconstructed one). However, after correcting the counts table and rerunning SigProfilerExtractor, another problem emerged: the reconstruction is not only poor but also significantly different from the first version:

Correct_Order_TCGA-AA-3510_SBS96_Decomposition_Plots.pdf
Input matrix: TCGA-AA-3510_SBS96_v2.txt

The two input files differ only in the row order.

I also have a minor issue with the De_Novo_map_to_COSMIC_SBS96.csv results file, which only rarely contains the breakdown of the reconstructed signature, while the breakdown is always shown in SBS96_Decomposition_Plots.pdf. Can this be controlled by any of the parameters? I would like to obtain the breakdown for each of my 43 samples, which doesn't seem to be possible in a multi-sample run; in most cases it can only be read from the plot when running a single sample.

I will be very grateful for your help.

Best regards, Roman

mishugeb commented 3 years ago

Hi Roman, Neither of the orders you are using is correct for SigProfilerExtractor input. I have attached a file (Samples.txt) with the correct order of the mutation types, which is an output of SigProfilerMatrixGenerator.
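For readers hitting the same problem, here is a minimal sketch of reordering a count matrix so that its rows follow the label order of a trusted reference file, such as the attached Samples.txt. The function name `reorder_rows` and the in-memory layout (a list of label/count pairs) are assumptions made for illustration; this is not code from SigProfilerExtractor, and the three-label reference below is a toy stand-in for a real 96-row order.

```python
def reorder_rows(rows, reference_labels):
    """Return rows reordered to follow reference_labels.

    rows: list of (label, counts) pairs, e.g. ("A[C>A]A", [12, 3]).
    reference_labels: labels in the order read from a trusted matrix.
    """
    by_label = {}
    for label, counts in rows:
        if label in by_label:
            raise ValueError(f"duplicate mutation type: {label}")
        by_label[label] = counts

    # Refuse to guess when the two label sets differ.
    missing = [lab for lab in reference_labels if lab not in by_label]
    unexpected = sorted(set(by_label) - set(reference_labels))
    if missing or unexpected:
        raise ValueError(
            f"label mismatch: missing={missing}, unexpected={unexpected}"
        )

    return [(lab, by_label[lab]) for lab in reference_labels]


# Toy example with three labels only; a real SBS96 matrix has 96 rows.
reference = ["A[C>A]A", "A[C>A]C", "A[C>A]G"]
shuffled = [("A[C>A]G", [7]), ("A[C>A]A", [1]), ("A[C>A]C", [4])]
reordered = reorder_rows(shuffled, reference)
```

Matching on labels rather than on position means the counts stay attached to their mutation types whatever order the original script wrote them in.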

Also, if the de novo signatures don't decompose into COSMIC signatures with a cosine similarity of at least 0.8, you will not see any breakdown in the De_Novo_map_to_COSMIC_SBS96.csv results file. However, you will still see the breakdown in the decomposition plot.
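To make the 0.8 criterion concrete, the sketch below computes the cosine similarity between two profiles and compares it with that threshold. It only illustrates the metric: the helper names are hypothetical, and how SigProfilerExtractor computes and applies the cutoff internally is not shown in this thread.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_u * norm_v)


THRESHOLD = 0.8  # value quoted in the reply; assumed to be a fixed cutoff here


def passes_threshold(de_novo, reconstruction, threshold=THRESHOLD):
    """True if the reconstruction is at least `threshold` similar."""
    return cosine_similarity(de_novo, reconstruction) >= threshold


# Toy vectors: profiles with no overlap at all have similarity 0.
no_overlap = cosine_similarity([1, 0], [0, 1])
same_shape = passes_threshold([1, 2, 3], [2, 4, 6])
```

Because cosine similarity ignores overall scale, a profile and a multiple of it count as identical in shape, which is why the measure suits comparing signatures with different mutation totals.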

Thanks, Mishu

rjaksik commented 3 years ago

Hi Mishu, Thank you, this indeed solved the problem, providing a much better fit that includes signatures 10a and 10b.

I assumed that the software checks what’s in the first column. It would be great to have an input validation step, as this problem may lead to some false findings.
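A validation step of this kind could look roughly like the sketch below, which checks that the first-column labels of an SBS96 matrix are exactly the 96 expected trinucleotide-context labels, with no duplicates or unknown entries. The function name `check_sbs96_labels` is hypothetical and this is not the check the maintainers later added. It validates the set of labels only, not the row order.

```python
from itertools import product

BASES = "ACGT"
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# All 96 labels of the form "A[C>A]A". This is a set, so it says nothing
# about the row order the tool expects.
SBS96_LABELS = {
    f"{five}[{sub}]{three}"
    for five, sub, three in product(BASES, SUBSTITUTIONS, BASES)
}


def check_sbs96_labels(labels):
    """Return a list of problems found in the first-column labels."""
    problems = []
    if len(labels) != 96:
        problems.append(f"expected 96 rows, found {len(labels)}")

    seen, duplicates = set(), set()
    for lab in labels:
        if lab in seen:
            duplicates.add(lab)
        seen.add(lab)
    if duplicates:
        problems.append(f"duplicate labels: {sorted(duplicates)}")

    unknown = sorted(seen - SBS96_LABELS)
    if unknown:
        problems.append(f"unknown labels: {unknown}")

    missing = sorted(SBS96_LABELS - seen)
    if missing:
        problems.append(f"missing labels: {missing}")
    return problems


# A complete label list passes; replacing one label with a typo does not.
ok = check_sbs96_labels(sorted(SBS96_LABELS))
bad = check_sbs96_labels(sorted(SBS96_LABELS)[:-1] + ["A[G>T]A"])
```

Returning a list of problems instead of raising on the first one lets a caller report everything wrong with a file in a single pass.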

Best regards, Roman

mishugeb commented 3 years ago

Thanks for the suggestion. We will include the validation step.

Best, Mishu