How exactly is 96 context counted?

kimin0402 commented 3 years ago

Hi, thank you for all your effort. This package is wonderful.

I was using SigProfilerMatrixGeneratorFunc to count 96 context mutations and found that the result is different between 'SNV only VCF' and 'SNV + INDEL VCF'.

For example, my VCF has a total of 5482 mutations, of which 4510 are SNVs and 972 are INDELs. (I obtained the number of SNVs and INDELs using the Python package cyvcf2 and GATK SelectVariants, and the results are consistent.)

I made two VCFs, one containing only SNVs (total of 4510 mutations) and the other containing SNVs and INDELs (total of 5482 mutations). When I run these two VCFs through SigProfilerMatrixGeneratorFunc, these are the results I get. The actual code is as follows:

import numpy as np  # needed for np.array_equal below
from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen

# Generate matrices for the SNV-only VCF and for the SNV + INDEL VCF
sigprofiler_snv_count = matGen.SigProfilerMatrixGeneratorFunc("matrix_generation_snv", "GRCh37", "02_vcf/test_snv", tsb_stat=True)
sigprofiler_both_count = matGen.SigProfilerMatrixGeneratorFunc("matrix_generation_both", "GRCh37", "02_vcf/test_both", tsb_stat=True)
# Compare the 96-context matrices from the two runs
print(np.array_equal(sigprofiler_both_count['96'], sigprofiler_snv_count['96']))

and the result is like this:

Matrices generated for 1 samples with 0 errors. Total of 4928 SNVs, 209 DINUCs, and 752 INDELs were successfully analyzed.
Matrices generated for 1 samples with 0 errors. Total of 4510 SNVs, 0 DINUCs, and 0 INDELs were successfully analyzed.
False

I looked at the source code of SigProfilerMatrixGeneratorFunc, and it seems that the extra SNV counts come from VCF rows where REF and ALT both have length 2. When SigProfilerMatrixGeneratorFunc processes these rows, it divides each of them into two separate SNVs. Indeed, if the number of DINUCs (209) is multiplied by two and subtracted from the number of SNVs (4928), the result is 4510.
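The counting behavior described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the SigProfilerMatrixGenerator implementation: the function name `count_variants`, the `split_dbs` flag, and the classification by REF/ALT length alone are all assumptions made for this example.

```python
def count_variants(records, split_dbs=True):
    """Count SNVs, DBSs and indels from (ref, alt) pairs.

    Hypothetical helper for illustration only; classification is based
    purely on allele lengths, which is a simplifying assumption.
    """
    counts = {"SNV": 0, "DBS": 0, "INDEL": 0, "OTHER": 0}
    for ref, alt in records:
        if len(ref) == 1 and len(alt) == 1:
            counts["SNV"] += 1
        elif len(ref) == 2 and len(alt) == 2:
            counts["DBS"] += 1
            if split_dbs:
                # Each doublet is also counted as two single-base changes
                counts["SNV"] += 2
        elif len(ref) != len(alt):
            counts["INDEL"] += 1
        else:
            counts["OTHER"] += 1
    return counts


# Toy input: two plain SNVs, one doublet, one insertion
records = [("C", "T"), ("CC", "TT"), ("A", "AT"), ("G", "A")]
split = count_variants(records, split_dbs=True)
unsplit = count_variants(records, split_dbs=False)

# Removing two SNVs per doublet recovers the SNV-only count
recovered = split["SNV"] - 2 * split["DBS"]
print(split, unsplit, recovered)
```

Applying the same relation to the numbers reported in this issue gives 4928 - 2 * 209 = 4510, which matches the SNV-only run.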

I am a bit confused by this result. Is this how the mutations were counted for extracting COSMIC signatures (both v2 and v3)? If so, what is the rationale behind it? It seems that mutations from DINUCs could influence both SNV signatures and DINUC signatures. If we are taking into account all the SNVs from dinucleotide mutations separately, why not do the same for trinucleotide mutations? (Although the probability of a trinucleotide mutation is very low, I think the code should still take this case into account.)

ebergstr commented 3 years ago

Hi, thank you for the feedback and questions!

You are correct in noticing that we separate DBSs into two single SNVs, which are processed as both SNVs and DBSs for matrix generation purposes (which is how we have historically performed this analysis). To remove these multi-base substitution events that violate the assumption that each mutation is independent (DBSs, multi-base substitutions, and other clustered events such as large kataegic events), we are developing SigProfilerHotSpots, which will take a collection of genomes and partition mutations into clustered and non-clustered sets in a sample-dependent manner. These two partitions can subsequently be run through the extractor tool separately, thus removing any effects of clustered events on the general, non-clustered SNV signatures.
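As a rough illustration of what partitioning mutations into clustered and non-clustered sets can look like, here is a minimal Python sketch. It is not the tool mentioned above: the function name `partition_by_distance` and the fixed `max_gap` cutoff are assumptions made for this example, whereas the tool is described as choosing its partition in a sample-dependent manner.

```python
def partition_by_distance(positions, max_gap=1000):
    """Split positions on one chromosome into clustered and isolated sets.

    Hypothetical helper: a mutation is called clustered if a neighbouring
    mutation lies within max_gap bases. The fixed cutoff is an assumption.
    """
    positions = sorted(positions)
    clustered, isolated = [], []
    for i, pos in enumerate(positions):
        near_prev = i > 0 and pos - positions[i - 1] <= max_gap
        near_next = i + 1 < len(positions) and positions[i + 1] - pos <= max_gap
        if near_prev or near_next:
            clustered.append(pos)
        else:
            isolated.append(pos)
    return clustered, isolated


# Toy positions: an adjacent pair (a doublet), one isolated SNV, one close pair
clustered, isolated = partition_by_distance([100, 101, 50_000, 120_000, 120_400])
print(clustered, isolated)
```

The two resulting lists could then be analyzed separately, which is the idea behind running the clustered and non-clustered partitions through the extractor independently.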

This tool is currently under development, but it should be released in the next few months.

Best, Erik

kimin0402 commented 3 years ago

Thank you Erik

andreyurch commented 3 years ago

Dear developers,

Would it not be possible to add an option for SigProfilerMatrixGenerator to treat DBSs and SBSs separately? For some mutational processes (like UV light), we are pretty sure that DBSs (CC>TT) really are DBSs and not two SBSs that occurred next to each other. Counting them as both DBSs and SBSs simultaneously would therefore be a methodological error.

Best regards, Andrey

ebergstr commented 3 years ago

By default, the tool does this with our seqInfo parameter set to True. Indeed, we (and others) have shown previously that DBSs are likely single events, so using this parameter will reflect this methodology.

Best, Erik

