getzlab / MutSig2CV

MutSig2CV from Lawrence et al. 2014

duplicate patients removed when I set params.txt not to do this #28

Open eltonjrv opened 9 months ago

eltonjrv commented 9 months ago

Dear MutSig2CV team,

Even though I set params.txt not to remove duplicate patients, my run removes them anyway.


MUTSIG_VERSION = 2CV v3.11
LOADING DATA
Processing target list.
Loading mutations...
Keeping 471860/4208232 unique mutations.
Scanning for duplicate patients...
Comparing on the basis of coding mutations only...
convert_chr: assuming human for chrX/chrY
13 patients involved in an overlap.
1 cliques of overlapping samples.
1 unique samples.
Removing the following 12 duplicate patients:
. . .
Loading coverage models...
Enforcing target list.
Mapping mutations to targets:
Including noncoding mutations.
convert_chr: assuming human for chrX/chrY
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chr23 chr24
Removing 855/34379 mutations that fall outside target gene intervals.
Reassigning the following 2939 gene identities:
. . .
Looking up "effect" in mutation_type_dictionary_file
Omitting 2/33524 mutations of unknown "effect"
Converting mutation data...
convert_chr: assuming human for chrX/chrY
1 patients
WARNING: MutSig is not applicable to single patients.
Imputing callschemes
Error using *
Inner matrix dimensions must agree.
Error in MutSig_2CV_v3_11_core (line 308)
Error in MutSig_2CV_v3_11_wrapper (line 50)
MATLAB:innerdim

And this is my params.txt

number_of_categories_to_discover 3
skip_permutations false
maxperm 1e4
remove_duplicate_patients false

Any clue on why this is happening and how to circumvent it? I have only 13 patients, by the way.

Thanks, Elton

julianhess commented 9 months ago

Generally, the duplicate patient filter only gets activated when samples share many mutations because they are not from completely independent tumors, which violates MutSig’s statistical model. Are some of the samples you’re passing to MutSig multiple samples from the same patient (e.g. pre/post treatment, primary/met, etc.)?

eltonjrv commented 9 months ago

Hi Julian,

Answering your question: no, all 13 of my samples are completely independent of one another. To make the tool load mutations faster, I've been running initial trials on a selected subset of genes. I'm gradually increasing my trials from 1k to 5k genes and still get the same error (12 patients removed due to mutation overlaps, even though I've set it not to remove duplicates). I know that they share lots of mutations, but they are definitely not identical. Any other ideas for getting the tool to run without removing duplicates?

Thanks, Elton

PS: When attempting to run on the whole genome, I either get a "segmentation violation" error on a node with 128 GB of RAM, or the run stays stuck for 24 hours at the "Loading mutations..." step on a node with 800 GB of RAM on my HPC cluster.

julianhess commented 9 months ago

> No, all my 13 samples are completely independent from one another.

In that case, it sounds like something is wrong with your mutation calls. Here is the criterion used for determining whether a pair of patients are involved in an overlap:

https://github.com/getzlab/MutSig2CV/blob/0109e27e70478181695f31ca8dd281bb44f0b3af/src/new_find_duplicate_samples.m#L42

where ni is the number of mutations involved in an overlap between two patients, and fi is the fraction of mutations overlapping (fi = ni/max(n1, n2))

So each pair of patients in your cohort shares at least 10% of their mutations, which is implausible. Is it possible that there are contamination issues, or germline leakage?
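
To check this on your own calls outside of MutSig, a rough sketch like the following computes ni and fi = ni/max(n1, n2) for every pair of patients. It is written in Python, not taken from the MutSig source, and the function name `overlap_fractions`, the toy cohort, and the 0.10 cutoff (borrowed from the 10% figure above) are assumptions for illustration only. The actual criterion at the linked line may combine ni and fi differently, and MutSig compares coding mutations only, which this sketch does not filter for.

```python
from itertools import combinations


def overlap_fractions(mutations_by_patient):
    """Return {(p1, p2): (ni, fi)} for every pair of patients.

    mutations_by_patient maps a patient ID to a set of hashable mutation
    keys, e.g. (chromosome, position, ref_allele, alt_allele) tuples.
    ni is the number of shared mutations; fi = ni / max(n1, n2).
    """
    result = {}
    for p1, p2 in combinations(sorted(mutations_by_patient), 2):
        m1, m2 = mutations_by_patient[p1], mutations_by_patient[p2]
        ni = len(m1 & m2)                 # mutations present in both patients
        denom = max(len(m1), len(m2))     # larger of the two mutation counts
        fi = ni / denom if denom else 0.0
        result[(p1, p2)] = (ni, fi)
    return result


# Hypothetical toy cohort; real keys would come from your MAF file.
cohort = {
    "P1": {("1", 100, "A", "T"), ("2", 200, "C", "G"),
           ("3", 300, "G", "A"), ("4", 400, "T", "C")},
    "P2": {("1", 100, "A", "T"), ("2", 200, "C", "G"), ("5", 500, "G", "C")},
    "P3": {("7", 700, "A", "G")},
}

ASSUMED_CUTOFF = 0.10  # assumption based on the "10%" figure in this thread

fractions = overlap_fractions(cohort)
for pair, (ni, fi) in fractions.items():
    flag = "overlap" if fi >= ASSUMED_CUTOFF else ""
    print(pair, ni, round(fi, 2), flag)
```

If most pairs in a cohort of supposedly independent tumors come out flagged, listing the shared mutations themselves (the `m1 & m2` sets) is usually the quickest way to see whether they look like common germline variants or recurrent artifacts.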