bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
194 stars 53 forks source link

Using somaticseq for somatic calls in non-human species and tumor-only mode #63

Closed sbamin closed 6 years ago

sbamin commented 6 years ago

Hi Li,

Thanks for maintaining SomaticSeq with an excellent documentation and inline comments. I have just read SomaticSeq manuscript and am thinking of using it for the dog genome mutation data (matched tumor-normal wgs+wes data) I would much appreciate your comments on following:

  1. Since top 18 or 20 features are largely based on individual callers and sequence quality, I like to give a try using default somaticseq features and model on canine mutation calls from gatk4 mutect2, varscan2 and lofreq. Based on --somaticseq-train option, I can train a model based on canine data (n=60-80), though truthset is limited (disease specific mutations, ~100-200) and not at par with one used in DREAM challenge.

    • What would be a better approach to run somaticseq in this context?
    • If I default to majority-vote consensus mode instead of training or prediction mode, how much trade-off in precision/recall I should expect? My dataset approximates to setting C (~pure Normal, contaminated tumor) of Fig 3 in Fang et al. 2015.
  2. Does somaticseq have an option to run in tumor-only mode or preferably supplying panel of normal vcf instead of matched normal bam file?

Thanks, Samir

litaifang commented 6 years ago

A few questions first 1) When you say n=60-80, do you mean 60-80 mutations, or 60-80 genomes? On the paper, Figure S6 of the supplemental material shows a plot of accuracy vs. data size (number of true positives). The accuracy rapidly goes up and then plateaus. If you're doing tumor-normal pair, you may want to consider use BamSurgeon to simulate synthetic mutations, e.g., https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines/bamSimulator Though be careful about those "multiThread" scripts. A few of them assumes human reference contig names. 2) Majority-vote consensus mode, i.e., no machine learning, it'll just annotate "PASS, LowQual, and REJECT" based on the number of tools agreeing on a mutation call. Of course those information, plus a few important sequencing features will be output into the VCF file, but the annotation itself is quite simplistic, and we didn't investigate precisely how call agreement translates to accuracy. 3) SomaticSeq has tumor-only mode, use "single" instead of "paired" in the run_somaticseq.py command. One thing about tumor-only, is that it's really hard (maybe impossible) to create a synthetic training and truth set using BAMSurgeon schemes.

sbamin commented 6 years ago

Thanks!

For 1. I mean 80 genomes (~65 are paired Tumor-Normal, others are tumor-only). From S6 and S18,S19: I see that number of true positive mutations (~200) is what matters the most for training model. I can given a try to generate synthetic data (thanks for pointing out hardcoded lines in multithread scripts) but getting ground truth in Dog cancer genome is challenging given lack of validation data on somatic mutations.

For 2: I guess I am better off using default training model than majority-vote then.

For 3. Sorry, I missed that option in run_somaticseq.py. I second that concern on making synthentic set, and can use tumor-only mode for prediction only purpose.

litaifang commented 6 years ago

I would first try to create synthetic tumor-normal training data based on scenario No. 3 described here.

If you fall back to "default" mode, you don't need to do anything. If you don't specify a classifier or ground truth, it'll just default to consensus mode. It won't try to score the variant output, but it'll retain all the caller and sequencing information, then you may do your own filtering if you wish.

The R script r_scripts/ada_model_builder_ntChange.R creates a classifier if it's run on a TSV file with ground truth variant annotated. You can remove features like this: r_scripts/ada_model_builder_ntChange.R Ensemble.sSNV.tsv Consistent_Mates Inconsistent_Mates, i.e., the strings after the input file are features to be excluded in training.

sbamin commented 6 years ago

It's more clear now and I will follow up after working on some of those steps. You may close issue.

Thanks!