Closed sbamin closed 6 years ago
A few questions first 1) When you say n=60-80, do you mean 60-80 mutations, or 60-80 genomes? On the paper, Figure S6 of the supplemental material shows a plot of accuracy vs. data size (number of true positives). The accuracy rapidly goes up and then plateaus. If you're doing tumor-normal pair, you may want to consider use BamSurgeon to simulate synthetic mutations, e.g., https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines/bamSimulator Though be careful about those "multiThread" scripts. A few of them assumes human reference contig names. 2) Majority-vote consensus mode, i.e., no machine learning, it'll just annotate "PASS, LowQual, and REJECT" based on the number of tools agreeing on a mutation call. Of course those information, plus a few important sequencing features will be output into the VCF file, but the annotation itself is quite simplistic, and we didn't investigate precisely how call agreement translates to accuracy. 3) SomaticSeq has tumor-only mode, use "single" instead of "paired" in the run_somaticseq.py command. One thing about tumor-only, is that it's really hard (maybe impossible) to create a synthetic training and truth set using BAMSurgeon schemes.
Thanks!
For 1. I mean 80 genomes (~65 are paired Tumor-Normal, others are tumor-only). From S6 and S18,S19: I see that number of true positive mutations (~200) is what matters the most for training model. I can given a try to generate synthetic data (thanks for pointing out hardcoded lines in multithread scripts) but getting ground truth in Dog cancer genome is challenging given lack of validation data on somatic mutations.
run_somaticseq.py
but can't find an option.For 2: I guess I am better off using default training model than majority-vote then.
For 3. Sorry, I missed that option in run_somaticseq.py
. I second that concern on making synthentic set, and can use tumor-only mode for prediction only purpose.
I would first try to create synthetic tumor-normal training data based on scenario No. 3 described here.
If you fall back to "default" mode, you don't need to do anything. If you don't specify a classifier or ground truth, it'll just default to consensus mode. It won't try to score the variant output, but it'll retain all the caller and sequencing information, then you may do your own filtering if you wish.
The R script r_scripts/ada_model_builder_ntChange.R creates a classifier if it's run on a TSV file with ground truth variant annotated. You can remove features like this:
r_scripts/ada_model_builder_ntChange.R Ensemble.sSNV.tsv Consistent_Mates Inconsistent_Mates
, i.e., the strings after the input file are features to be excluded in training.
It's more clear now and I will follow up after working on some of those steps. You may close issue.
Thanks!
Hi Li,
Thanks for maintaining SomaticSeq with an excellent documentation and inline comments. I have just read SomaticSeq manuscript and am thinking of using it for the dog genome mutation data (matched tumor-normal wgs+wes data) I would much appreciate your comments on following:
Since top 18 or 20 features are largely based on individual callers and sequence quality, I like to give a try using default somaticseq features and model on canine mutation calls from gatk4 mutect2, varscan2 and lofreq. Based on
--somaticseq-train
option, I can train a model based on canine data (n=60-80), though truthset is limited (disease specific mutations, ~100-200) and not at par with one used in DREAM challenge.majority-vote consensus mode
instead of training or prediction mode, how much trade-off in precision/recall I should expect? My dataset approximates to setting C (~pure Normal, contaminated tumor) of Fig 3 in Fang et al. 2015.Does somaticseq have an option to run in tumor-only mode or preferably supplying panel of normal vcf instead of matched normal bam file?
Thanks, Samir