bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
192 stars 53 forks source link

Special setting for b37? #112

Closed yoda0612 closed 2 years ago

yoda0612 commented 2 years ago

Is there any special setting should I do, if I use b37 as the reference? Becasue after I tried to use b37, the result of vcf calling seems not better than just use TNScope or MuTect2.

litaifang commented 2 years ago

No setting needed for any different reference files. Different reference genomes should give you slightly different results due to alignment differences. Ones with decoy contigs should reduce the number of false positives. Your b37 results were better or worse comparing to what?

yoda0612 commented 2 years ago

I used the comparsion tool hap.py to verify the results produced by SomaticSeq (input vcfs including MuTect2, Tnscope, LoFreg, VarDict and Strelka) with the results just from callers, MuTect2 and Tnscope. Seems the results of SomaticSeq were worse than both MuTect2 and Tnscope.

My script is: /opt/somaticseq/somaticseq/run_somaticseq.py \ --output-directory /volume/cyvolume/somaticseq/output/SomaticSeq \ --genome-reference /volume/cyvolume/somaticseq/human_g1k_v37_decoy.fasta \ --inclusion-region /volume/cyvolume/somaticseq/output/genome.bed \ --dbsnp-vcf /volume/cyvolume/somaticseq/dbsnp_138.b37.vcf \ --algorithm xgboost \ single \ --bam-file /volume/cyvolume/somaticseq/SRR13076390.realigned.bam \ --mutect2-vcf /volume/cyvolume/somaticseq/output/MuTect2.vcf \ --vardict-vcf /volume/cyvolume/somaticseq/output/VarDict.vcf \ --strelka-vcf /volume/cyvolume/somaticseq/output/Strelka/results/variants/variants.vcf.gz \ --arbitrary-snvs /volume/cyvolume/somaticseq/output/tnscope_snvs.vcf \ --arbitrary-indels /volume/cyvolume/somaticseq/output/tnscope_indels.vcf \ --lofreq-vcf /volume/cyvolume/somaticseq/output/LoFreq.vcf

I also noticed that every time I run SomaticSeq, I got the warning message: 022-03-28 02:28:55,091 - single_sample_vcf2tsv.py - INFO - NO RE-SCALING INFO 2022-03-28 02:28:55,091 single_sample_vcf2tsv.py NO RE-SCALING /usr/local/lib/python3.8/dist-packages/scipy/stats/stats.py:7042: RuntimeWarning: divide by zero encountered in double_scalars z = (bigu - meanrank) / sd

litaifang commented 2 years ago

Without a trained classifier, SomaticSeq is just a consensus caller. By default, PASS is labeled with majority vote, e.g., at least 3 out of 5 callers that you have used. So it's totally possible that MuTect2 (and TNscope is based on MuTect2) alone would outperform a majority vote with other tools, especially if the data more closely resemble the kind of data MuTect2 was developed on. To get the full advantage of SomaticSeq, you'd need to create a classifier based on training data and ground truth label, and then use the trained classifier to classify variant calls from the ensemble.

Those warning messages are normal. They're just scipy warning messages when doing statistical test with an empty array (which will result in nan in the .tsv file).

yoda0612 commented 2 years ago

I'd like to verify the correction of my usage. First I engage the training mode by the following command:

/opt/somaticseq/somaticseq/run_somaticseq.py \ --output-directory /volume/cyvolume/somaticseq/output/SomaticSeq \ --genome-reference /volume/cyvolume/somaticseq/human_g1k_v37_decoy.fasta \ --inclusion-region /volume/cyvolume/somaticseq/output/genome.bed \ --dbsnp-vcf /volume/cyvolume/somaticseq/dbsnp_138.b37.vcf \ --algorithm xgboost \ --somaticseq-train \ --truth-snv /volume/cyvolume/somaticseq/KP_snvs.vcf \ --truth-indel /volume/cyvolume/somaticseq/KP_indels.vcf \ single \ --bam-file /volume/cyvolume/somaticseq/SRR13076390.realigned.bam \ --mutect2-vcf /volume/cyvolume/somaticseq/output/MuTect2.vcf \ --vardict-vcf /volume/cyvolume/somaticseq/output/VarDict.vcf \ --strelka-vcf /volume/cyvolume/somaticseq/output/Strelka/results/variants/variants.vcf.gz \ --arbitrary-snvs /volume/cyvolume/somaticseq/output/tnscope_snvs.vcf \ --arbitrary-indels /volume/cyvolume/somaticseq/output/tnscope_indels.vcf \ --lofreq-vcf /volume/cyvolume/somaticseq/output/LoFreq.vcf

After this I obtained the classifiers Ensemble.sSNV.tsv.xgb.v3.7.1.classifier and Ensemble.sINDEL.tsv.xgb.v3.7.1.classifier. Then, I used these classifiers to predict my data by the command: /opt/somaticseq/somaticseq/run_somaticseq.py \ --output-directory /volume/cyvolume/somaticseq/output/SomaticSeq \ --genome-reference /volume/cyvolume/somaticseq/human_g1k_v37_decoy.fasta \ --inclusion-region /volume/cyvolume/somaticseq/output/genome.bed \ --dbsnp-vcf /volume/cyvolume/somaticseq/dbsnp_138.b37.vcf \ --classifier-snv /volume/cyvolume/somaticseq/output/SomaticSeq/Ensemble.sSNV.tsv.xgb.v3.7.1.classifier \ --classifier-indel /volume/cyvolume/somaticseq/output/SomaticSeq/Ensemble.sINDEL.tsv.xgb.v3.7.1.classifier \ --algorithm xgboost \ single \ --bam-file /volume/cyvolume/somaticseq/SRR13076390.realigned.bam \ --mutect2-vcf /volume/cyvolume/somaticseq/output/MuTect2.vcf \ --vardict-vcf /volume/cyvolume/somaticseq/output/VarDict.vcf \ --strelka-vcf /volume/cyvolume/somaticseq/output/Strelka/results/variants/variants.vcf.gz \ --arbitrary-snvs /volume/cyvolume/somaticseq/output/tnscope_snvs.vcf \ --arbitrary-indels /volume/cyvolume/somaticseq/output/tnscope_indels.vcf \ --lofreq-vcf /volume/cyvolume/somaticseq/output/LoFreq.vcf

Finally, I obtained SSeq.Classified.sINDEL.vcf and SSeq.Classified.sSNV.vcf. After evaluation of SSeq.Classified.sINDEL.vcf and SSeq.Classified.sSNV.vcf, it seems no differences with the results of using no training mode. I wonder if worse results was caused by the inputs of the vcfs with worse results like produced by VarDict and LoFreq. In my case, the better results from individual callers are MuTect2 and Tnscope.

litaifang commented 2 years ago

Yes, this is the essence of creating a classifier, and then use that classifier on another data set (although in your test sample, it was used to classify itself). In training, variants listed in KP_snvs.vcf and KP_indels.vcf are considered true positives, and everything else will be considered false positives. The xgboost model will try to learn features associated with each class. Results from the predicted SSeq.Classified.sINDEL.vcf and non-predicted Ensemble.vcf should be different in terms of which variants are labeled as PASS. The VCF file will include almost everything, but only those labeled with PASS are "predicted as real mutations." The PASS labels should be different for the two runs because they used different methods to label PASS. One complication for single sample mode, is that does KP_snvs.vcf and KP_indels.vcf contain germline variants? Since there is no good way to differentiate between somatic variants and germline variants, if KP_snvs.vcf and KP_indels.vcf contain only somatic variants, then the model may not be able to learn because it cannot find features to distinguish somatic variants labeled as true positives and germline variants labeled as false positives.

yoda0612 commented 2 years ago

Yes, my KP_snvs.vcf and KP_indels.vcf contain only somatic variants. But I have a little bit confusion about why this will influence the results of prediction. After check the Ensemble.sXXX.tsv (I guess SomaticSeq using this tabular data to train the model), it labels the somatic variants as TP and others as FP correctly. And the results of SSeq.Classified.sXXX.vcf contain almost variants in the KP_snvs.vcf and KP_indels.vcf, but almost half amount of them were not labeled as "PASS". That's the reason of low precision and recall. I wonder why those variants were identified as LowQual or Reject.

yoda0612 commented 2 years ago

I am interested that if you have pre-trained model available for test? As you mentioned you recommended the data from SEQ2.

litaifang commented 2 years ago

It's possible those true somatic mutations were labeled as false positives because their allele fractions were low (or maybe too high), so that it's difficult for the model to distinguish between true mutations and false positives (that may be germline variants).

You can get a sense what the decision trees are by looking at the Ensemble.sSNV.tsv.xgb.v3.x.x.classifier.txt file.

One thing that may confuse the classifier, is that germline variants are considered to be false positives if your KP_snvs.vcf and KP_indels.vcf contain only somatic mutations, but germline variants are a very large part of the call set.

In a single-sample mode, there really isn't anything that can definitely rule a variant as a germline, but the classifier will try to do so anyway. A somatic mutation may look like a germline variant with data only from the tumor sequencing data. Since germline variants, and all the features associated with them, are labeled as false positives, somatic mutation that look like germline variants will also be predicted as false positives because there are more of them in the training data.

In paired tumor-normal data, there are normal data that the machine learning model can learn as germline variants. In single-sample mode, there is no such information to make that determination.

So I think you have two options 1) Add germline variants into your KP_snvs.vcf and KP_indels.vcf, or 2) Remove germline variants from model consideration (which you may do by having --exclusion-region regions_to_exclude.bed.

litaifang commented 2 years ago

I don't have pre-trained model at this time. Models are specific to the callers you have used.

But if I want to train for a single-sample mode (i.e., without matched normal), I will either exclude all the germline variants from the training set, or add germline variants as true positives in the KP_snvs.vcf and KP_indels.vcf files.

yoda0612 commented 2 years ago

I would like to report that after I check the contents of Ensemble.sXXX.tsv, I found even I had set the --truth-snv /volume/cyvolume/somaticseq/KP_snvs.vcf and --truth-indel /volume/cyvolume/somaticseq/KP_indels.vcf, the colunm TrueVariant_or_False in Ensemble.sXXX.tsv seems not fully refers the truth data. Becasue, some variants appeared in truth set were still labeled 0 in Ensemble.sXXX.tsv.

litaifang commented 2 years ago

Can you send me the KP_*.vcf and the Ensemble.sXXX.tsv files? I'll see if I got bugs somewhere. Thanks.

yoda0612 commented 2 years ago

Sure, please let me know your you email address. Thank you.

litaifang commented 2 years ago

You may send the files to thecalbear@gmail.com.

litaifang commented 2 years ago

Discrepancies due to some VCF files sorted differently.