bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
194 stars 53 forks source link

SomaticSeq for targeted panel data #57

Closed Tato14 closed 6 years ago

Tato14 commented 6 years ago

Hi! First of all, thank you for this awesome tool! I would like to know if I could use SomaticSeq for target panel sequencing data. Concretely, I am managing Ion Torrent panel data. We have ~50 patients with mostly all the detected variants validated which, I guess, is a good starting point for the training step. However, I am not sure if the tool would perform good in target rather than WGS or WES.

Any ideas? Thanks!

litaifang commented 6 years ago

The framework should work for Ion Torrent, although I don't know how the indel performance will be like. If you want to train classifiers for target panel, you need to combine data sets together to create a training set large enough, i.e, it should have at least 1000 true variants (you may get away with a few hundreds, but more is better) in the training set, plus a larger number of false positives. Having validated mutations are good for training. You also need to know which regions and samples have no mutation (or being reasonably confident of such), such that the variants "detected" there will be correctly labeled as false positives.

Tato14 commented 6 years ago

Dear @ltfang-bina , thanks for the info! I see in your documentation that these files should be in VCF format but I'm not sure how I should introduce the parametres for true/false positives and true negatives. Do you have any example? Thanks!

litaifang commented 6 years ago

Variants in --truth-snv and --truth-indel are true positives. Everything else is considered true negatives. You may also have --inclusion-region and/or --exclusion-region to include/exclude regions with BED files, e.g., to remove regions where you don't know what's true positives and true negatives, and/or to only include regions where you're pretty confident of them.

Tato14 commented 6 years ago

Dear @ltfang-bina. Does somaticseq supports only tumor data? Thanks!

litaifang commented 6 years ago

You mean tumor data without a matched normal? Our algorithms do support that, but with very limited validations and research on its performance. Also it cannot distinguish somatic mutations from germline variants. Instead of submitcallers.sh, use the singleSamplecallers.sh at https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines. Only call for callers that support single sample mode, i.e., mutect2, varscan2, vardict, lofreq, scalpel, and/or strelka.

Tato14 commented 6 years ago

Yes, I meant that :). Thank you so much for your huge help!

Tato14 commented 6 years ago

Hi @ltfang-bina. Sorry to open this discussion again but there is something I don't understand. I have no qsub option in the near future, so I'm performing all the callers separately. Is there any way to use SomaticSeq.Wrapper.sh on vcf files from the different single sample callers without matched normal (mutect2, varscan2, vardict, lofreq, scalpel, and/or strelka)? If so, which are the arguments I should use to specify the only tumor mode? Sorry if I'm missing or misunderstood something and thanks again for helping :)

litaifang commented 6 years ago

For single-sample mode, ssSomaticSeq.Wrapper.sh is the single-sample equivalent.

Tato14 commented 6 years ago

Thanks for the help! Now everything seem to work :) However, I'm trying with a positive control sample that I know that have a ch12:25398285 mutation. I tried to perform ssSomaticSeq.Wrapper.sh in the vcf from Mutect, VarScan and Lofreq in hg19 but the output didn't report the variant. Any ideas why is this happening? Files are attached. Thanks! Results_Callers.zip

litaifang commented 6 years ago

Thanks for bringing it to my attention. That was due to a bug in the ssSomaticSeq.Wrapper.sh script, such that the VCF merging step was not looking for the right VCF files, so nothing was collected. Fixed it in the new version v2.8.1.

Another suggestion though, it seems your Mutect2.vcf is actually unfiltered. You can filter your MuTect2 output with the following GATK4 commands: java -jar gatk.jar FilterMutectCalls --variant mutect2.vcf --output filtered.mutect2.vcf

It will attach flags such as PASS into filtered.mutect2.vcf, such that when a PASS flag is found in the FILTER column, SomaticSeq will properly count it as a MuTect2 call.

I used the Mutect2 in GATK4:

docker run --rm -v /:/mnt -u 1001 broadinstitute/gatk:4.0.5.2 \
java -Xmx7g -jar /gatk/gatk.jar Mutect2 \
--reference /mnt/SEQC2_Resources/GRCh38.d1.vd1.fa \
--intervals /mnt/genome.bed \
--input /mnt/tumor.markdup.indelRealigned.bam \
--input /mnt/tumor.markdup.indelRealigned.bam \
--normal-sample ${normal_name} \
--tumor-sample ${tumor_name} \
--native-pair-hmm-threads 4 \
--output /mnt/unfiltered.MuTect2.vcf

docker run --rm -u 1001 --memory 6g broadinstitute/gatk:4.0.5.2 \
java -Xmx7g -jar /gatk/gatk.jar FilterMutectCalls \
--variant /mnt/unfiltered.MuTect2.vcf \
--output /mnt/MuTect2.vcf
litaifang commented 6 years ago

Feel free to re-open if more issues turn up.