bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

questions on paired tumor template #2796

Closed maximillo closed 5 years ago

maximillo commented 5 years ago

Hi Brad,

The latest template for paired tumor (https://github.com/bcbio/bcbio-nextgen/blob/master/config/templates/tumor-paired.yaml) uses vardict, mutect2 and strelka2. I wonder why freebayes and varscan are excluded. I tried including freebayes and varscan as well and it ran through. But I'm not sure what guidelines I should follow to obtain a more "reliable" set of variants. In one of my tests with paired prostate tumor samples, each caller reported ~30,000 SNVs and the ensemble was down to a few hundred. Obviously, they did not overlap very well. This made me wonder which caller I should trust more, if any. What would you do if you had to QC the variant calling results in a clinical setting? I know this is probably a recurring question for you, so I'm sorry if it sounds boring and too broad. Thanks a lot!
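To put a number on how well the callers overlap, one option is to count how many callers support each variant. The following is only a minimal sketch: `variant_keys`, `support_counts` and `support_histogram` are hypothetical helper names, the records are toy data, and variants are matched on exact (chrom, pos, ref, alt). That exact matching is an assumption; callers can represent the same indel or multiallelic site differently, so real VCFs would need to be normalized first or part of the disagreement will be a representation artifact.

```python
from collections import Counter


def variant_keys(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF-formatted lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, alt))
    return keys


def support_counts(calls_by_caller):
    """Map each variant key to the number of callers that reported it."""
    counts = Counter()
    for keys in calls_by_caller.values():
        counts.update(keys)
    return counts


def support_histogram(calls_by_caller):
    """Return {number_of_supporting_callers: number_of_variants}."""
    return dict(Counter(support_counts(calls_by_caller).values()))


# Toy records (made up for illustration), one list per caller.
calls = {
    "vardict": variant_keys(["chr1\t100\t.\tA\tT", "chr1\t200\t.\tG\tC"]),
    "mutect2": variant_keys(["chr1\t100\t.\tA\tT", "chr2\t50\t.\tC\tT"]),
    "strelka2": variant_keys(["chr1\t100\t.\tA\tT", "chr1\t200\t.\tG\tC"]),
}
histogram = support_histogram(calls)
for n_callers in sorted(histogram, reverse=True):
    print(f"{histogram[n_callers]} variant(s) reported by {n_callers} caller(s)")
```

A histogram dominated by single-caller variants would match the picture described above, where each caller reports many SNVs but few survive the ensemble.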

Max

matthdsm commented 5 years ago

Hi,

If I were you, I'd compare the variants found by each variant caller against some kind of in-house truth set, analogous to the GIAB (Genome in a Bottle) approach. Take a sample with verified variants, sequence it, and then run the various variant callers on it. The one with the most hits wins. We did this for an in-house pipeline, and in our case, vardict won the race.
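A comparison like this can be sketched in a few lines. The example below is a hedged illustration, not the pipeline we used: `score_caller` and `rank_callers` are hypothetical names, the variants are toy (chrom, pos, ref, alt) tuples, and matching is by exact key. It also reports precision next to the raw hit count, which is my own addition, since a caller can collect many hits simply by calling a lot of false positives.

```python
def score_caller(called, truth):
    """Compare a set of called variants with a truth set of verified variants."""
    tp = len(called & truth)   # hits: called and verified
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # verified but missed by the caller
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


def rank_callers(calls_by_caller, truth):
    """Return (caller, scores) pairs sorted by F1, best first."""
    scored = [(name, score_caller(called, truth))
              for name, called in calls_by_caller.items()]
    return sorted(scored, key=lambda item: item[1]["f1"], reverse=True)


# Toy truth set and call sets (made up for illustration).
truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
         ("chr2", 50, "C", "T"), ("chr3", 75, "T", "G")}
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
                 ("chr2", 50, "C", "T"), ("chr4", 10, "G", "A")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")},
}
for name, scores in rank_callers(calls, truth):
    print(f"{name}: hits={scores['tp']} precision={scores['precision']:.2f} "
          f"recall={scores['recall']:.2f} f1={scores['f1']:.2f}")
```

Ranking by F1 is one possible choice; depending on whether missed variants or false calls are more costly for you, ranking by recall or precision alone may be more appropriate.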

Hope this helps. Cheers M

chapmanb commented 5 years ago

Max -- thanks for this discussion. Matthias is exactly right on. It's very difficult to give general recommendations because the answers can depend on your sample types. If you're doing clinical sequencing you need truth sets and to ensure that you're getting good results in your environment with representative samples. If it's helpful here are my general thoughts on useful callers to start with for a number of cases:

https://github.com/chapmanb/bcbb/blob/master/talks/bcbio2019_recomendations/bcbio2019_recommendations.pdf

For paired analyses: vardict, mutect2 and strelka2 are great choices to start with for validating your samples. Thanks again.

maximillo commented 5 years ago

Hi Matthias, Hi Brad,

Many thanks for your advice! Would you be able to elaborate a little on the "truth set"? I understand that in some cases people simulate reads with given variants, but I don't know what could serve as a truth set in a clinical setting. A typical targeted-panel DNA sequencing service provider would generally submit a clinical report to clinicians, which includes a minimal set of filtered/selected SNVs, indels, CNVs etc. without giving details on how they were called and filtered. Could this be treated as the truth set (of course, we'd have to assume they know exactly what they are doing)?

Max

matthdsm commented 5 years ago

Hi,

A truth set can be any data with known and verified variants. For ours, we used a particularly difficult sample, which we had already sequenced and studied several times. So yes, you could use the variants you got from the company, but I'd take great care with filtering and processing, as this can have a huge effect on your data. In my experience, it's best to eliminate as many factors as you can before evaluating a variant caller. Here we used the same prep kit and method and the same sequencer for our "truth set" and our sample set.
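One factor that can be removed on the analysis side is region coverage: if the truth set only covers a target panel, calls outside those regions would otherwise all count against the caller. The sketch below is an assumption on my part about how to do that, not something from our pipeline; `in_regions` and `restrict` are hypothetical helpers, regions are BED-style (chrom, start, end) tuples, and the variants are toy data.

```python
def in_regions(variant, regions):
    """Return True if the variant position lies inside any target region.

    BED intervals are 0-based and half-open, VCF positions are 1-based,
    so a region (start, end) covers positions start + 1 through end.
    """
    chrom, pos, _ref, _alt = variant
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def restrict(variants, regions):
    """Keep only the variants that fall inside the target regions."""
    return {v for v in variants if in_regions(v, regions)}


# Toy panel region and calls (made up for illustration).
panel = [("chr1", 99, 150)]  # covers chr1 positions 100..150
called = {("chr1", 100, "A", "T"), ("chr1", 151, "G", "C"),
          ("chr2", 120, "C", "T")}
print(sorted(restrict(called, panel)))
```

Applying the same restriction to both the truth set and every caller's output before comparing keeps the evaluation limited to positions where a verified answer actually exists. The linear scan over regions is fine for a small panel; a large region list would call for an interval index instead.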

Hope this helps M

maximillo commented 5 years ago

Hi Matthias,

Thank you so much for your input! Yes, this definitely helps a lot in my search for the right direction of validating my calling result. I'm closing this thread now.

Max