bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

ensemble numpass confusion #1018

Closed by parlar 9 years ago

parlar commented 9 years ago

When running ensemble variant calling using "numpass: 1", I notice that variants identified by only one caller, but which fail filters, are still included in the final call set as PASS. Perhaps this is also true for higher numpass settings? Is this intentional? If so, the name of the setting (numpass) becomes somewhat confusing.

It would also be great to have information in the ensemble VCF INFO field on which callers actually picked up a variant.

cheers, Pär

chapmanb commented 9 years ago

Pär; Sorry about the confusion with this. For calling on germline samples, the current approach in bcbio is to keep filtered variants, with the idea of providing evidence in cases where a caller "almost" identifies a variant, to support a site where it is called by another caller. When validating with a larger number of callers (3+) this helped improve sensitivity.

Practically though, I've been contemplating deprecating ensemble calling for germline samples. FreeBayes and HaplotypeCaller are quite good and the ensemble only provides a small improvement over them at the cost of more complex integration maintenance and fixes.

In your case specifically, a numpass: 1 setting would basically take a union of all called variants, including pulling in all false positives from all callers. Is this your intention with this setting? I'm just trying to understand the use cases and decide the best plan moving forward.
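
As a rough illustration of the behaviour discussed above, here is a minimal Python sketch of numpass-style voting. It is not bcbio's actual implementation: the function name `ensemble_calls`, the `use_filtered` argument, and the input layout (caller name mapped to variants and their FILTER status) are all hypothetical, chosen only to show why "numpass: 1" combined with keeping filtered calls amounts to a union of everything any caller reported.

```python
from collections import defaultdict


def ensemble_calls(calls_by_caller, numpass, use_filtered=True):
    """Return {variant: [supporting callers]} for variants with enough support.

    calls_by_caller: {caller_name: {variant_key: filter_status}} (hypothetical layout)
    numpass: minimum number of supporting callers required to keep a variant
    use_filtered: if True, calls that failed a caller's filters still count as
                  support ("rescue"); if False, only PASS calls count.
    """
    support = defaultdict(list)
    for caller, calls in sorted(calls_by_caller.items()):
        for variant, filter_status in calls.items():
            if use_filtered or filter_status == "PASS":
                support[variant].append(caller)
    return {v: callers for v, callers in support.items() if len(callers) >= numpass}


# Toy input: one variant called by both callers, one filtered call from a single caller.
calls = {
    "freebayes": {("chr1", 100, "A", "T"): "PASS", ("chr1", 200, "G", "C"): "LowQual"},
    "gatk-haplotype": {("chr1", 100, "A", "T"): "PASS"},
}

union_with_rescue = ensemble_calls(calls, numpass=1, use_filtered=True)
pass_only = ensemble_calls(calls, numpass=1, use_filtered=False)
two_callers = ensemble_calls(calls, numpass=2, use_filtered=True)
```

In this toy input, the filtered single-caller variant at chr1:200 survives only when filtered calls are counted and numpass is 1; raising numpass to 2, or ignoring filtered calls, drops it.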

For somatic calling, we currently still find value in ensemble validations and remove filtered variants, as leaving them in keeps a lot of germline noise. Longer term, we're hoping to evaluate and move to another ensemble caller such as SomaticSeq (https://github.com/bioinform/somaticseq), to continue to contribute to other open source projects.

Happy to hear your thoughts and hope this helps some with current thinking on ensemble calling.

parlar commented 9 years ago

Hi Brad,

Here at Umeå University Hospital, Sweden, we're using bcbio-nextgen for clinical variant calling of germline variants using a Haloplex panel. Variants likely to be disease-causing are first checked by visual inspection of the alignments and later also confirmed by Sanger. So, I'm more afraid of false negatives (poor sensitivity) than of false positives. But it's true that "numpass 1" gives many false positives, so I've actually switched to "numpass 2" (four callers).

A general problem for most labs (if not all) in this area is availability of well-characterized reference materials. Although we have validated the pipeline against samples where we also have Sanger data, my opinion is that these validation sets are too small. It's better to be on the safe side and generate false positives that are filtered out downstream rather than missing something potentially important.

I've made a system here that integrates bcbio-nextgen with a local variant database and Alamut, a tool used by most clinical genetics labs for interpretation, I believe. The system annotates the variants using various databases, including the local one and annotations made by the team in Alamut, and generates a report in Excel format that includes all variants relevant to the medical hypothesis/indication (based on gene lists), other relevant QC information, and provenance. Hyperlinks in the Excel sheets then allow us to quickly open BAM files and regions in Alamut for assessment of variants and quality. Simplistic, but it works. Planning to put everything on GitHub and submit a paper.

chapmanb commented 9 years ago

Pär; Thanks much for the details, this helps so much in understanding your use case. I'd still like to think about larger scale validations to determine the utility of ensemble, but it seems like we can try a few tweaks to improve your workflow now:

The clinical reporting system is really interesting and I know it would be of use to many users with similar needs. I'd love to include links to it from the documentation when it's available, and I'm also happy to try and make changes to bcbio to fit better with this type of downstream annotation and reporting. Thanks again.

parlar commented 9 years ago

Brilliant! We are really grateful for your help and effort!

Maybe "rescuing" is still a good idea if details were provided in the INFO field on the filters?

Yes, I figured that others might be interested in the clinical interpretation / reporting system as well. There are some other alternatives out there (Scout from Scilife for example), each with their own merits and limitations. Alamut integration was important for us, and flexibility. Diversity is a good thing I guess.

ohofmann commented 9 years ago

Pär, a +1 from me; I'm excited about the clinical reporting code being made available. Even if it's just a starting point, it would make clinicians happy here in Glasgow/NHS.

schelhorn commented 9 years ago

We'd like to have a look at it, too, +1

parlar commented 9 years ago

@ohofmann @schelhorn @chapmanb: Thanks for your interest. We're using it in production, but I have some tidying up and documentation to take care of for the repo. Will let you know when things are ready.

chapmanb commented 9 years ago

Pär; Apologies about the delay in finalizing this, but the latest development version of bcbio now avoids trying to "rescue" filtered variants for germline and annotates ensemble calls with a CALLERS tag that indicates the supporting callers. You can update with:

bcbio_nextgen.py upgrade -u development --tools

Hope this provides what you need and please let us know if you have any problems at all. Thanks again and looking forward to talking more about the reporting system when that's ready.
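
For downstream reporting, the CALLERS annotation can be read directly from the INFO column. The following is a minimal sketch rather than part of bcbio: the helper name `callers_from_info` is hypothetical, and it assumes the tag value is a comma-separated list of caller names, following the usual VCF convention for multi-valued INFO fields.

```python
def callers_from_info(info):
    """Extract caller names from a VCF INFO string.

    Assumes a CALLERS=name1,name2 entry (comma-separated value is an assumption).
    Returns an empty list when the tag is absent or has no value.
    """
    for field in info.split(";"):
        # Flag-style INFO entries (e.g. "DB") have no "=", so value is "" for them.
        key, _, value = field.partition("=")
        if key == "CALLERS":
            return value.split(",") if value else []
    return []


# Example INFO strings (made up for illustration).
supported = callers_from_info("DP=35;CALLERS=freebayes,gatk-haplotype")
missing = callers_from_info("DP=35;DB")
```

A report could then, for example, flag variants where `len(supported)` is 1 for closer visual inspection.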

parlar commented 9 years ago

Many thanks! Will test asap!

Working on cleaning up the code for the reporting system and documentation. Will let you know as soon as things are reasonably tidy.