bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

bcbio-nextgen / gemini integration #700

Closed parlar closed 9 years ago

parlar commented 9 years ago

In data from Gemini, loaded by bcbio-nextgen, I notice that multiallelic variants are reported with only a single genotype with a single variant. What happens to the other variant? Is this an issue with bcbio-nextgen, Gemini, or an issue at all?

chapmanb commented 9 years ago

Pär; Thanks for raising this issue. The problem is that GEMINI doesn't currently handle multi-allelic variants. One approach we could take is to try to break up the multi-alleles into separate items before feeding to GEMINI:

https://github.com/ekg/vcflib#vcfbreakmulti

This is not perfect since it results in slightly different representations between the VCF and GEMINI, but would at least avoid dropping variants. What do you think?
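To illustrate what breaking up a multi-allelic record means, here is a minimal Python sketch. It is not vcflib's implementation: the function name `split_multiallelic` and the example record are made up for illustration, INFO is copied unchanged, and any FORMAT/sample columns are dropped, which is exactly the part a real tool has to handle with care.

```python
def split_multiallelic(vcf_line):
    """Split one VCF data line with several ALT alleles into one line per ALT.

    Simplification (assumption for illustration): only the eight fixed
    columns are handled; INFO is copied unchanged and FORMAT/sample
    columns are dropped.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return [
        "\t".join([chrom, pos, vid, ref, single_alt, qual, flt, info])
        for single_alt in alt.split(",")
    ]


# Hypothetical tri-allelic site: REF A with ALT alleles C and T.
record = "chr1\t1000\t.\tA\tC,T\t50\tPASS\tDP=30"
for line in split_multiallelic(record):
    print(line)
```

Each output line now describes a single ALT allele at the same position, which is why the split file no longer matches the original VCF line for line.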

parlar commented 9 years ago

Hi again, Brad!

Then I guess it would be better to break up the multi-alleles before loading them into Gemini, as you suggest. But perhaps we should keep the plain VCFs intact, for other tools that can handle multi-allelic variants correctly?

//Pär

parlar commented 9 years ago

Hi again,

We are eager to move the bcbio-nextgen pipeline into production in our clinical genetics lab. One principal issue is the "gender problem", where variants can be reported as heterozygous despite being on chrX in males. Another issue concerns Gemini's inability to handle multi-allelic states. The "gender problem" we can cope with. The Gemini issue, however, requires fixing. I realize that you are a very busy person. Can I help you in any way with this? What needs to be done, I guess, is simply to generate temporary VCFs using vcfbreakmulti, use those for import into Gemini, and then remove the temporary VCFs afterwards?
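The temporary-file workflow described above could be sketched roughly as follows. This is only an outline under stated assumptions: `load_via_temporary_vcf`, `decompose` and `load` are hypothetical names, and the two callables stand in for whatever splitting tool and GEMINI import step are actually used. The original VCF is never modified.

```python
import os
import tempfile


def load_via_temporary_vcf(vcf_path, decompose, load):
    """Decompose vcf_path into a temporary file, load it, then delete it.

    `decompose(src, dst)` and `load(path)` are hypothetical callables
    (assumptions for illustration) wrapping the multi-allelic splitting
    step and the database import step.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".vcf")
    os.close(fd)  # the callables reopen the file by path
    try:
        decompose(vcf_path, tmp_path)
        return load(tmp_path)
    finally:
        # Remove the temporary VCF even if decomposition or loading fails.
        os.remove(tmp_path)
```

Passing the two steps in as callables keeps the cleanup logic independent of which splitting tool ends up being chosen.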

chapmanb commented 9 years ago

Pär; Thanks much for your patience and for helping prioritize the outstanding requests. You have a lot of good requests, but all of them are more in-depth development projects, so it helps to know which ones are blocking your usage of bcbio.

I pushed an update which will decompose multi-allelic inputs into bi-allelic prior to feeding into GEMINI. I evaluated 4 different approaches to do this, and the approach in vt (https://github.com/atks/vt) handles the widest range of inputs and correctly resets FORMAT/genotype likelihoods so they match the bi-allelic state. It doesn't redo other FORMAT annotations like depth, so these are all removed so as not to be incorrect in the resulting output.
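As a rough illustration of what resetting genotype likelihoods to the bi-allelic state can involve, the Python sketch below selects, for one ALT allele, the three diploid entries that involve only the reference and that allele. It relies on the genotype ordering defined in the VCF specification (genotype j/k with j <= k sits at index k(k+1)/2 + j). The function names and the example values are assumptions for illustration, no renormalization is attempted, and this is not claimed to be vt's exact logic.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k (j <= k) in a VCF PL/GL array."""
    return k * (k + 1) // 2 + j


def biallelic_pl(pl, alt_index):
    """Keep only the likelihoods for 0/0, 0/a and a/a, where a = alt_index."""
    a = alt_index
    return [pl[pl_index(0, 0)], pl[pl_index(0, a)], pl[pl_index(a, a)]]


# Hypothetical tri-allelic PL values, in the order
# 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
pl = [120, 60, 90, 0, 30, 80]
print(biallelic_pl(pl, 1))  # [120, 60, 90]
print(biallelic_pl(pl, 2))  # [120, 0, 80]
```

The 1/2 entry has no place in either bi-allelic record, which is one reason the decomposed representation differs from the original call.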

Hope this works for your usage. Please let us know if you run into any issues and thanks again.