arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 118 forks source link

DP field not being parsed into gemini database for some variants #880

Closed mbootwalla closed 6 years ago

mbootwalla commented 6 years ago

Hello Gemini Team,

I am running into a strange problem when loading vcfs into gemini. The vcf that I have is generated using the bcbio ensemble tool and is based off of vcfs from freebayes, platypus, samtools and varscan for a given sample. From what I have been able to figure out the bcbio ensemble tool picks a variant from the first vcf file alphabetically in which it is present and meets the ensemble criteria. In most cases this is the freebayes vcf but in some cases it is either the platypus or the samtools vcf. For variants picked from the freebayes vcf the DP field is correctly parsed and entered into gemini's variant database's gt_depths field but for variants that come from the samtools or platypus vcfs it is unable to do so and this results in the gt_depths value being populated as -1. This causes a problem downstream when I am trying to filter out variants based on their depth of coverage.

I am aware of why the parsing issue happens with the platypus vcf. This is because platypus does not use the DP field to represent the depth of coverage. Instead it uses a field called NR in the FORMAT column. I will fix this for variants coming from platypus by adding the DP and AD fields to the FORMAT column. But for the variants coming from the samtools vcf there is a DP field in the FORMAT column and still those variants are incorrectly parsed and the gt_depths field is populated as -1. Can you kindly fix this behavior so that the genotype_depths field is correctly populated?

Version of gemini used is 0.20.1 that is bundled with bcbio_nextgen version 1.0.5a0

Here is an example of this:

Original VCF entry: 1 12856111 . C T 21.54 PASS AC=1;AN=2;BQB=0.995576;CALLERS=samtools,varscan;DP=105;DP4=10,64,5,16;HOB=0.5;ICB=1;MQ=25;MQ0F=0.114286;MQB=0.603805;MQSB=0.889129;RPB=0.999638;SGB=-0.692352;VDB=0.332872 GT:PL:DP 0/1:56,0,223:95

Entry in gemini database for this variant: chrom start ref alt qual filter type aaf_exac_all aaf_gnomad_all gts.S1 gt_depths.S1 gt_ref_depths.S1 gt_alt_depths.S1 chr1 12856110 C T 21.5400009155 None snp 0.484 0.488278 C/T -1 -1 -1

Thanks,

Moiz

brentp commented 6 years ago

cyvcf2 and therefore gemini use either AD or the RO and AO fields as defined in the VCF spec. Since your VCF doesn't have either of those, it is not supported. You can either artificially add those or use a different caller.

mbootwalla commented 6 years ago

Hi Brent,

Thank you for your response.

I had a further query regarding the DP field. Why isn't the DP tag in the FORMAT field being used? The FORMAT/DP value is the sum of AO and RO and in cases where those values are not provided by the caller the DP field can be used to atleast correctly provide the gt_depths value. This is the understanding I got from the Gemini documentation where it says that gt_depths value is obtained from the DP field and ref and alt depths are obtained from the AD field. Since the DP tag is present for each sample can't that be used to obtain the gt_depths field? I don't really care about the ref and alt depths. The gt_depths field is the main one. Can this be accomplished through cyvcf2? This would be tremendously helpful if the functionality could be provided.

Thanks,

Moiz