arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

Enhancement Request: --format vcf #346

Closed drmjc closed 9 years ago

drmjc commented 9 years ago

Motivated by your --format tped or --format json feature, we'd like to request a --format vcf feature. This would be extremely useful for using gemini to filter to a subset of relevant variants, which can then be used in downstream prioritisation tools. The resulting VCF doesn't necessarily need to contain all the INFO and FORMAT fields that would have been in the imported VCF, but at least a valid header and GT would be a great starting point. Of course DP, GQ and FILTER=PASS etc would be great too.

cheers, Mark C & Tony Roscioli

arq5x commented 9 years ago

Ah, this may be easier than I thought. My understanding was that you wanted to be able to create a VCF with a complex INFO field populated by all of the columns that were in the query.

drmjc commented 9 years ago

I've just had a closer look at your db schema, to keep this feature request as simple as possible. chrom, start, vcf_id, ref, alt, qual, info, filter, gts are already tracked in the variants table. I see you store the original INFO field as a BLOB, which is an unexpected bonus, though its easy to see how this data would get out of sync if you start subsetting samples. Thus its perhaps not worth including the real INFO data, or flagging it in the docs as 'the original' INFO.

This should already be useful as input for downstream prioritisation tools based on phenotype (eg PhenIX or exomiser).

I could help with reverse engineering the FORMAT:AD, FORMAT:DP and FORMAT:GQ fields from the relevant gt* fields.

cc2qe commented 9 years ago

+1

dakl commented 9 years ago

+1

arq5x commented 9 years ago

I definitely agree this would be very useful. @brentp - any interest in working on this as your first contribution to GEMINI? ;)

arq5x commented 9 years ago

I took a first pass on this, based on a discussion today with Mark Cowley, Tony Roscioli and others. The --format vcf option is now operational, albeit with a few caveats that are laid out below. Consider the following VCF:

cat vcfspec.4.2.vcf
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20  14370   rs6054257   G   A   29  PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20  17330   .   T   A   3   q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20  1110696 rs6040355   A   G,T 67  PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB   GT:GQ:DP:HQ 1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
20  1230237 .   T   A   47  PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
20  1234567 microsat1   GTC G,GTCT  50  PASS    NS=3;DP=9;AA=G  GT:GQ:DP    0/1:35:4    0/2:17:2    1/1:40:3

Once loaded, I can query the database and get something very close to the original:

# load
gemini load -v vcfspec.4.2.vcf vcfspec.4.2.vcf.db

 # query with vcf output
gemini query --format vcf -q "select 1 from variants" vcfspec.4.2.vcf.db
chr20   14370   rs6054257   G   A   29.0    PASS    H2=True;NS=3;DB=True;DP=14;AF=0.5   GT  0|0 1|0 1/1
chr20   17330   .   T   A   3.0 q10 NS=3;DP=11;AF=0.017 GT  0|0 0|1 0/0
chr20   1110696 rs6040355   A   G,T 67.0    PASS    AA=T;NS=2;DB=True;DP=10;AF=0.333,0.667  GT  1|0 1|0 1/1
chr20   1230237 .   T   A   47.0    PASS    AA=T;NS=3;DP=13 GT  0|0 0|0 0/0
chr20   1234567 microsat1   GTC G,GTCT  50.0    PASS    AA=G;NS=3;DP=9  GT  0/1 0/1 1/1

# add the header
gemini query --header --format vcf -q "select 1 from variants" vcfspec.4.2.vcf.db
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003    
chr20   14370   rs6054257   G   A   29.0    PASS    H2=True;NS=3;DB=True;DP=14;AF=0.5   GT  0|0 1|0 1/1
chr20   17330   .   T   A   3.0 q10 NS=3;DP=11;AF=0.017 GT  0|0 0|1 0/0
chr20   1110696 rs6040355   A   G,T 67.0    PASS    AA=T;NS=2;DB=True;DP=10;AF=0.333,0.667  GT  1|0 1|0 1/1
chr20   1230237 .   T   A   47.0    PASS    AA=T;NS=3;DP=13 GT  0|0 0|0 0/0
chr20   1234567 microsat1   GTC G,GTCT  50.0    PASS    AA=G;NS=3;DP=9  GT  0/1 0/1 1/1

The key differences of note are:

  1. Chromosome labels are prefxed with "chr". Fixable.
  2. QUAL scores have decomal precision. Difficult.
  3. The INFO field is in a different order (this could perhaps be fixed with an OrderedDict)
  4. The original genotype formats are culled to simply the GT.
  5. Genotypes for variants with multiple alleles are not properly handled. This may only be resolved once proper support for multiple alleles is added to GEMINI.
  6. It clearly provides no support for updating the original INFO field with GEMINI annotations. Futire work.

While not perfect, I think this may serve as a good start for passing query results to downstream tools that require VCF as input.

drmjc commented 9 years ago

Thanks Aaron, you did say you were itching to get back to coding! So much more fun than grants right? This is an excellent start! We'll test the feature and report back

Cheers, Mark

roryk commented 9 years ago

Thanks so much for jumping in and doing this, Aaron, that is awesome.

arq5x commented 9 years ago

Well, I should have done it sooner. It is indeed useful, ableit imperfect. I think maintaining the two biggest issues with the functionality is that (1), the genotypes for variants with multiple alleles are incorrect and (2), the INFO field's original order is not preserved (pickled as a dictionary instead of an ordered dictionary). I will be interested to see how useful the current output is...

dakl commented 9 years ago

I agree that this is highly useful. Thanks for implementing it.

Re the two outstanding issues, I think (2) doesn’t matter. According the VCF 4.x spec(s), the order of the fields doesn’t matter. Most GATK tools reorder them alphabetically.

(1) on the other hand is important. It is however not an easy problem. Or rather, it should be (but isn’t, long reason below) a fairly straight forward issue of checking in the VCF header what the count type is,

Number=1 = a single value for all given alleles Number=A = one value for each ALT allele Number=. = unbounded/unknown number of values (ex mapped gene, can be 1, can be 2 overlapping) Number=R (new in VCF4.2 spec) one value per allele including ref

and select the appropriate value from the attribute value array.

The issue is that not all VCFs adhere to the spec.

ExAC v0.3 says that DP_HIST and GQHIST are of type A, when in fact they should be of type R. A set of attributes named Het* are of type A in the header, but should be “.” (unbounded). (this will be fixed, https://github.com/konradjk/exac_browser/issues/164).

In ClinVar v20150203, the CLNSIG attribute is set to type “.” in the header. For most multiallelic variants, there is one field per ALT allele (which makes sense - different alleles can have different effect). But not all. Some multiallelic variants have a single CLNSIG value (for the curious, see 13:32911658). How should that be interpreted?

I don’t mean to bash on the great people who made all these resources available to the public, they’ve done a terrific job.

The point is that they are hard if not impossible to handle correctly when the external VCFs either do not adhere to spec, or “lazily” set "Number=.” which makes mapping to individual alleles impossible.

Oh, and the there’s the issue with matching called ref/alt alleles to external resources when there are both INSs and DELs in the same position. Not impossible, just challenging… ;)

Personally, I split external resource VCFs so there’s a single ALT allele on each line, and then left align and trim the variants. Doing the same on my called variants solves this, makes for strange genotypes that can be split over multiple rows (suboptimal, but ensures correct mapping of external annotations).

cheers Daniel

On 11 Feb 2015, at 14:02, Aaron Quinlan notifications@github.com wrote:

Well, I should have done it sooner, as it is indeed useful, ableit imperfect. I think maintaining the two biggest issues with the functionality is that (1), the genotypes for variants with multiple alleles are incorrect and (2), the INFO field's original order is not preserved (pickled as a dictionary instead of an ordered dictionary). I will be interested to see how useful the current output is...

— Reply to this email directly or view it on GitHub https://github.com/arq5x/gemini/issues/346#issuecomment-73877285.

drmjc commented 9 years ago

good points @dakl. Do you mind sharing which tools you're using for splitting, left aligning and trimming variants? vcfallelicprimitives?

roryk commented 9 years ago

@chapmanb has something nice to normalize the variant file that might get you to where you want to be:

https://github.com/chapmanb/bcbio.variation

dakl commented 9 years ago

I'm using used a combination of bcftools norm and GATK LeftAlignAndTrimVariants:

bcftools norm -m - input.vcf.gz | bgzip > split.vcf.gz 
tabix -p vcf split.vcf.gz 
java -jar $GATKJAR -T LeftAlignAndTrimVariants -R $REF -V split.vcf.gz -o split-left.vcf.gz -trim

Note the -trim parameter in the GATK cmd which will trim indels.

Example input.vcf.gz:

17  17697093    .   CCAGCAGCAGCAG   CCAGCAGCAG,C    .   .   AC=1,1;AF=0.125,0.125;AN=8;set=variant75-variant82  GT  ./.

split.vcf.gz

17  17697093    .   CCAGCAGCAGCAG   CCAGCAGCAG  .   .   AC=1;AF=0.125;AN=8;set=variant75-variant82  GT  ./.
17  17697093    .   CCAGCAGCAGCAG   C   .   .   AC=1;AF=0.125;AN=8;set=variant75-variant82  GT  ./.

split-left.vcf.gz:

17  17697093    .   CCAG    C   .   .   AC=1;AF=0.125;AN=8;set=variant75-variant82  GT  ./.
17  17697093    .   CCAGCAGCAGCAG   C   .   .   AC=1;AF=0.125;AN=8;set=variant75-variant82  GT  ./.

I'll test bcbio.variation as well.

dakl commented 9 years ago

Btw, bcftools splits the INFO field correctly but drops any sample columns, so it's useful for annotation/resource VCFs only. For splitting of genotypes I use https://github.com/moonso/vcf_parser which is fairly slow (a lot slower than bcftools) but this is usually no problem since I run it on a single sample or a T/N pair with a relatively limited number of variants. (vcf_parser works nicely on resource vcfs too, but is way slower for boatloads of variants)

GATK LeftAlignAndTrimVariants has a -split option which splits multiallelic variants, but it doesn't split the INFO field correctly, and sets all genotypes for split variants to ./..

drmjc commented 9 years ago

thanks for the detailed code and pointers!

arq5x commented 9 years ago

Closing, as in my opinion, this has been resolved.