Closed drmjc closed 9 years ago
Ah, this may be easier than I thought. My understanding was that you wanted to be able to create a VCF with a complex INFO field populated by all of the columns that were in the query.
I've just had a closer look at your db schema, to keep this feature request as simple as possible.
chrom, start, vcf_id, ref, alt, qual, info, filter, gts
are already tracked in the variants table.
I see you store the original INFO
field as a BLOB, which is an unexpected bonus, though its easy to see how this data would get out of sync if you start subsetting samples. Thus its perhaps not worth including the real INFO
data, or flagging it in the docs as 'the original' INFO
.
This should already be useful as input for downstream prioritisation tools based on phenotype (eg PhenIX or exomiser).
I could help with reverse engineering the FORMAT:AD
, FORMAT:DP
and FORMAT:GQ
fields from the relevant gt* fields.
+1
+1
I definitely agree this would be very useful. @brentp - any interest in working on this as your first contribution to GEMINI? ;)
I took a first pass on this, based on a discussion today with Mark Cowley, Tony Roscioli and others. The --format vcf
option is now operational, albeit with a few caveats that are laid out below. Consider the following VCF:
cat vcfspec.4.2.vcf
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T A 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Once loaded, I can query the database and get something very close to the original:
# load
gemini load -v vcfspec.4.2.vcf vcfspec.4.2.vcf.db
# query with vcf output
gemini query --format vcf -q "select 1 from variants" vcfspec.4.2.vcf.db
chr20 14370 rs6054257 G A 29.0 PASS H2=True;NS=3;DB=True;DP=14;AF=0.5 GT 0|0 1|0 1/1
chr20 17330 . T A 3.0 q10 NS=3;DP=11;AF=0.017 GT 0|0 0|1 0/0
chr20 1110696 rs6040355 A G,T 67.0 PASS AA=T;NS=2;DB=True;DP=10;AF=0.333,0.667 GT 1|0 1|0 1/1
chr20 1230237 . T A 47.0 PASS AA=T;NS=3;DP=13 GT 0|0 0|0 0/0
chr20 1234567 microsat1 GTC G,GTCT 50.0 PASS AA=G;NS=3;DP=9 GT 0/1 0/1 1/1
# add the header
gemini query --header --format vcf -q "select 1 from variants" vcfspec.4.2.vcf.db
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
chr20 14370 rs6054257 G A 29.0 PASS H2=True;NS=3;DB=True;DP=14;AF=0.5 GT 0|0 1|0 1/1
chr20 17330 . T A 3.0 q10 NS=3;DP=11;AF=0.017 GT 0|0 0|1 0/0
chr20 1110696 rs6040355 A G,T 67.0 PASS AA=T;NS=2;DB=True;DP=10;AF=0.333,0.667 GT 1|0 1|0 1/1
chr20 1230237 . T A 47.0 PASS AA=T;NS=3;DP=13 GT 0|0 0|0 0/0
chr20 1234567 microsat1 GTC G,GTCT 50.0 PASS AA=G;NS=3;DP=9 GT 0/1 0/1 1/1
The key differences of note are:
While not perfect, I think this may serve as a good start for passing query results to downstream tools that require VCF as input.
Thanks Aaron, you did say you were itching to get back to coding! So much more fun than grants right? This is an excellent start! We'll test the feature and report back
Cheers, Mark
Thanks so much for jumping in and doing this, Aaron, that is awesome.
Well, I should have done it sooner. It is indeed useful, ableit imperfect. I think maintaining the two biggest issues with the functionality is that (1), the genotypes for variants with multiple alleles are incorrect and (2), the INFO field's original order is not preserved (pickled as a dictionary instead of an ordered dictionary). I will be interested to see how useful the current output is...
I agree that this is highly useful. Thanks for implementing it.
Re the two outstanding issues, I think (2) doesn’t matter. According the VCF 4.x spec(s), the order of the fields doesn’t matter. Most GATK tools reorder them alphabetically.
(1) on the other hand is important. It is however not an easy problem. Or rather, it should be (but isn’t, long reason below) a fairly straight forward issue of checking in the VCF header what the count type is,
Number=1 = a single value for all given alleles Number=A = one value for each ALT allele Number=. = unbounded/unknown number of values (ex mapped gene, can be 1, can be 2 overlapping) Number=R (new in VCF4.2 spec) one value per allele including ref
and select the appropriate value from the attribute value array.
The issue is that not all VCFs adhere to the spec.
ExAC v0.3 says that DP_HIST and GQHIST are of type A, when in fact they should be of type R. A set of attributes named Het* are of type A in the header, but should be “.” (unbounded). (this will be fixed, https://github.com/konradjk/exac_browser/issues/164).
In ClinVar v20150203, the CLNSIG attribute is set to type “.” in the header. For most multiallelic variants, there is one field per ALT allele (which makes sense - different alleles can have different effect). But not all. Some multiallelic variants have a single CLNSIG value (for the curious, see 13:32911658). How should that be interpreted?
I don’t mean to bash on the great people who made all these resources available to the public, they’ve done a terrific job.
The point is that they are hard if not impossible to handle correctly when the external VCFs either do not adhere to spec, or “lazily” set "Number=.” which makes mapping to individual alleles impossible.
Oh, and the there’s the issue with matching called ref/alt alleles to external resources when there are both INSs and DELs in the same position. Not impossible, just challenging… ;)
Personally, I split external resource VCFs so there’s a single ALT allele on each line, and then left align and trim the variants. Doing the same on my called variants solves this, makes for strange genotypes that can be split over multiple rows (suboptimal, but ensures correct mapping of external annotations).
cheers Daniel
On 11 Feb 2015, at 14:02, Aaron Quinlan notifications@github.com wrote:
Well, I should have done it sooner, as it is indeed useful, ableit imperfect. I think maintaining the two biggest issues with the functionality is that (1), the genotypes for variants with multiple alleles are incorrect and (2), the INFO field's original order is not preserved (pickled as a dictionary instead of an ordered dictionary). I will be interested to see how useful the current output is...
— Reply to this email directly or view it on GitHub https://github.com/arq5x/gemini/issues/346#issuecomment-73877285.
good points @dakl. Do you mind sharing which tools you're using for splitting, left aligning and trimming variants? vcfallelicprimitives?
@chapmanb has something nice to normalize the variant file that might get you to where you want to be:
I'm using used a combination of bcftools norm
and GATK LeftAlignAndTrimVariants
:
bcftools norm -m - input.vcf.gz | bgzip > split.vcf.gz
tabix -p vcf split.vcf.gz
java -jar $GATKJAR -T LeftAlignAndTrimVariants -R $REF -V split.vcf.gz -o split-left.vcf.gz -trim
Note the -trim
parameter in the GATK cmd which will trim indels.
Example input.vcf.gz:
17 17697093 . CCAGCAGCAGCAG CCAGCAGCAG,C . . AC=1,1;AF=0.125,0.125;AN=8;set=variant75-variant82 GT ./.
split.vcf.gz
17 17697093 . CCAGCAGCAGCAG CCAGCAGCAG . . AC=1;AF=0.125;AN=8;set=variant75-variant82 GT ./.
17 17697093 . CCAGCAGCAGCAG C . . AC=1;AF=0.125;AN=8;set=variant75-variant82 GT ./.
split-left.vcf.gz:
17 17697093 . CCAG C . . AC=1;AF=0.125;AN=8;set=variant75-variant82 GT ./.
17 17697093 . CCAGCAGCAGCAG C . . AC=1;AF=0.125;AN=8;set=variant75-variant82 GT ./.
I'll test bcbio.variation as well.
Btw, bcftools splits the INFO field correctly but drops any sample columns, so it's useful for annotation/resource VCFs only. For splitting of genotypes I use https://github.com/moonso/vcf_parser which is fairly slow (a lot slower than bcftools) but this is usually no problem since I run it on a single sample or a T/N pair with a relatively limited number of variants. (vcf_parser works nicely on resource vcfs too, but is way slower for boatloads of variants)
GATK LeftAlignAndTrimVariants has a -split
option which splits multiallelic variants, but it doesn't split the INFO field correctly, and sets all genotypes for split variants to ./.
.
thanks for the detailed code and pointers!
Closing, as in my opinion, this has been resolved.
Motivated by your
--format tped
or--format json
feature, we'd like to request a--format vcf
feature. This would be extremely useful for using gemini to filter to a subset of relevant variants, which can then be used in downstream prioritisation tools. The resulting VCF doesn't necessarily need to contain all theINFO
andFORMAT
fields that would have been in the imported VCF, but at least a valid header andGT
would be a great starting point. Of courseDP
,GQ
andFILTER=PASS
etc would be great too.cheers, Mark C & Tony Roscioli