arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 118 forks source link

Extraction of the RD and AD genotype tags from input VCF #311

Open jsabeeler opened 10 years ago

jsabeeler commented 10 years ago

Hi,

I am having an issue with the extraction of the RD and AD genotype tags in my input VCF file to the 'gt_ref_depths' and 'gt_alt_depths' columns of my gemini database. I am using gemini (v0.8.0) on Ubuntu Server 12.04. The multisample VCF was generated using VarScan v2.3.7 and annotated with VEP.

Below is an example of the query I am running.

gemini query --header \
    -q "SELECT chrom, start, end, gene, ref, alt, \
        (gts).(*), (gt_depths).(*), (gt_ref_depths).(*), (gt_alt_depths).(*), (gt_quals).(*) FROM variants" \
            variants.vep.vcf.db 

Below is the output for the above query. The 'gts', 'gt_depths', and 'gt_quals' columns have been correctly imported from my input VCF file into the gemini database. The 'gt_ref_depths' and 'gt_alt_depths' columns in my gemini database are annotated with '-1' instead of the the corresponding values in the RD and AD genotype tags in my input VCF file.

chrom   start   end     gene    ref     alt     gts.2944-JP-9   gt_depths.2944-JP-9     gt_ref_depths.2944-JP-9 gt_alt_depths.2944-JP-9 gt_quals.2944-JP-9
chr1    14652   14653   DDX11L1 C       T       T/T     32      -1      -1      182.0
chr1    14676   14677   DDX11L1 G       A       G/A     45      -1      -1      63.0
chr1    14906   14907   DDX11L1 A       G       G/G     36      -1      -1      98.0
chr1    14929   14930   DDX11L1 A       G       G/G     44      -1      -1      143.0
chr1    14932   14933   DDX11L1 G       A       G/A     44      -1      -1      87.0

Below is the first 30 lines from my input VCF. Values for RD and AD genotype tags are present.

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=ADP,Number=1,Type=Integer,Description="Average per-sample depth of bases with Phred score >= 15">
##INFO=<ID=WT,Number=1,Type=Integer,Description="Number of samples called reference (wild-type)">
##INFO=<ID=HET,Number=1,Type=Integer,Description="Number of samples called heterozygous-variant">
##INFO=<ID=HOM,Number=1,Type=Integer,Description="Number of samples called homozygous-variant">
##INFO=<ID=NC,Number=1,Type=Integer,Description="Number of samples not called">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth as reported by SAMtools">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 15">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=PVAL,Number=1,Type=String,Description="P-value from Fisher's Exact Test">
##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting bases (qual1)">
##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting bases (qual2)">
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Consequence|Codons|Amino_acids|Gene|SYMBOL|Feature|EXON|PolyPhen|SIFT|Protein_position|BIOTYPE">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2944-JP-10  2944-JP-9
1   14653   .   C   T   .   PASS    ADP=26;WT=0;HET=0;HOM=2;NC=0;CSQ=downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000456328|||||processed_transcript,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000488147|||||unprocessed_pseudogene,non_coding_exon_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000541675|9/9||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000450305|||||transcribed_unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000515242|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000538476|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000518655|||||transcribed_unprocessed_pseudogene,non_coding_exon_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000438504|12/12||||unprocessed_pseudogene,non_coding_exon_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000423562|10/10||||unprocessed_pseudogene GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    1/1:98:20:20:1:19:95%:1.5234E-10:41:38:1:0:17:2 1/1:182:32:32:0:32:100%:5.4567E-19:0:38:0:0:28:4
1   14677   .   G   A   .   PASS    ADP=37;WT=0;HET=2;HOM=0;NC=0;CSQ=downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000456328|||||processed_transcript,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000488147|||||unprocessed_pseudogene,non_coding_exon_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000541675|9/9||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000450305|||||transcribed_unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000515242|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000538476|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000518655|||||transcribed_unprocessed_pseudogene,non_coding_exon_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000438504|12/12||||unprocessed_pseudogene,non_coding_exon_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000423562|10/10||||unprocessed_pseudogene GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:65:30:30:13:17:56.67%:3.0928E-7:31:30:12:1:13:4 0/1:63:45:45:27:18:40%:4.5278E-7:33:36:27:0:13:5
1   14907   .   A   G   .   PASS    ADP=39;WT=0;HET=0;HOM=2;NC=0;CSQ=downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000456328|||||processed_transcript,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000488147|||||unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000541675|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000450305|||||transcribed_unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000515242|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000538476|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000518655|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000438504|||||unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000423562|||||unprocessed_pseudogene GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    1/1:107:43:43:6:24:80%:1.647E-11:39:37:6:0:15:9 1/1:98:36:36:2:20:90.91%:1.3117E-10:31:38:2:0:14:6
1   14930   .   A   G   .   PASS    ADP=50;WT=0;HET=0;HOM=2;NC=0;CSQ=downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000456328|||||processed_transcript,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000488147|||||unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000541675|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000450305|||||transcribed_unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000515242|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000538476|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000518655|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000438504|||||unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000423562|||||unprocessed_pseudogene GT:GQ:SDP:DP:RD:AD:FREQ:PVAL

:RBQ:ABQ:RDF:RDR:ADF:ADR    1/1:152:56:56:9:34:79.07%:5.5428E-16:38:39:7:2:22:12    1/1:143:44:44:2:28:93.33%:4.194E-15:31:36:2:0:17:11
1   14933   .   G   A   .   PASS    ADP=50;WT=0;HET=2;HOM=0;NC=0;CSQ=downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000456328|||||processed_transcript,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000488147|||||unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000541675|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000450305|||||transcribed_unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000515242|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000538476|||||unprocessed_pseudogene,downstream_gene_variant|||ENSG00000223972|DDX11L1|ENST00000518655|||||transcribed_unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000438504|||||unprocessed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000227232|WASH7P|ENST00000423562|||||unprocessed_pseudogene GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:119:56:56:14:29:67.44%:1.1593E-12:37:38:10:4:18:11  0/1:87:44:44:9:21:70%:1.7919E-9:36:37:6:3:13:8

Any assistance would be appreciated. Thanks.

arq5x commented 10 years ago

This is the unfortunate consequence of the lack of a standard in the VCF format for reporting allele depths. The VCF parser that GEMINI uses currently supports ref and alt allele depths reported by GATK and FreeBayes (https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx#L252-L308). We have not implemented support for VarScan. I am a bit fearful of trying to support the strategies employed by every variant caller (and I have raised this issue with the folks that define the VCF standard). I am about to leave for vacation but can try to tackle this for you when I return in a couple of weeks. Sorry for the trouble - it is a big problem in the VCF standard.

Aaron

jsabeeler commented 10 years ago

Thanks for the explanation Aaron. I am starting to work with VCF files and hadn't realized the VCF format's lack of standardization in this area. I understand your reservations about supporting the VCF output for multiple variant callers. This is not a critical issue for me, but would be a nice feature if it didn't require too much time to add.

Best, Scott

arq5x commented 10 years ago

No sweat. If you need it soon, you could update the code I referenced in CyVCF to support VarScan's convention. All you would need to to do is mimic the logic already there and then install that version of CyVCF on your system so that GEMINI uses it. I would glad incorporate your changes. Otherwise, I can try to knock it out when I return.

jsabeeler commented 10 years ago

I'll give it a try. I'm pretty new to Python, but this could be a good learning experience. As a quick workaround I ended up melting the VCF, extracting the RD and AD genotype tags, and converting from a long to wide table that I could use to annotate the GEMINI ouput.