vcf2db "sqlalchemy.exc.InterfaceError (sqlite3.InterfaceError) Error binding parameter 72 - probably unsupported type." - caused by a specific variant record in the vcf file

bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

https://bcbio-nextgen.readthedocs.io

MIT License

994 stars 354 forks source link

vcf2db "sqlalchemy.exc.InterfaceError (sqlite3.InterfaceError) Error binding parameter 72 - probably unsupported type." - caused by a specific variant record in the vcf file #2567

Closed nvteja closed 5 years ago

nvteja commented 6 years ago

Hello,

I was running bcbio v1.1.0 on a set of ~90 samples and I noticed 3 of those samples failed at the same step (vcf2db) with the same exact error. Upon further inspection I noticed that this error is caused by the same variant in all these 3 samples, and removing that variant from these vcf files seems to bypass the error (i.e. vcf2db finished successfully). I am attaching the error log, and a slice of the vcf file including the variant here (line 171 in the attached vcf file).

Command used: vcf2db.py C3_4-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz C3_4-gatk-haplotype-nomultiallelic.ped C3_4-gatk-haplotype.db

BCBIO LOG: bcbio-nextgen.log

CONFIG FILE: C3_4.yaml.txt

VCF FILE: (variant causing the error is on line 171) C3_4-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.txt

PED FILE: C3_4-gatk-haplotype-nomultiallelic.ped.txt

Could you please help me with this error?

Thank you, Teja.

chapmanb commented 6 years ago

Teja; Thanks for the report and sorry about the issue. I'll have to bring in @brentp to help with this one. I see the problem, but not sure about the best solution. This is the failing VCF line:

1   152327436   .   TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC    T   1538.1  PASS    AC=1;AF=0.167;AN=6;BaseQRankSum=0.428;ClippingRankSum=0;DP=2838;ExcessHet=3.0103;FS=5.425;MLEAC=1;MLEAF=0.167;MQ=58.61;MQ0=0;MQRankSum=-5.847;QD=7.81;ReadPosRankSum=-0.91;SOR=1.273;CSQ=-|inframe_deletion|MODERATE|FLG2|ENSG00000143520|Transcript|ENST00000388718|protein_coding|3/3||ENST00000388718.5:c.2595_2825del|ENSP00000373370.4:p.Ser869_Thr945del|2668-2898/9124|2595-2825/7176|865-942/2391|SGQTSGFGQHRSSSGQYSGFGQHGSGSGQSSGFGQHGTGSGQYSGFGQHESRSHQSSYGQHGSGSSQSSGYGQHGSSS/S|tcGGGACAGACATCTGGCTTTGGACAACACAGGTCAAGCTCAGGTCAATACTCTGGCTTTGGACAACATGGATCAGGCTCAGGTCAGTCCAGTGGCTTTGGACAACATGGGACTGGCTCAGGACAATACTCTGGTTTTGGACAACATGAGTCTAGATCACATCAGTCTAGCTATGGCCAACATGGTTCTGGCTCAAGTCAGTCATCTGGCTATGGTCAACATGGGTCAAGTTCa/tca||1||-1||deletion|HGNC|33276|YES|||CCDS30861.1|ENSP00000373370|Q5D862||UPI00004E1DE5||Ensembl|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|||||hmmpanther:PTHR22571&hmmpanther:PTHR22571:SF24&Low_complexity_(Seg):seg&Low_complexity_(Seg):seg&Low_complexity_(Seg):seg||||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|ENSG00000237975|Transcript|ENST00000392688|antisense||6/6|ENST00000392688.2:n.1407-8994_1407-8764del||||||||1||1||deletion|HGNC|27913|YES|||||||||Ensembl|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|ENSG00000237975|Transcript|ENST00000445097|antisense||2/2|ENST00000445097.1:n.152-8994_152-8764del||||||||1||1||deletion|HGNC|27913||||||||||Ensembl|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||,-|inframe_deletion|MODERATE|FLG2|388698|Transcript|NM_001014342.2|protein_coding|3/3||NM_001014342.2:c.2595_2825del|NP_001014364.1:p.Ser869_Thr945del|2668-2898/9122|2595-2825/7176|865-942/2391|SGQTSGFGQHRSSSGQYSGFGQHGSGSGQSSGFGQHGTGSGQYSGFGQHESRSHQSSYGQHGSGSSQSSGYGQHGSSS/S|tcGGGACAGACATCTGGCTTTGGACAACACAGGTCAAGCTCAGGTCAATACTCTGGCTTTGGACAACATGGATCAGGCTCAGGTCAGTCCAGTGGCTTTGGACAACATGGGACTGGCTCAGGACAATACTCTGGTTTTGGACAACATGAGTCTAGATCACATCAGTCTAGCTATGGCCAACATGGTTCTGGCTCAAGTCAGTCATCTGGCTATGGTCAACATGGGTCAAGTTCa/tca||1||-1||deletion|EntrezGene|33276|YES||||NP_001014364.1|||||RefSeq|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|OK||||||||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|339400|Transcript|NR_103778.1|misc_RNA||6/6|NR_103778.1:n.1407-8994_1407-8764del||||||||1||1||deletion|EntrezGene|27913|YES|||||||||RefSeq|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|339400|Transcript|NR_103779.1|misc_RNA||2/2|NR_103779.1:n.152-8994_152-8764del||||||||1||1||deletion|EntrezGene|27913||||||||||RefSeq|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||;gnomad_ac_es=2;gnomad_hom_es=0;gnomad_af_es=0.00011256;gnomad_an_es=245514,225528;gnomad_ac_gs=78;gnomad_af_gs=0.016779;gnomad_an_gs=30504;gnomad_ac=80;gnomad_hom=0;gnomad_af_popmax=0.016779;gnomad_an=30504;gnomad_af=0.0026226  GT:AD:DP:GQ:PL  0/1:34,163:197:99:1578,0,668    0/0:539,108:647:99:0,690,21261  0/0:1370,6:1376:99:0,4080,57462

and it looks like the issue is that gnomad_an_es is a tuple of two values, rather than an expected single value (245514, 225528). Brent, is there a way to handle this case in vcf2db, or should we be trying to work around with our vcfanno recipe? Thanks for any thoughts/suggestions about how best to tackle.

brentp commented 6 years ago

was vcfanno run on this before it was decomposed? and/or what were the parameters used to run vcfanno? I don't see how this would be getting annotated with 2 values.

chapmanb commented 6 years ago

Brent and Teja; I did some GitHub searching and realized this is probably coming from this custom vcfanno script:

https://github.com/naumenko-sa/cre/blob/8fe4efc63b0ce2ad0825f547daa25bf9c5b69f81/cre.vcfanno.conf#L4

My suggestion would be to swap that to use first rather than self for the ops which should only annotate with a single count for this case (where gnomad apparently has multiple alleles). Would that be a reasonable workaround?

nvteja commented 6 years ago

Thank you for the reply @chapmanb and @brentp

Based on the command log, it looks like the vcf was decomposed before vcfanno. I am attaching the command log and vcfanno conf here (I shortened file paths in the commands for readability).

bcbio-nextgen-commands.log cre.vcfanno.lua.txt

Thank you, Teja

nvteja commented 6 years ago

Oh Sorry @chapmanb , just saw your response. I will try your suggestion and will let you know if it works. Also, tagging my colleague @naumenko-sa

brentp commented 6 years ago

hi @nvteja , based on your logs, I would add sed -e 's/Number=A/Number=1/g' to change your VCF before running vcfanno (and after decomposing).

naumenko-sa commented 6 years ago

Thank you all!

The source of the problem seems to be that in the gnomad Exome vcf bcbio/gemini_data/gnomad.exomes.r2.0.1.sites.no-VEP.nohist.tidy.vcf.gz there are two identical (chr-pos-alt-ref) variants at 1:152327436 (the first and the third). And they have different AN values in the INFO tags. That is why vcfanno returns a tuple. Probably it is gnomad.exomes bug - because we don't expect duplicates of the same allele after decomposition and normalization?

1   152327436   .   TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC    T   4.02656e+07 PASS    AC=8125;AF=0.0330938;AN=245514;BaseQRankSum=1.9;ClippingRankSum=0.048;DP=25137064;FS=0;InbreedingCoeff=0.7984;MQ=56.61;MQRankSum=1.25;QD=20.29;ReadPosRankSum=0.985;SOR=0.704;VQSLOD=2.18;VQSR_culprit=FS;AC_AFR=12;AC_AMR=2398;AC_ASJ=404;AC_EAS=2473;AC_FIN=124;AC_NFE=1160;AC_OTH=198;AC_SAS=1356;AC_Male=4221;AC_Female=3904;AN_AFR=15292;AN_AMR=33356;AN_ASJ=9826;AN_EAS=17098;AN_FIN=22292;AN_NFE=111504;AN_OTH=5466;AN_SAS=30680;AN_Male=134540;AN_Female=110974;AF_AFR=0.000784724;AF_AMR=0.0718911;AF_ASJ=0.0411154;AF_EAS=0.144637;AF_FIN=0.00556253;AF_NFE=0.0104032;AF_OTH=0.0362239;AF_SAS=0.0441982;AF_Male=0.0313736;AF_Female=0.0351794;AC_raw=8495;AN_raw=246272;AF_raw=0.0344944;Hom_AFR=4;Hom_AMR=1015;Hom_ASJ=188;Hom_EAS=1042;Hom_FIN=59;Hom_NFE=485;Hom_OTH=82;Hom_SAS=584;Hom_Male=1804;Hom_Female=1655;Hom_raw=3524;Hom=3459;POPMAX=EAS;AC_POPMAX=2473;AN_POPMAX=17098;AF_POPMAX=0.144637;DP_MEDIAN=190;DREF_MEDIAN=0;GQ_MEDIAN=99;AB_MEDIAN=0.851619;AS_RF=0.934857;AS_FilterStatus=PASS;OLD_MULTIALLELIC=1:152327436:TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC/T/CGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC
1   152327436   .   T   C   4.02656e+07 PASS    AC=0;AF=0;AN=245514;BaseQRankSum=1.9;ClippingRankSum=0.048;DP=25137064;FS=0;InbreedingCoeff=0.7984;MQ=56.61;MQRankSum=1.25;QD=20.29;ReadPosRankSum=0.985;SOR=0.704;VQSLOD=2.18;VQSR_culprit=FS;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=15292;AN_AMR=33356;AN_ASJ=9826;AN_EAS=17098;AN_FIN=22292;AN_NFE=111504;AN_OTH=5466;AN_SAS=30680;AN_Male=134540;AN_Female=110974;AF_AFR=0;AF_AMR=0;AF_ASJ=0;AF_EAS=0;AF_FIN=0;AF_NFE=0;AF_OTH=0;AF_SAS=0;AF_Male=0;AF_Female=0;AC_raw=5;AN_raw=246272;AF_raw=2.03028e-05;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=0;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=210;DREF_MEDIAN=1.5849e-40;GQ_MEDIAN=99;AB_MEDIAN=0.108696;AS_RF=0.164612;AS_FilterStatus=AC0;OLD_MULTIALLELIC=1:152327436:TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC/T/CGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC;OLD_VARIANT=1:152327436:TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC/CGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC
1   152327436   rs76514540  TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC    T   1.45778e+08 PASS    AC=2;AF=8.86808e-06;AN=225528;BaseQRankSum=-2.754;ClippingRankSum=0.003;DP=32448557;FS=0;InbreedingCoeff=0.1146;MQ=58.42;MQRankSum=-0.275;QD=10.94;ReadPosRankSum=-0.384;SOR=0.721;VQSLOD=4.72;VQSR_culprit=FS;AC_AFR=0;AC_AMR=0;AC_ASJ=1;AC_EAS=0;AC_FIN=0;AC_NFE=1;AC_OTH=0;AC_SAS=0;AC_Male=1;AC_Female=1;AN_AFR=15028;AN_AMR=29174;AN_ASJ=8884;AN_EAS=14190;AN_FIN=21414;AN_NFE=104486;AN_OTH=4962;AN_SAS=27390;AN_Male=123716;AN_Female=101812;AF_AFR=0;AF_AMR=0;AF_ASJ=0.000112562;AF_EAS=0;AF_FIN=0;AF_NFE=9.57066e-06;AF_OTH=0;AF_SAS=0;AF_Male=8.08303e-06;AF_Female=9.82202e-06;AC_raw=2;AN_raw=244696;AF_raw=8.17341e-06;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=0;Hom=0;STAR_AC=8070;STAR_AC_raw=8437;STAR_Hom=3456;POPMAX=ASJ;AC_POPMAX=1;AN_POPMAX=8884;AF_POPMAX=0.000112562;DP_MEDIAN=71;DREF_MEDIAN=2.50581e-05;GQ_MEDIAN=71;AB_MEDIAN=0.500379;AS_RF=0.248196;AS_FilterStatus=PASS;OLD_MULTIALLELIC=1:152327469:CTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/TTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/ATGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/T;OLD_VARIANT=1:152327469:CTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/T

Using first instead of self in the vcfanno config solves the problem, thanks!

Sergey

brentp commented 6 years ago

Sergey, thanks for figuring that out! It was stumping me how it could happen as I only noticed the first 2 variants in gnomad and didn't see that the long deletion was repeated.

matthdsm commented 5 years ago

Hi Brent, Brad

I'm running into a similar issue. Is it possible to provide some kind of default to catch this error? In attachment you can find the "problem" data. My guess is it's also coming from the gnomad dataset.

I'm getting this error:

/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:255: SAWarning: Unicode type received non-unicode bind param value ''. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),),
/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:255: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),),
Traceback (most recent call last):
  File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 924, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 234, in __init__
    self.load()
  File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 319, in load
    i = self._load(self.cache, create=True, start=1)
  File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 312, in _load
    self.insert(variants, expanded, keys, i, create=create)
  File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 374, in insert
    vilengths, variant_impacts)
  File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 402, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 444, in __insert
    trans.execute(stmt, o)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 980, in execute
    return meth(self, multiparams, params)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 273, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1099, in _execute_clauseelement
    distilled_params,
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1240, in _execute_context
    e, statement, parameters, cursor, context
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1458, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 296, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
    cursor, statement, parameters, context
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 536, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 108 - probably unsupported type. [SQL: u'INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, ac, af, an, baseqranksum, clippingranksum, db, decomposed, dp, ds, excesshet, fs, mleac, mleaf, mq, mq0, mqranksum, qd, readposranksum, sor, ac_eog, ac_adj_exac_afr, ac_adj_exac_amr, ac_adj_exac_eas, ac_adj_exac_fin, ac_adj_exac_nfe, ac_adj_exac_oth, ac_adj_exac_sas, ac_exac_all, af_adj_exac_afr, af_adj_exac_amr, af_adj_exac_eas, af_adj_exac_fin, af_adj_exac_nfe, af_adj_exac_oth, af_adj_exac_sas, af_esp_aa, af_esp_all, af_esp_ea, af_exac_all, an_eog, an_adj_exac_afr, an_adj_exac_amr, an_adj_exac_eas, an_adj_exac_fin, an_adj_exac_nfe, an_adj_exac_oth, an_adj_exac_sas, an_exac_all, common_pathogenic, gc_het_alt_eog, gc_hom_alt_eog, gc_hom_ref_eog, gnomad_ac, gnomad_ac_amr, gnomad_ac_asj, gnomad_ac_eas, gnomad_ac_fin, gnomad_ac_nfe, gnomad_ac_oth, gnomad_ac_popmax, gnomad_ac_sas, gnomad_af, gnomad_af_afr, gnomad_af_amr, gnomad_af_asj, gnomad_af_eas, gnomad_af_fin, gnomad_af_nfe, gnomad_af_oth, gnomad_af_popmax, gnomad_af_sas, gnomad_an, gnomad_an_amr, gnomad_an_asj, gnomad_an_eas, gnomad_an_fin, gnomad_an_nfe, gnomad_an_oth, gnomad_an_popmax, gnomad_an_sas, gnomad_gc, gnomad_gc_female, gnomad_gc_male, gnomad_hom, gnomad_hom_female, gnomad_hom_male, gnomad_popmax, gnomad_exomes_af, gnomad_genomes_af, max_aaf_all, num_exac_het, num_exac_hom, rs_ids, allele, feature_type, intron, hgvsc, hgvsp, cdna_position, cds_position, existing_variation, allele_num, distance, strand, flags, variant_class, symbol_source, hgnc_id, tsl, appris, ccds, ensp, swissprot, trembl, uniparc, refseq_match, source, given_ref, used_ref, bam_edit, gene_pheno, domains, hgvs_offset, afr_af, amr_af, eas_af, eur_af, sas_af, aa_af, ea_af, gnomad_afr_af, gnomad_amr_af, gnomad_asj_af, gnomad_eas_af, gnomad_fin_af, gnomad_nfe_af, gnomad_oth_af, gnomad_sas_af, max_af, max_af_pops, clin_sig, somatic, pheno, pubmed, motif_name, motif_pos, high_inf_pos, motif_score_change, lof, lof_filter, lof_flags, lof_info, maxentscan_alt, maxentscan_diff, maxentscan_ref, spliceregion, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (1, u'chr10', 50099997, 50099998, u'rs61856727', u'A', u'G', 862.7999877929688, None, u'snp', 'ts', 1.0, 0, 1, 0, 0, 0.5, u'WASHC2A', None, u'ENST00000282633', 1, 1, 0, 0, 1, u'17/31', u'aaA/aaG', u'K', u'523/1341', u'protein_coding', u'synonymous_variant', u'synonymous_variant', 'LOW', u'', None, u'', None, 1, u'', 2, -1.9780000448226929, 0.0, 1, 0, 93, 0, 3.0102999210357666, 5.3460001945495605, 1, 0.5, 43.959999084472656, 0, 0.4519999921321869, 9.279999732971191, 1.2170000076293945, 1.3240000009536743, u'A:1111&G:719', 2398.0, 3311.0, 1240.0, 2809.0, 22960.0, 312.0, 4281.0, 37311.0, 0.2791000008583069, 0.29339998960494995, 0.1525000035762787, 0.43689998984336853, 0.367000013589859, 0.3652999997138977, 0.27239999175071716, -1.0, -1.0, -1.0, 0.32850000262260437, 1830.0, 8592.0, 11286.0, 8130.0, 6430.0, 62556.0, 854.0, 15716.0, 113564.0, 0, 523, 98, 294, 20896, 3098, 452, 949, 4318, 9166, 504, 4318, 1344, u'0.3182', 0.19850000739097595, 0.3197999894618988, 0.3368000090122223, 0.12380000203847885, 0.4163999855518341, 0.3831000030040741, 0.329800009727478, 0.4163999855518341, 0.2320999950170517, (187800, 65674), (22212, 9688), (7034, 1342), (11932, 7664), (19040, 10370), (88556, 23926), (4012, 1528), 10370, (23084, 5790), 126737.0, 58367.0, 68370.0, 855, 449, 406, u'FIN', -1.0, -1.0, 0.43689998984336853, 31999.0, 2656.0, u'rs61856727', u'G', u'Transcript', u'', u'ENST00000282633.9:c.1569A>G', u'ENSP00000282633.5:p.Lys523%3D', u'1614/4272', u'1569/4026', u'rs61856727', u'1', u'', u'1', u'', u'SNV', u'HGNC', u'HGNC:23416', u'1', u'P3', u'CCDS41527.1', u'ENSP00000282633', u'Q641Q2', u'', u'UPI000044FEAB', u'', u'Ensembl', u'A', u'A', u'', u'', u'hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF24', u'', u'', u'', u'', u'', u'', u'', u'', u'0.1985', u'0.3198', u'0.3368', u'0.1238', u'0.4164', u'0.3831', u'0.3298', u'0.2321', u'0.4164', u'gnomAD_FIN', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'DE_NOVO_DONOR_PROB:0.0794222613266201&EXON_END:50100064&MUTANT_DONOR_MES:8.97796800884188&DE_NOVO_DONOR_MES_POS:-71&DE_NOVO_DONOR_MES:-1.85392529798293&INTRON_END:50104041&EXON_START:50099978&INTRON_START:50100065&DE_NOVO_DONOR_POS:-71', u'', u'', u'', u'', <read-only buffer for 0x7f8ae44084e0, size -1, offset 0 at 0x7f8ae37d3570>, <read-only buffer for 0x7f8ae4408450, size -1, offset 0 at 0x7f8ae37d35b0>, <read-only buffer for 0x7f8ae4408540, size -1, offset 0 at 0x7f8ae37d35f0>, <read-only buffer for 0x7f8ae44085d0, size -1, offset 0 at 0x7f8ae37d3630>, <read-only buffer for 0x7f8ae4408630, size -1, offset 0 at 0x7f8ae37d3670>, <read-only buffer for 0x7f8ae44086c0, size -1, offset 0 at 0x7f8ae37d36b0>, <read-only buffer for 0x7f8ae4408720, size -1, offset 0 at 0x7f8ae37d36f0>, <read-only buffer for 0x7f8ae4408780, size -1, offset 0 at 0x7f8ae37d3730>)] (Background on this error at: http://sqlalche.me/e/rvf5)

And here's the data: vcf2db_error.zip

Thanks for the help.

Cheers M

matthdsm commented 5 years ago

I've identified multiple variants causing issues when generating the db:

rs35465208
rs61856727
rs4076899
rs367813564
rs59195224
rs2072564
rs773222534
rs11669847
rs11669854
rs11669884
rs61744491
rs61737632
rs762214476
rs143680639

Decomposing and normalizing the gnomad dataset doesn't fix the problem. Anyone have any ideas?

Thanks M

brentp commented 5 years ago

the solution is to just use "first". vcfanno tries to do the right thing for so many cases, for this case, where a variant is repeated in gnomad, it's easy enough to use "first"

pfpjs commented 5 years ago

This issue looks quite similar to the one I hit in issue #2635. Looking at @matthdsm's error, it fails because the following fields have two values instead of one in gnomAD's VCFs: gnomad_an, gnomad_an_amr, gnomad_an_asj, gnomad_an_eas, gnomad_an_fin, gnomad_an_nfe, gnomad_an_oth, gnomad_an_popmax, gnomad_an_sas

The field gnomad_an is parameter 108 in the SQAlchemy error, and it has the values (187800, 65674). In gnomAD's page for this particular variant, we see that 65674 is the alllele number with this variant, while 187800 is probably the total allele number (just a guess).

So, @brentp's suggestion of using first instead of self might not be the best option here, as it would give the total allele number out of all gnomAD's exomes and genomes, and not the allele number with our variant of interest.

gnomAD's VCF header says this field should have one value only:

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in call
ed genotypes">

Not sure exactly what the best option here is, just a warning that, while using first in vcfanno would fix the error, it might give an incorrect result.

-- Paulo

matthdsm commented 5 years ago

I fully agree with Paulo. Just grabbing the first entry seems a tad dangerous to me. We could be missing out on just that one piece of data we're looking for. If we HAVE to select one, I'd suggest selecting the one with the lowest AF (e.g. the rarest one), so we can at least catch those.

Thoughts?

Thanks M

matthdsm commented 5 years ago

Hi Brent, Just a question. I'm getting this warning

WARNING: using op 'self' when with Number='1' for 'AC_ACR' from '/data/gent/vo/000/gvo00082/vsc41443/gnomad/variation/tmp/gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz' can result in out-of-order values when the query is multi-allelic
api.go:805: this is not an issue if the query has been decomposed.

Could you explain a bit please? Does this mean that vcfanno will extract the correct data from the "source" file (gnomad in this case) if the "query" (the file you're annotating) is decomposed? Or am I interpreting this wrong?

matthdsm commented 5 years ago

I just opened a PR #2673, which fixes some errors in the field names. I tried running the pipeline with several configs and found the following:

the Gnomad vcf doesn't really need to be decomposed and normalized to be used (@brentp, could you confirm this?
All sql errors go away when using self on all fields, except the AN and GC fields

If you're okay with this, I'll add the modifications to my PR.

Cheers M

brentp commented 5 years ago

yeah. I found that the latest vcfs are already decomposed and normalized.

matthdsm commented 5 years ago

Hi Brent,

Thanks for chiming in. Could you specify what you mean by latest? For example, bcbio uses v2.0.1, lifted over by Ensembl. Is this one of the versions already preprocessed?

Thanks M

brentp commented 5 years ago

I don't know about 2.0.1, but 2.1 is preprocessed.

roryk commented 5 years ago

Sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 110 - probably unsupported type. [SQL: u'INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, ac, af, an, baseqranksum, clippingranksum, db, decomposed, dp, ds, excesshet, fs, len, lof, mleac, mleaf, mq, mq0, mqranksum, nmd, old_multiallelic, old_variant, qd, readposranksum, sor, ac_adj_exac_afr, ac_adj_exac_amr, ac_adj_exac_eas, ac_adj_exac_fin, ac_adj_exac_nfe, ac_adj_exac_oth, ac_adj_exac_sas, ac_exac_all, af_adj_exac_afr, af_adj_exac_amr, af_adj_exac_eas, af_adj_exac_fin, af_adj_exac_nfe, af_adj_exac_oth, af_adj_exac_sas, af_esp_aa, af_esp_all, af_esp_ea, af_exac_all, an_adj_exac_afr, an_adj_exac_amr, an_adj_exac_eas, an_adj_exac_fin, an_adj_exac_nfe, an_adj_exac_oth, an_adj_exac_sas, an_exac_all, clinvar_disease_name, clinvar_sig, common_pathogenic, gnomad_ac, gnomad_ac_amr, gnomad_ac_asj, gnomad_ac_eas, gnomad_ac_fin, gnomad_ac_nfe, gnomad_ac_oth, gnomad_ac_popmax, gnomad_ac_sas, gnomad_af, gnomad_af_afr, gnomad_af_amr, gnomad_af_asj, gnomad_af_eas, gnomad_af_fin, gnomad_af_nfe, gnomad_af_oth, gnomad_af_popmax, gnomad_af_sas, gnomad_an, gnomad_an_amr, gnomad_an_asj, gnomad_an_eas, gnomad_an_fin, gnomad_an_nfe, gnomad_an_oth, gnomad_an_popmax, gnomad_an_sas, gnomad_gc, gnomad_gc_female, gnomad_gc_male, gnomad_hom, gnomad_hom_female, gnomad_hom_male, gnomad_popmax, max_aaf_all, num_exac_het, num_exac_hom, rs_ids, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (441616, u'chr10', 50095647, 50095648, u'rs1273688830', u'A', u'G', 71.5999984741211, None, 'snp', 'ts', 1.0, 0, 1, 0, 0, 0.5, u'FAM21A', None, u'ENST00000282633.9', 1, 1, 0, 0, 0, u'15/31', u'c.1290A>G', u'p.Pro430Pro', 1341, u'protein_coding', u'synonymous_variant', u'synonymous_variant', 'LOW', None, None, None, None, 1, 0.5, 2, -0.6740000247955322, 0.0, 1, 0, 4, 0, 3.0102999210357666, 0.0, None, 'None', 1, 0.5, 57.65999984741211, 0, -0.3190000057220459, 'None', None, 'None', 17.899999618530273, 0.0, 0.9160000085830688, 408.0, 99.0, 120.0, 153.0, 1206.0, 10.0, 231.0, 2227.0, 0.11620000004768372, 0.02759999968111515, 0.03550000116229057, 0.08110000193119049, 0.05570000037550926, 0.03359999880194664, 0.03060000017285347, 0.06895402818918228, 0.04439403861761093, 0.032608695328235626, 0.053199999034404755, 3512.0, 3592.0, 3382.0, 1886.0, 21650.0, 298.0, 7554.0, 41874.0, None, None, 0, 6331, 537, 312, 304, 786, 2850, 133, 894, 515, 0.04010000079870224, 0.09629999846220016, 0.023099999874830246, 0.042399998754262924, 0.025100000202655792, 0.05209999904036522, 0.04259999841451645, 0.0364999994635582, 0.09629999846220016, 0.02539999969303608, (105416, 157826), (12922, 23204), (4264, 7354), (6672, 12128), (11754, 15082), (48900, 66842), (2440, 3646), 9284, (13374, 20286), 131621.0, 59890.0, 71731.0, 1493, 679, 814, u'AFR', 0.11620000004768372, 1479.0, 374.0, u'rs1273688830', <read-only buffer for 0x7fd82e3a91e0, size -1, offset 0 at 0x7fd8280e68b0>, <read-only buffer for 0x7fd82e3a91b0, size -1, offset 0 at 0x7fd8280e6870>, <read-only buffer for 0x7fd82e3a9210, size -1, offset 0 at 0x7fd8280e6830>, <read-only buffer for 0x7fd82e3a9240, size -1, offset 0 at 0x7fd8280e67f0>, <read-only buffer for 0x7fd82e3a9270, size -1, offset 0 at 0x7fd8280e67b0>, <read-only buffer for 0x7fd82e3a92a0, size -1, offset 0 at 0x7fd8280e6770>, <read-only buffer for 0x7fd82e3a92d0, size -1, offset 0 at 0x7fd8280e6670>, <read-only buffer for 0x7fd82e3a9300, size -1, offset 0 at 0x7fd8280e6730>)] (Background on this error at: http://sqlalche.me/e/rvf5)

I'm still seeing these errors. This one looks like gnomAD_AF_SAS which is a tuple of values and not a single value.

The vcfanno config looks like it is set for self:

[[annotation]]
file="/n/app/bcbio/dev/genomes/Hsapiens/hg38/variation/gnomad_exome.vcf.gz"
fields=["AF","AF_AFR","AF_AMR","AF_ASJ","AF_EAS","AF_FIN","AF_NFE","AF_OTH","AF_SAS","AF_POPMAX","AC","AC_ACR","AC_AMR","AC_ASJ","AC_EAS","AC_FIN","AC_NFE","AC_OTH","AC_SAS","AC_POPMAX","AN","AN_ANR","AN_AMR","AN_ASJ","AN_EAS","AN_FIN","AN_NFE","AN_OTH","AN_SAS","AN_POPMAX","POPMAX","GC","GC_Male","GC_Female","Hom","Hom_Male","Hom_Female"]
names=["gnomAD_AF","gnomAD_AF_AFR","gnomAD_AF_AMR","gnomAD_AF_ASJ","gnomAD_AF_EAS","gnomAD_AF_FIN","gnomAD_AF_NFE","gnomAD_AF_OTH","gnomAD_AF_SAS","gnomAD_AF_POPMAX","gnomAD_AC","gnomAD_AC_ACR","gnomAD_AC_AMR","gnomAD_AC_ASJ","gnomAD_AC_EAS","gnomAD_AC_FIN","gnomAD_AC_NFE","gnomAD_AC_OTH","gnomAD_AC_SAS","gnomAD_AC_POPMAX","gnomAD_AN","gnomAD_AN_ANR","gnomAD_AN_AMR","gnomAD_AN_ASJ","gnomAD_AN_EAS","gnomAD_AN_FIN","gnomAD_AN_NFE","gnomAD_AN_OTH","gnomAD_AN_SAS","gnomAD_AN_POPMAX","gnomAD_POPMAX","gnomAD_GC","gnomAD_GC_Male","gnomAD_GC_Female","gnomAD_Hom","gnomAD_Hom_Male","gnomAD_Hom_Female"]
ops=["self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","sum","sum","sum","self","self","self"]

Could someone point me in the right direction for further debugging? Thanks!

naumenko-sa commented 5 years ago

Hi @roryk!

Probably, it is because gnomad_exome.vcf.gz has two identical records at chr10:50095648

tabix gnomad_exome.vcf.gz chr10:50095647-50095648 | cut -f 1-5
chr10   50095648    .   A   G
chr10   50095648    rs201672329 A   G

I've added vt uniq to the ggd recipe of gnomad_exome.vcf.gz in hg38. https://github.com/chapmanb/cloudbiolinux/pull/286. Please let me know if that would fix the problem for you, I'll then fix other vcf recipes.

Sergey

matthdsm commented 5 years ago

Hi @naumenko-sa , @roryk

I'm not really a fan of this... seems like an even worse solution than using the "first" op in vcfanno. When using uniq you're essentially removing variants based on the order they appear. When annotating this will cause us to miss all of those special cases..

Wouldn't it make more sense to awk away all records with an empty ID field? I've noticed all problematic variants have one record with an rsID and one without.

matthdsm commented 5 years ago

@roryk As for you debug question, I don't know what you're looking for exactly, but I'd suggest changing the ops for your problematic field to min, in compliance with the AN_* fields. That way, you're sure you're annotating with the rarest variant.

chapmanb commented 5 years ago

Matthias, Sergey and Rory; Apologies, I'd just merged Sergey's fix ahead of your comments. We can revert that and update the vcfanno configuration either if that makes more sense. I don't have a lot of experience with these special cases in practice so happy to defer to whatever the consensus is.

matthdsm commented 5 years ago

Hi Brad, I'm happy to with the consensus, but outright removing data seems a tad harsh to me. This also removes to possibility for users to tweak the vcfanno config to whatever use case they need. I'd rather keep the data as complete as possible and filter unneeded data, than deleting it.

Thanks M

pfpjs commented 5 years ago

@matthdsm you recently mentioned using the raw Ensembl gnomAD VCFs (GRCh37 and hg38) for annotating directly with vcfanno. Would this issue also show up when using those files?

matthdsm commented 5 years ago

Yes, the issue remains the same. I only mentioned that because when using the raw ensembl file, we'd be able to cut down on processing time during the installation. I didn't notice any differences between the "raw" and "processed" gnomad exome file. M

naumenko-sa commented 5 years ago

Hello everyone!

Thanks for the discussion, it helped me to learn a lot more about the issue!

Gnomad 2.1 from the gnomad website: https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr10.vcf.bgz is indeed clean, sorted and normalized vcf file, see bcftools stats for chr10:

# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  0
SN  0   number of records:  665745
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 619617
SN  0   number of MNPs: 0
SN  0   number of indels:   46128
SN  0   number of others:   0
SN  0   number of multiallelic sites:   0
SN  0   number of multiallelic SNP sites:   0

There is no issue with duplicated chr10 variant in it:

tabix gnomad.exomes.r2.1.sites.chr10.vcf.bgz chr10:50095648-50095648
<EMPTY OUTPUT>

However, gnomad 2.1 from gnomad weights 60G, it contains many tags and VEP annotation. Ensembl provides 5G gnomad without VEP annotation. Cloudbiolinux GGD recipe pulls gnomad_exome from ensembl: https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg38/gnomad_exome.yaml

Ensembl's gnomad_exome is v2.0.1, it is not sorted, not decomposed, and not normalized, and it has an issue with duplicated variants (bcftools stats for chr10):

# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  0
SN  0   number of records:  583100
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 554086
SN  0   number of MNPs: 66
SN  0   number of indels:   40054
SN  0   number of others:   338
SN  0   number of multiallelic sites:   66250
SN  0   number of multiallelic SNP sites:   52822

Issue with duplicated variant:

tabix gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.sorted.chr10.vcf.gz chr10:50095648-50095648
chr10   50095648    .   A   G   7379.2  RF;AC0  AC=0;AF=0;AN=105416;BaseQRankSum=0;ClippingRankSum=0.196;DP=2665618;FS=74.3;InbreedingCoeff=0.0209;MQ=23.95;MQRankSum=0;QD=6.76;ReadPosRankSum=0.358;SOR=4.733;VQSLOD=-221.9;VQSR_culprit=FS;GQ_HIST_ALT=2|44|9|12|2|2|0|4|6|4|0|0|4|1|0|1|0|0|0|0;DP_HIST_ALT=65|22|0|0|0|1|0|2|1|0|0|0|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;GQ_HIST_ALL=5197|7001|4213|8918|7852|3162|5118|4039|1689|2849|2482|1151|3003|499|1655|845|2133|299|2102|15845;DP_HIST_ALL=10460|13434|12411|7664|6367|5948|6206|6267|3998|2901|2040|1316|633|275|85|29|9|4|1|1;AB_HIST_ALL=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=5090;AN_AMR=12922;AN_ASJ=4264;AN_EAS=6672;AN_FIN=11754;AN_NFE=48900;AN_OTH=2440;AN_SAS=13374;AN_Male=57722;AN_Female=47694;AF_AFR=0;AF_AMR=0;AF_ASJ=0;AF_EAS=0;AF_FIN=0;AF_NFE=0;AF_OTH=0;AF_SAS=0;AF_Male=0;AF_Female=0;GC_AFR=2545,0,0;GC_AMR=6461,0,0;GC_ASJ=2132,0,0;GC_EAS=3336,0,0;GC_FIN=5877,0,0;GC_NFE=24450,0,0;GC_OTH=1220,0,0;GC_SAS=6687,0,0;GC_Male=28861,0,0;GC_Female=23847,0,0;AC_raw=147;AN_raw=160104;AF_raw=0.000918153;GC_raw=79961,35,56;GC=52708,0,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=56;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=3;DREF_MEDIAN=6.34858e-07;GQ_MEDIAN=9;AB_MEDIAN=0.5;AS_RF=0.0766183;AS_FilterStatus=RF|AC0
chr10   50095648    rs201672329 A   G   4.59591e+06 PASS    AC=6331;AF=0.0401138;AN=157826;BaseQRankSum=-0.494;ClippingRankSum=0.024;DB;DP=4164344;FS=0.536;InbreedingCoeff=0.3791;MQ=26.03;MQRankSum=0;QD=14.26;ReadPosRankSum=0.267;SOR=0.634;VQSLOD=1.79;VQSR_culprit=MQ;GQ_HIST_ALT=41|165|136|275|231|113|194|163|96|131|118|105|137|122|86|130|146|82|106|3409;DP_HIST_ALT=236|828|885|775|583|415|320|232|208|205|181|197|162|143|100|73|66|46|30|24;AB_HIST_ALT=0|0|10|24|70|142|183|337|613|589|785|410|253|105|36|45|44|47|0|0;GQ_HIST_ALL=3343|6713|4663|10424|9906|4535|7763|6741|2764|4784|3731|1595|4284|781|2125|1264|2833|378|2571|25713;DP_HIST_ALL=7947|16027|16329|13266|10848|8104|6717|6691|5335|4890|3734|2685|1589|919|501|309|191|122|86|62;AB_HIST_ALL=0|0|10|24|70|142|183|337|613|589|784|410|254|105|36|45|44|47|0|0;AC_AFR=894;AC_AMR=537;AC_ASJ=312;AC_EAS=304;AC_FIN=786;AC_NFE=2850;AC_OTH=133;AC_SAS=515;AC_Male=3397;AC_Female=2934;AN_AFR=9284;AN_AMR=23204;AN_ASJ=7354;AN_EAS=12128;AN_FIN=15082;AN_NFE=66842;AN_OTH=3646;AN_SAS=20286;AN_Male=85740;AN_Female=72086;AF_AFR=0.0962947;AF_AMR=0.0231426;AF_ASJ=0.0424259;AF_EAS=0.025066;AF_FIN=0.0521151;AF_NFE=0.0426379;AF_OTH=0.0364783;AF_SAS=0.025387;AF_Male=0.0396198;AF_Female=0.0407014;GC_AFR=3952,486,204;GC_AMR=11178,311,113;GC_ASJ=3436,170,71;GC_EAS=5818,188,58;GC_FIN=6979,338,224;GC_NFE=31239,1514,668;GC_OTH=1716,81,26;GC_SAS=9757,257,129;GC_Male=40287,1769,814;GC_Female=33788,1576,679;AC_raw=8279;AN_raw=213822;AF_raw=0.0387191;GC_raw=100925,3693,2293;GC=74075,3345,1493;Hom_AFR=204;Hom_AMR=113;Hom_ASJ=71;Hom_EAS=58;Hom_FIN=224;Hom_NFE=668;Hom_OTH=26;Hom_SAS=129;Hom_Male=814;Hom_Female=679;Hom_raw=2293;Hom=1493;POPMAX=AFR;AC_POPMAX=894;AN_POPMAX=9284;AF_POPMAX=0.0962947;DP_MEDIAN=22;DREF_MEDIAN=6.30957e-33;GQ_MEDIAN=99;AB_MEDIAN=0.481481;AS_RF=0.865029;AS_FilterStatus=PASS

Hopefully, Ensembl will soon sync their vcf with gnomad2.1 and we will not need processing the vcf in cloudbiolinux/bcbio installation anymore. In the meanwhile, we need that, at least for users who are running bcbio for the first time. They just need a smooth pipeline run, i.e. to call, annotate and prioritize variants, not going into details about little issues of gnomad 2.0.1 reprocessed by ensembl.
One of the duplicated variants has PASS in the filter tag and the other has not. That gave me an idea to prioritize PASS variants rather than first variants when selecting unique ones.

The processing: chr name conversion - sort - PASS - decompose - normalize - uniq works:

bcftools stats for chr10
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  0
SN  0   number of records:  538598
SN  0   number of no-ALTs:  537
SN  0   number of SNPs: 503850
SN  0   number of MNPs: 59
SN  0   number of indels:   33917
SN  0   number of others:   235
SN  0   number of multiallelic sites:   0
SN  0   number of multiallelic SNP sites:   0

Duplicated variant in chr10 is gone:

tabix gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.sorted.chr10.processed.vcf.gz chr10:50095648-50095648
chr10   50095648    rs201672329 A   G   4.59591e+06 PASS    AC=6331;AF=0.0401138;AN=157826;BaseQRankSum=-0.494;ClippingRankSum=0.024;DB;DP=4164344;FS=0.536;InbreedingCoeff=0.3791;MQ=26.03;MQRankSum=0;QD=14.26;ReadPosRankSum=0.267;SOR=0.634;VQSLOD=1.79;VQSR_culprit=MQ;GQ_HIST_ALT=41|165|136|275|231|113|194|163|96|131|118|105|137|122|86|130|146|82|106|3409;DP_HIST_ALT=236|828|885|775|583|415|320|232|208|205|181|197|162|143|100|73|66|46|30|24;AB_HIST_ALT=0|0|10|24|70|142|183|337|613|589|785|410|253|105|36|45|44|47|0|0;GQ_HIST_ALL=3343|6713|4663|10424|9906|4535|7763|6741|2764|4784|3731|1595|4284|781|2125|1264|2833|378|2571|25713;DP_HIST_ALL=7947|16027|16329|13266|10848|8104|6717|6691|5335|4890|3734|2685|1589|919|501|309|191|122|86|62;AB_HIST_ALL=0|0|10|24|70|142|183|337|613|589|784|410|254|105|36|45|44|47|0|0;AC_AFR=894;AC_AMR=537;AC_ASJ=312;AC_EAS=304;AC_FIN=786;AC_NFE=2850;AC_OTH=133;AC_SAS=515;AC_Male=3397;AC_Female=2934;AN_AFR=9284;AN_AMR=23204;AN_ASJ=7354;AN_EAS=12128;AN_FIN=15082;AN_NFE=66842;AN_OTH=3646;AN_SAS=20286;AN_Male=85740;AN_Female=72086;AF_AFR=0.0962947;AF_AMR=0.0231426;AF_ASJ=0.0424259;AF_EAS=0.025066;AF_FIN=0.0521151;AF_NFE=0.0426379;AF_OTH=0.0364783;AF_SAS=0.025387;AF_Male=0.0396198;AF_Female=0.0407014;GC_AFR=3952,486,204;GC_AMR=11178,311,113;GC_ASJ=3436,170,71;GC_EAS=5818,188,58;GC_FIN=6979,338,224;GC_NFE=31239,1514,668;GC_OTH=1716,81,26;GC_SAS=9757,257,129;GC_Male=40287,1769,814;GC_Female=33788,1576,679;AC_raw=8279;AN_raw=213822;AF_raw=0.0387191;GC_raw=100925,3693,2293;GC=74075,3345,1493;Hom_AFR=204;Hom_AMR=113;Hom_ASJ=71;Hom_EAS=58;Hom_FIN=224;Hom_NFE=668;Hom_OTH=26;Hom_SAS=129;Hom_Male=814;Hom_Female=679;Hom_raw=2293;Hom=1493;POPMAX=AFR;AC_POPMAX=894;AN_POPMAX=9284;AF_POPMAX=0.0962947;DP_MEDIAN=22;DREF_MEDIAN=6.30957e-33;GQ_MEDIAN=99;AB_MEDIAN=0.481481;AS_RF=0.865029;AS_FilterStatus=PASS

I've updated the GGD for gnomad_exome in hg38 according to this test. Please let me know if you have any other suggestions. https://github.com/chapmanb/cloudbiolinux/pull/287

Sergey

matthdsm commented 5 years ago

Hi Sergey,

Nice job getting into the nitty gritty of this issue! Ensembl plans to release their version of GnomAD 2.1 somewhere in March, for release 96 (http://www.ensembl.info/2019/02/07/whats-coming-in-ensembl-96-ensembl-genomes-43/#comment-94690). It's too bad Broad doesn't offer a lifted version of GnomAD themselves, otherwise this issue would be quickly solved.

I do have some further questions:

Did you test v2.0.1 files from Broad for duplicated variants? If not, could you? This way we'd be able to determine if the issue is present in the source, or some artifact from the liftover and processing Ensembl has done. Once we know this, we can tackle issues like this at the source, instead of trying to patch it up with postprocessing.
After processing using your updated method, do you have any stats on how many variants are filtered out? According to your stats, we lose about 44K records on chr10 alone.
When annotating using the newly formatted vcf, did you use the default bcbio vcfanno config? Or did you restore the original one (e.g. `ops=["self"] on all fields)

And then some questions for everyone:

Since Ensembl tends to lag a bit on updated bigger sets such as GnomAD, does anyone thinks it'd be viable to do the liftover ourselves? I have no experience lifting data whatsoever, so no idea what it entails or what some common pitfalls are (hence the deffering to Ensembl). I suppose it'd be easy enough to filter out the unneeded VEP annotations to create a more compact file, but I'm not sure about the rest of the process

Thanks again for al the work, nice job.

Cheers M

matthdsm commented 5 years ago

Update bcftools stats for GnomAD 2.0.1, chr10

$ bcftools stats https://storage.googleapis.com/gnomad-public/release/2.0.1/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz -r 10  
...
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  0
SN  0   number of records:  584309
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 555360
SN  0   number of MNPs: 0
SN  0   number of indels:   40395
SN  0   number of others:   0
SN  0   number of multiallelic sites:   66269
SN  0   number of multiallelic SNP sites:   52835
...

And when checking the same region on chr10, there's no output.

 $tabix https://storage.googleapis.com/gnomad-public/release/2.0.1/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz 10:50095648-50095648 
<EMPTY OUTPUT>

Which kind of makes sense, because this is 38 build region we're querying against a build 37 file... Also, we have to keep in mind GnomAD doens't use the chr prefix, so that's something to take into account when querying the original files.

BUT, when doing the same with lifted coordinates to 37, I get one record

$ tabix https://storage.googleapis.com/gnomad-public/release/2.0.1/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz 10:51855408-51855408                                               
10  51855408    .   A   G   7379.20 RF;AC0  AC=0;AF=0.00000e+00;AN=105416;BaseQRankSum=0.00000e+00;ClippingRankSum=1.96000e-01;DP=2665618;FS=7.43000e+01;InbreedingCoeff=2.09000e-02;MQ=2.39500e+01;MQRankSum=0.00000e+00;QD=6.76000e+00;ReadPosRankSum=3.58000e-01;SOR=4.73300e+00;VQSLOD=-2.21900e+02;VQSR_culprit=FS;GQ_HIST_ALT=2|44|9|12|2|2|0|4|6|4|0|0|4|1|0|1|0|0|0|0;DP_HIST_ALT=65|22|0|0|0|1|0|2|1|0|0|0|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;GQ_HIST_ALL=5197|7001|4213|8918|7852|3162|5118|4039|1689|2849|2482|1151|3003|499|1655|845|2133|299|2102|15845;DP_HIST_ALL=10460|13434|12411|7664|6367|5948|6206|6267|3998|2901|2040|1316|633|275|85|29|9|4|1|1;AB_HIST_ALL=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=5090;AN_AMR=12922;AN_ASJ=4264;AN_EAS=6672;AN_FIN=11754;AN_NFE=48900;AN_OTH=2440;AN_SAS=13374;AN_Male=57722;AN_Female=47694;AF_AFR=0.00000e+00;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=0.00000e+00;AF_FIN=0.00000e+00;AF_NFE=0.00000e+00;AF_OTH=0.00000e+00;AF_SAS=0.00000e+00;AF_Male=0.00000e+00;AF_Female=0.00000e+00;GC_AFR=2545,0,0;GC_AMR=6461,0,0;GC_ASJ=2132,0,0;GC_EAS=3336,0,0;GC_FIN=5877,0,0;GC_NFE=24450,0,0;GC_OTH=1220,0,0;GC_SAS=6687,0,0;GC_Male=28861,0,0;GC_Female=23847,0,0;AC_raw=147;AN_raw=160104;AF_raw=9.18153e-04;GC_raw=79961,35,56;GC=52708,0,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=56;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=3;DREF_MEDIAN=6.34858e-07;GQ_MEDIAN=9;AB_MEDIAN=5.00000e-01;AS_RF=7.66183e-02;AS_FilterStatus=RF|AC0;CSQ=G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000282633|protein_coding|15/31||ENST00000282633.5:c.1290A>G|ENST00000282633.5:c.1290A>G(p.%3D)|1335|1290|430|P|ccA/ccG||1||1||SNV|1|HGNC|23416|YES|||CCDS41527.1|ENSP00000282633|Q641Q2|Q6P0Q7&Q5SNT8&B4E255|UPI000044FEAB||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000314664|protein_coding|15/29||ENST00000314664.7:c.1290A>G|ENST00000314664.7:c.1290A>G(p.%3D)|1408|1290|430|P|ccA/ccG||1||1||SNV|1|HGNC|23416|||||ENSP00000314417||Q6P0Q7&E7ESD2|UPI00020658FD||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000351071|protein_coding|15/30||ENST00000351071.6:c.1290A>G|ENST00000351071.6:c.1290A>G(p.%3D)|1408|1290|430|P|ccA/ccG||1||1||SNV|1|HGNC|23416|||||ENSP00000344037|Q641Q2|Q9Y4N4&Q6P0Q7|UPI00001C1ED7||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000399339|protein_coding|13/29||ENST00000399339.2:c.1026A>G|ENST00000399339.2:c.1026A>G(p.%3D)|1026|1026|342|P|ccA/ccG||1||1||SNV|1|HGNC|23416|||||ENSP00000382277||Q6P0Q7&Q5SNT8&F8W7U3&B4E255|UPI0001AE6DA1||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|3_prime_UTR_variant&NMD_transcript_variant|MODIFIER|FAM21A|ENSG00000099290|Transcript|ENST00000434114|nonsense_mediated_decay|13/28||ENST00000434114.2:c.*656A>G||1059||||||1||1||SNV|1|HGNC|23416|||||ENSP00000403781||Q5T1D7|UPI0002B8336F||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|FAM21A|ENSG00000099290|Transcript|ENST00000492914|processed_transcript|||||||||||1|4143|1||SNV|1|HGNC|23416||||||||||||||||||||||||||||||||||||||||||

Doing the same on the Ensembl file (b37) yields the same record (minus the VEP annotations)

tabix http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz 10:51855408-51855408                                  [8:47:46]
10  51855408    .   A   G   7379.20 RF;AC0  AC=0;AF=0.00000e+00;AN=105416;BaseQRankSum=0.00000e+00;ClippingRankSum=1.96000e-01;DP=2665618;FS=7.43000e+01;InbreedingCoeff=2.09000e-02;MQ=2.39500e+01;MQRankSum=0.00000e+00;QD=6.76000e+00;ReadPosRankSum=3.58000e-01;SOR=4.73300e+00;VQSLOD=-2.21900e+02;VQSR_culprit=FS;GQ_HIST_ALT=2|44|9|12|2|2|0|4|6|4|0|0|4|1|0|1|0|0|0|0;DP_HIST_ALT=65|22|0|0|0|1|0|2|1|0|0|0|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;GQ_HIST_ALL=5197|7001|4213|8918|7852|3162|5118|4039|1689|2849|2482|1151|3003|499|1655|845|2133|299|2102|15845;DP_HIST_ALL=10460|13434|12411|7664|6367|5948|6206|6267|3998|2901|2040|1316|633|275|85|29|9|4|1|1;AB_HIST_ALL=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=5090;AN_AMR=12922;AN_ASJ=4264;AN_EAS=6672;AN_FIN=11754;AN_NFE=48900;AN_OTH=2440;AN_SAS=13374;AN_Male=57722;AN_Female=47694;AF_AFR=0.00000e+00;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=0.00000e+00;AF_FIN=0.00000e+00;AF_NFE=0.00000e+00;AF_OTH=0.00000e+00;AF_SAS=0.00000e+00;AF_Male=0.00000e+00;AF_Female=0.00000e+00;GC_AFR=2545,0,0;GC_AMR=6461,0,0;GC_ASJ=2132,0,0;GC_EAS=3336,0,0;GC_FIN=5877,0,0;GC_NFE=24450,0,0;GC_OTH=1220,0,0;GC_SAS=6687,0,0;GC_Male=28861,0,0;GC_Female=23847,0,0;AC_raw=147;AN_raw=160104;AF_raw=9.18153e-04;GC_raw=79961,35,56;GC=52708,0,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=56;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=3;DREF_MEDIAN=6.34858e-07;GQ_MEDIAN=9;AB_MEDIAN=5.00000e-01;AS_RF=7.66183e-02;AS_FilterStatus=RF|AC0

Keeping all this in mind, I'd say the duplicated lines are artifacts from the liftover. Does anyone have any comments or insights on this?

chapmanb commented 5 years ago

Matthias and Sergey; Wow, thank you for this in-depth analysis. I'm agreed that the duplicates are likely liftover artifacts and like Sergey's practical way to avoid this. I'm not entirely sure 2.1 will avoid this since it will also be lifted over to 38, but I keep hearing that a native 38 build for gnomad is on the way so that will definitely avoid it. Fingers crossed that will be here soon.

matthdsm commented 5 years ago

A hg38 native GnomAD would be very nice indeed. In the mean time, I'm trying to lift the vcf's from broad myself using different chain files and executables. I'll let you know if something comes up.

Cheers M

naumenko-sa commented 5 years ago

Hi Matthias and Brad!

Thanks for the discussion - it is indeed an interesting story illustrating the importance of grch38 for variant calling.

Two sites in grch37 correspond to one site in grch38.

rs201672329 https://gnomad.broadinstitute.org/variant/10-47911549-A-G GRCh37: 10:47,911,549 (contig AL954360.3, gene FAM21B) → GRCh38: 10:50,095,648 (contig AL442003.8, gene WASHC2A)

NO_RS_ID GRCh37: 10:51,855,408 (contig AL442003.8, gene FAM21A) → GRCh38: 10:50,095,648 (contig AL442003.8, gene WASHC2A (=FAM21A=FAM21B))

In GRCh38 the gene is defined better: WASHC2A, two previous sequences (candidate genes FAM21A, FAM21B) were merged.

NCBI remap (grch38>grch37) https://www.ncbi.nlm.nih.gov/genome/tools/remap/ shows both sites in grch37 (and even the 3rd one on the PATCH).

If we thought of liftover as liftover_function(old_assembly, new_assembly), this example would become sooner an enhancement of the new assembly, rather than an artifact of the liftover function per se. It would be hard to blame liftover tools here. Of course, there might be another types of issues.

The story provides one more confirmation of grch38's superiority for variant calling. Misassembly in grch37 may lead to FP variants like NO_RS_ID. Also this story confirms the great work of GNOMAD team resulted in filtering out variants like NO_RS_ID.

Regarding the liftover improvement. I worked a bit with whole genome alignments (even for much smaller genomes), there might be a lot of technical difficulties even to apply liftover programs to large genomes. I see it as TCP/IP stack: there is assembly team, liftover team, gnomad team, Ensembl team, cloudbiolinux level, bcbio level, user level. I'm just happy that teams on the fundamental levels are doing such a great job. (Of course, not discouraging Matthias from doing a better liftover).

Regarding @matthdsm's questions:

We lost 44k records on chr10, 7%. here is the distribution of top FILTER values in the original file:

Filter	Count	Percent
PASS	474632	0.81
RF	52906	0.09
RF;AC0	38161	0.06
AC0	7283	0.01
SEGDUP	3018	0.005
LCR	2947	0.005
RF;LCR	1214	0.002

Very approximately (the number of records decreases by PASS filtration and increases after decomposition), -19% of records did not pass filters, and then decomposition gave +12% of variants.

I think it is fine to filter out these variants, I'd think I'd be in trouble from the clinical side if I included any of them in any type of analyses.

Our testing of grch38 is quite limited: currently, we have just one project with grch38 in the lab, so we are not using it much. But we want to be ready for switching. Currently, we are using vcfanno configs with 'first' for gnomad_exome, but I agree that 'self' might be better for advanced users to see the potential issues.

The main conclusion for me from this whole story is that gnomad frequencies in grch38, despite some minor issues, are usable.

Sergey

matthdsm commented 5 years ago

Hi Sergey,

Thanks for the very thorough analysis of the issue. Personally, I doubt I'd be able to do a "better" liftover than the team at Ensembl. The only thing better IMO would be a hg38 native version straight from Broad. I'll throw some CPU time at the annotation with different vcfanno options and I'll let you know if the self options work with the filtered gnomad file.

Thanks again M

FedericoComoglio commented 5 years ago

Hi everyone,

I'd like to re-open this issue, as we are running into the same error when building a gemini database with the latest bcbio-devel. The gnomad_exome.vcf.gz file contains duplicated entries

tabix gnomad_exome.vcf.gz chr1:146138417-146138417 | cut -f1-7
chr1    146138417       rs4068059       G       T       33908.6 PASS
chr1    146138417       rs4068059       G       A       33908.6 PASS
chr1    146138417       rs782126330     G       A       4058.4  PASS

The process aborts with

sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 157 - probably unsupported type.
[SQL: INSERT INTO variants [...]

How should we go about this? Thank you!

Federico

roryk commented 5 years ago

Thanks, this has been fixed, see here for an explanation: https://github.com/bcbio/bcbio-nextgen/issues/2961#issuecomment-541210508

FedericoComoglio commented 5 years ago

Thanks @roryk, I will immediately upgrade.

FedericoComoglio commented 5 years ago

Upgrading to latest devel (-u development --data) does not seem to download any new dataset. Shall I try to remove gnomad* files from /path/to/data/genomes/Hsapiens/hg38/variation/ prior to that?

FedericoComoglio commented 5 years ago

Hi @roryk and @naumenko-sa,

while trying the fix the issue above, I removed gnomad* annotations from /path/to/data/genomes/Hsapiens/hg38/variation/. However, upgrading with -u development --data does not pull any gnomad file anymore. Do you know why this is the case, and how could we possibly address it?

Thanks a lot. Federico

matthdsm commented 5 years ago

Hi Frederico

You need to remove the gnomad reference in the ‘versions.csv’ file to trigger a download.

Cheers M

Op 16 okt. 2019 om 16:04 heeft Federico Comoglio notifications@github.com<mailto:notifications@github.com> het volgende geschreven:

Hi @rorykhttps://github.com/roryk and @naumenko-sahttps://github.com/naumenko-sa,

while trying the fix the issue above, I removed gnomad* annotations from /path/to/data/genomes/Hsapiens/hg38/variation/. However, upgrading with -u development --data does not pull any gnomad file anymore. Do you know why this is the case, and how could we possibly address it?

Thanks a lot. Federico

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bcbio/bcbio-nextgen/issues/2567?email_source=notifications&email_token=AC2NHECMO2Q3F57GEERU5WLQO4NOTA5CNFSM4GCDDMIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBMTLOY#issuecomment-542717371, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AC2NHEHS6UPKTONN3S2ZJBLQO4NOTANCNFSM4GCDDMIA.

FedericoComoglio commented 5 years ago

Thanks @matthdsm. Downloading now.

FedericoComoglio commented 5 years ago

I'd like to follow up on this issue. I pulled the latest gnomad VCF and upgraded to bcbio v1.1.7 but the issue persists.

FedericoComoglio commented 5 years ago

Hi everyone, we weren't able to fix the issue in the meantime. We'd appreciate any pointer that could help us here. Thank you!

naumenko-sa commented 4 years ago

@FedericoComoglio sorry for not getting back to you, it looks we've made some progress on that issue here: https://github.com/bcbio/bcbio-nextgen/issues/3087