Closed nvteja closed 5 years ago
Teja; Thanks for the report and sorry about the issue. I'll have to bring in @brentp to help with this one. I see the problem, but not sure about the best solution. This is the failing VCF line:
1 152327436 . TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC T 1538.1 PASS AC=1;AF=0.167;AN=6;BaseQRankSum=0.428;ClippingRankSum=0;DP=2838;ExcessHet=3.0103;FS=5.425;MLEAC=1;MLEAF=0.167;MQ=58.61;MQ0=0;MQRankSum=-5.847;QD=7.81;ReadPosRankSum=-0.91;SOR=1.273;CSQ=-|inframe_deletion|MODERATE|FLG2|ENSG00000143520|Transcript|ENST00000388718|protein_coding|3/3||ENST00000388718.5:c.2595_2825del|ENSP00000373370.4:p.Ser869_Thr945del|2668-2898/9124|2595-2825/7176|865-942/2391|SGQTSGFGQHRSSSGQYSGFGQHGSGSGQSSGFGQHGTGSGQYSGFGQHESRSHQSSYGQHGSGSSQSSGYGQHGSSS/S|tcGGGACAGACATCTGGCTTTGGACAACACAGGTCAAGCTCAGGTCAATACTCTGGCTTTGGACAACATGGATCAGGCTCAGGTCAGTCCAGTGGCTTTGGACAACATGGGACTGGCTCAGGACAATACTCTGGTTTTGGACAACATGAGTCTAGATCACATCAGTCTAGCTATGGCCAACATGGTTCTGGCTCAAGTCAGTCATCTGGCTATGGTCAACATGGGTCAAGTTCa/tca||1||-1||deletion|HGNC|33276|YES|||CCDS30861.1|ENSP00000373370|Q5D862||UPI00004E1DE5||Ensembl|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|||||hmmpanther:PTHR22571&hmmpanther:PTHR22571:SF24&Low_complexity_(Seg):seg&Low_complexity_(Seg):seg&Low_complexity_(Seg):seg||||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|ENSG00000237975|Transcript|ENST00000392688|antisense||6/6|ENST00000392688.2:n.1407-8994_1407-8764del||||||||1||1||deletion|HGNC|27913|YES|||||||||Ensembl|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|ENSG00000237975|Transcript|ENST00000445097|antisense||2/2|ENST00000445097.1:n.152-8994_152-8764del||||||||1||1||deletion|HGNC|27913||||||||||Ensembl|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||,-|inframe_deletion|MODERATE|FLG2|388698|Transcript|NM_001014342.2|protein_coding|3/3||NM_001014342.2:c.2595_2825del|NP_001014364.1:p.Ser869_Thr945del|2668-2898/9122|2595-2825/7176|865-942/2391|SGQTSGFGQHRSSSGQYSGFGQHGSGSGQSSGFGQHGTGSGQYSGFGQHESRSHQSSYGQHGSGSSQSSGYGQHGSSS/S|tcGGGACAGACATCTGGCTTTGGACAACACAGGTCAAGCTCAGGTCAATACTCTGGCTTTGGACAACATGGATCAGGCTCAGGTCAGTCCAGTGGCTTTGGACAACATGGGACTGGCTCAGGACAATACTCTGGTTTTGGACAACATGAGTCTAGATCACATCAGTCTAGCTATGGCCAACATGGTTCTGGCTCAAGTCAGTCATCTGGCTATGGTCAACATGGGTCAAGTTCa/tca||1||-1||deletion|EntrezGene|33276|YES||||NP_001014364.1|||||RefSeq|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|OK||||||||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|339400|Transcript|NR_103778.1|misc_RNA||6/6|NR_103778.1:n.1407-8994_1407-8764del||||||||1||1||deletion|EntrezGene|27913|YES|||||||||RefSeq|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||,-|intron_variant&non_coding_transcript_variant|MODIFIER|FLG-AS1|339400|Transcript|NR_103779.1|misc_RNA||2/2|NR_103779.1:n.152-8994_152-8764del||||||||1||1||deletion|EntrezGene|27913||||||||||RefSeq|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC|GAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC||||||32|||||||||||||||||||||||||||||||;gnomad_ac_es=2;gnomad_hom_es=0;gnomad_af_es=0.00011256;gnomad_an_es=245514,225528;gnomad_ac_gs=78;gnomad_af_gs=0.016779;gnomad_an_gs=30504;gnomad_ac=80;gnomad_hom=0;gnomad_af_popmax=0.016779;gnomad_an=30504;gnomad_af=0.0026226 GT:AD:DP:GQ:PL 0/1:34,163:197:99:1578,0,668 0/0:539,108:647:99:0,690,21261 0/0:1370,6:1376:99:0,4080,57462
and it looks like the issue is that gnomad_an_es
is a tuple of two values, rather than an expected single value (245514, 225528)
. Brent, is there a way to handle this case in vcf2db, or should we be trying to work around with our vcfanno recipe? Thanks for any thoughts/suggestions about how best to tackle.
was vcfanno run on this before it was decomposed? and/or what were the parameters used to run vcfanno? I don't see how this would be getting annotated with 2 values.
Brent and Teja; I did some GitHub searching and realized this is probably coming from this custom vcfanno script:
https://github.com/naumenko-sa/cre/blob/8fe4efc63b0ce2ad0825f547daa25bf9c5b69f81/cre.vcfanno.conf#L4
My suggestion would be to swap that to use first
rather than self
for the ops
which should only annotate with a single count for this case (where gnomad apparently has multiple alleles). Would that be a reasonable workaround?
Thank you for the reply @chapmanb and @brentp
Based on the command log, it looks like the vcf was decomposed before vcfanno. I am attaching the command log and vcfanno conf here (I shortened file paths in the commands for readability).
bcbio-nextgen-commands.log cre.vcfanno.lua.txt
Thank you, Teja
Oh Sorry @chapmanb , just saw your response. I will try your suggestion and will let you know if it works. Also, tagging my colleague @naumenko-sa
hi @nvteja , based on your logs, I would add sed -e 's/Number=A/Number=1/g'
to change your VCF before running vcfanno (and after decomposing).
Thank you all!
The source of the problem seems to be that in the gnomad Exome vcf bcbio/gemini_data/gnomad.exomes.r2.0.1.sites.no-VEP.nohist.tidy.vcf.gz there are two identical (chr-pos-alt-ref) variants at 1:152327436 (the first and the third). And they have different AN values in the INFO tags. That is why vcfanno returns a tuple. Probably it is gnomad.exomes bug - because we don't expect duplicates of the same allele after decomposition and normalization?
1 152327436 . TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC T 4.02656e+07 PASS AC=8125;AF=0.0330938;AN=245514;BaseQRankSum=1.9;ClippingRankSum=0.048;DP=25137064;FS=0;InbreedingCoeff=0.7984;MQ=56.61;MQRankSum=1.25;QD=20.29;ReadPosRankSum=0.985;SOR=0.704;VQSLOD=2.18;VQSR_culprit=FS;AC_AFR=12;AC_AMR=2398;AC_ASJ=404;AC_EAS=2473;AC_FIN=124;AC_NFE=1160;AC_OTH=198;AC_SAS=1356;AC_Male=4221;AC_Female=3904;AN_AFR=15292;AN_AMR=33356;AN_ASJ=9826;AN_EAS=17098;AN_FIN=22292;AN_NFE=111504;AN_OTH=5466;AN_SAS=30680;AN_Male=134540;AN_Female=110974;AF_AFR=0.000784724;AF_AMR=0.0718911;AF_ASJ=0.0411154;AF_EAS=0.144637;AF_FIN=0.00556253;AF_NFE=0.0104032;AF_OTH=0.0362239;AF_SAS=0.0441982;AF_Male=0.0313736;AF_Female=0.0351794;AC_raw=8495;AN_raw=246272;AF_raw=0.0344944;Hom_AFR=4;Hom_AMR=1015;Hom_ASJ=188;Hom_EAS=1042;Hom_FIN=59;Hom_NFE=485;Hom_OTH=82;Hom_SAS=584;Hom_Male=1804;Hom_Female=1655;Hom_raw=3524;Hom=3459;POPMAX=EAS;AC_POPMAX=2473;AN_POPMAX=17098;AF_POPMAX=0.144637;DP_MEDIAN=190;DREF_MEDIAN=0;GQ_MEDIAN=99;AB_MEDIAN=0.851619;AS_RF=0.934857;AS_FilterStatus=PASS;OLD_MULTIALLELIC=1:152327436:TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC/T/CGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC
1 152327436 . T C 4.02656e+07 PASS AC=0;AF=0;AN=245514;BaseQRankSum=1.9;ClippingRankSum=0.048;DP=25137064;FS=0;InbreedingCoeff=0.7984;MQ=56.61;MQRankSum=1.25;QD=20.29;ReadPosRankSum=0.985;SOR=0.704;VQSLOD=2.18;VQSR_culprit=FS;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=15292;AN_AMR=33356;AN_ASJ=9826;AN_EAS=17098;AN_FIN=22292;AN_NFE=111504;AN_OTH=5466;AN_SAS=30680;AN_Male=134540;AN_Female=110974;AF_AFR=0;AF_AMR=0;AF_ASJ=0;AF_EAS=0;AF_FIN=0;AF_NFE=0;AF_OTH=0;AF_SAS=0;AF_Male=0;AF_Female=0;AC_raw=5;AN_raw=246272;AF_raw=2.03028e-05;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=0;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=210;DREF_MEDIAN=1.5849e-40;GQ_MEDIAN=99;AB_MEDIAN=0.108696;AS_RF=0.164612;AS_FilterStatus=AC0;OLD_MULTIALLELIC=1:152327436:TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC/T/CGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC;OLD_VARIANT=1:152327436:TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC/CGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC
1 152327436 rs76514540 TGAACTTGACCCATGTTGACCATAGCCAGATGACTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCC T 1.45778e+08 PASS AC=2;AF=8.86808e-06;AN=225528;BaseQRankSum=-2.754;ClippingRankSum=0.003;DP=32448557;FS=0;InbreedingCoeff=0.1146;MQ=58.42;MQRankSum=-0.275;QD=10.94;ReadPosRankSum=-0.384;SOR=0.721;VQSLOD=4.72;VQSR_culprit=FS;AC_AFR=0;AC_AMR=0;AC_ASJ=1;AC_EAS=0;AC_FIN=0;AC_NFE=1;AC_OTH=0;AC_SAS=0;AC_Male=1;AC_Female=1;AN_AFR=15028;AN_AMR=29174;AN_ASJ=8884;AN_EAS=14190;AN_FIN=21414;AN_NFE=104486;AN_OTH=4962;AN_SAS=27390;AN_Male=123716;AN_Female=101812;AF_AFR=0;AF_AMR=0;AF_ASJ=0.000112562;AF_EAS=0;AF_FIN=0;AF_NFE=9.57066e-06;AF_OTH=0;AF_SAS=0;AF_Male=8.08303e-06;AF_Female=9.82202e-06;AC_raw=2;AN_raw=244696;AF_raw=8.17341e-06;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=0;Hom=0;STAR_AC=8070;STAR_AC_raw=8437;STAR_Hom=3456;POPMAX=ASJ;AC_POPMAX=1;AN_POPMAX=8884;AF_POPMAX=0.000112562;DP_MEDIAN=71;DREF_MEDIAN=2.50581e-05;GQ_MEDIAN=71;AB_MEDIAN=0.500379;AS_RF=0.248196;AS_FilterStatus=PASS;OLD_MULTIALLELIC=1:152327469:CTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/TTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/ATGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/T;OLD_VARIANT=1:152327469:CTGACTTGAGCCAGAACCATGTTGGCCATAGCTAGACTGATGTGATCTAGACTCATGTTGTCCAAAACCAGAGTATTGTCCTGAGCCAGTCCCATGTTGTCCAAAGCCACTGGACTGACCTGAGCCTGATCCATGTTGTCCAAAGCCAGAGTATTGACCTGAGCTTGACCTGTGTTGTCCAAAGCCAGATGTCTGTCCCGAACTTGACCCATGTTGACCATAGCCAGATGAT/T
Using first instead of self in the vcfanno config solves the problem, thanks!
Sergey
Sergey, thanks for figuring that out! It was stumping me how it could happen as I only noticed the first 2 variants in gnomad and didn't see that the long deletion was repeated.
Hi Brent, Brad
I'm running into a similar issue. Is it possible to provide some kind of default to catch this error? In attachment you can find the "problem" data. My guess is it's also coming from the gnomad dataset.
I'm getting this error:
/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:255: SAWarning: Unicode type received non-unicode bind param value ''. (this warning may be suppressed after 10 occurrences)
(util.ellipses_string(value),),
/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:255: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
(util.ellipses_string(value),),
Traceback (most recent call last):
File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 924, in <module>
impacts_extras=a.impacts_field, aok=a.a_ok)
File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 234, in __init__
self.load()
File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 319, in load
i = self._load(self.cache, create=True, start=1)
File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 312, in _load
self.insert(variants, expanded, keys, i, create=create)
File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 374, in insert
vilengths, variant_impacts)
File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 402, in _insert
self.__insert(v_objs, self.metadata.tables['variants'].insert())
File "/home/galaxy/bcbio/anaconda/bin/vcf2db.py", line 444, in __insert
trans.execute(stmt, o)
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 980, in execute
return meth(self, multiparams, params)
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 273, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1099, in _execute_clauseelement
distilled_params,
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1240, in _execute_context
e, statement, parameters, cursor, context
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1458, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 296, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
cursor, statement, parameters, context
File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 536, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 108 - probably unsupported type. [SQL: u'INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, ac, af, an, baseqranksum, clippingranksum, db, decomposed, dp, ds, excesshet, fs, mleac, mleaf, mq, mq0, mqranksum, qd, readposranksum, sor, ac_eog, ac_adj_exac_afr, ac_adj_exac_amr, ac_adj_exac_eas, ac_adj_exac_fin, ac_adj_exac_nfe, ac_adj_exac_oth, ac_adj_exac_sas, ac_exac_all, af_adj_exac_afr, af_adj_exac_amr, af_adj_exac_eas, af_adj_exac_fin, af_adj_exac_nfe, af_adj_exac_oth, af_adj_exac_sas, af_esp_aa, af_esp_all, af_esp_ea, af_exac_all, an_eog, an_adj_exac_afr, an_adj_exac_amr, an_adj_exac_eas, an_adj_exac_fin, an_adj_exac_nfe, an_adj_exac_oth, an_adj_exac_sas, an_exac_all, common_pathogenic, gc_het_alt_eog, gc_hom_alt_eog, gc_hom_ref_eog, gnomad_ac, gnomad_ac_amr, gnomad_ac_asj, gnomad_ac_eas, gnomad_ac_fin, gnomad_ac_nfe, gnomad_ac_oth, gnomad_ac_popmax, gnomad_ac_sas, gnomad_af, gnomad_af_afr, gnomad_af_amr, gnomad_af_asj, gnomad_af_eas, gnomad_af_fin, gnomad_af_nfe, gnomad_af_oth, gnomad_af_popmax, gnomad_af_sas, gnomad_an, gnomad_an_amr, gnomad_an_asj, gnomad_an_eas, gnomad_an_fin, gnomad_an_nfe, gnomad_an_oth, gnomad_an_popmax, gnomad_an_sas, gnomad_gc, gnomad_gc_female, gnomad_gc_male, gnomad_hom, gnomad_hom_female, gnomad_hom_male, gnomad_popmax, gnomad_exomes_af, gnomad_genomes_af, max_aaf_all, num_exac_het, num_exac_hom, rs_ids, allele, feature_type, intron, hgvsc, hgvsp, cdna_position, cds_position, existing_variation, allele_num, distance, strand, flags, variant_class, symbol_source, hgnc_id, tsl, appris, ccds, ensp, swissprot, trembl, uniparc, refseq_match, source, given_ref, used_ref, bam_edit, gene_pheno, domains, hgvs_offset, afr_af, amr_af, eas_af, eur_af, sas_af, aa_af, ea_af, gnomad_afr_af, gnomad_amr_af, gnomad_asj_af, gnomad_eas_af, gnomad_fin_af, gnomad_nfe_af, gnomad_oth_af, gnomad_sas_af, max_af, max_af_pops, clin_sig, somatic, pheno, pubmed, motif_name, motif_pos, high_inf_pos, motif_score_change, lof, lof_filter, lof_flags, lof_info, maxentscan_alt, maxentscan_diff, maxentscan_ref, spliceregion, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (1, u'chr10', 50099997, 50099998, u'rs61856727', u'A', u'G', 862.7999877929688, None, u'snp', 'ts', 1.0, 0, 1, 0, 0, 0.5, u'WASHC2A', None, u'ENST00000282633', 1, 1, 0, 0, 1, u'17/31', u'aaA/aaG', u'K', u'523/1341', u'protein_coding', u'synonymous_variant', u'synonymous_variant', 'LOW', u'', None, u'', None, 1, u'', 2, -1.9780000448226929, 0.0, 1, 0, 93, 0, 3.0102999210357666, 5.3460001945495605, 1, 0.5, 43.959999084472656, 0, 0.4519999921321869, 9.279999732971191, 1.2170000076293945, 1.3240000009536743, u'A:1111&G:719', 2398.0, 3311.0, 1240.0, 2809.0, 22960.0, 312.0, 4281.0, 37311.0, 0.2791000008583069, 0.29339998960494995, 0.1525000035762787, 0.43689998984336853, 0.367000013589859, 0.3652999997138977, 0.27239999175071716, -1.0, -1.0, -1.0, 0.32850000262260437, 1830.0, 8592.0, 11286.0, 8130.0, 6430.0, 62556.0, 854.0, 15716.0, 113564.0, 0, 523, 98, 294, 20896, 3098, 452, 949, 4318, 9166, 504, 4318, 1344, u'0.3182', 0.19850000739097595, 0.3197999894618988, 0.3368000090122223, 0.12380000203847885, 0.4163999855518341, 0.3831000030040741, 0.329800009727478, 0.4163999855518341, 0.2320999950170517, (187800, 65674), (22212, 9688), (7034, 1342), (11932, 7664), (19040, 10370), (88556, 23926), (4012, 1528), 10370, (23084, 5790), 126737.0, 58367.0, 68370.0, 855, 449, 406, u'FIN', -1.0, -1.0, 0.43689998984336853, 31999.0, 2656.0, u'rs61856727', u'G', u'Transcript', u'', u'ENST00000282633.9:c.1569A>G', u'ENSP00000282633.5:p.Lys523%3D', u'1614/4272', u'1569/4026', u'rs61856727', u'1', u'', u'1', u'', u'SNV', u'HGNC', u'HGNC:23416', u'1', u'P3', u'CCDS41527.1', u'ENSP00000282633', u'Q641Q2', u'', u'UPI000044FEAB', u'', u'Ensembl', u'A', u'A', u'', u'', u'hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF24', u'', u'', u'', u'', u'', u'', u'', u'', u'0.1985', u'0.3198', u'0.3368', u'0.1238', u'0.4164', u'0.3831', u'0.3298', u'0.2321', u'0.4164', u'gnomAD_FIN', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'DE_NOVO_DONOR_PROB:0.0794222613266201&EXON_END:50100064&MUTANT_DONOR_MES:8.97796800884188&DE_NOVO_DONOR_MES_POS:-71&DE_NOVO_DONOR_MES:-1.85392529798293&INTRON_END:50104041&EXON_START:50099978&INTRON_START:50100065&DE_NOVO_DONOR_POS:-71', u'', u'', u'', u'', <read-only buffer for 0x7f8ae44084e0, size -1, offset 0 at 0x7f8ae37d3570>, <read-only buffer for 0x7f8ae4408450, size -1, offset 0 at 0x7f8ae37d35b0>, <read-only buffer for 0x7f8ae4408540, size -1, offset 0 at 0x7f8ae37d35f0>, <read-only buffer for 0x7f8ae44085d0, size -1, offset 0 at 0x7f8ae37d3630>, <read-only buffer for 0x7f8ae4408630, size -1, offset 0 at 0x7f8ae37d3670>, <read-only buffer for 0x7f8ae44086c0, size -1, offset 0 at 0x7f8ae37d36b0>, <read-only buffer for 0x7f8ae4408720, size -1, offset 0 at 0x7f8ae37d36f0>, <read-only buffer for 0x7f8ae4408780, size -1, offset 0 at 0x7f8ae37d3730>)] (Background on this error at: http://sqlalche.me/e/rvf5)
And here's the data: vcf2db_error.zip
Thanks for the help.
Cheers M
I've identified multiple variants causing issues when generating the db:
rs35465208
rs61856727
rs4076899
rs367813564
rs59195224
rs2072564
rs773222534
rs11669847
rs11669854
rs11669884
rs61744491
rs61737632
rs762214476
rs143680639
Decomposing and normalizing the gnomad dataset doesn't fix the problem. Anyone have any ideas?
Thanks M
the solution is to just use "first". vcfanno tries to do the right thing for so many cases, for this case, where a variant is repeated in gnomad, it's easy enough to use "first"
This issue looks quite similar to the one I hit in issue #2635. Looking at @matthdsm's error, it fails because the following fields have two values instead of one in gnomAD's VCFs:
gnomad_an
, gnomad_an_amr
, gnomad_an_asj
, gnomad_an_eas
, gnomad_an_fin
, gnomad_an_nfe
, gnomad_an_oth
, gnomad_an_popmax
, gnomad_an_sas
The field gnomad_an
is parameter 108
in the SQAlchemy error, and it has the values (187800, 65674)
. In gnomAD's page for this particular variant, we see that 65674
is the alllele number with this variant, while 187800
is probably the total allele number (just a guess).
So, @brentp's suggestion of using first
instead of self
might not be the best option here, as it would give the total allele number out of all gnomAD's exomes and genomes, and not the allele number with our variant of interest.
gnomAD's VCF header says this field should have one value only:
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in call
ed genotypes">
Not sure exactly what the best option here is, just a warning that, while using first
in vcfanno would fix the error, it might give an incorrect result.
-- Paulo
I fully agree with Paulo. Just grabbing the first entry seems a tad dangerous to me. We could be missing out on just that one piece of data we're looking for. If we HAVE to select one, I'd suggest selecting the one with the lowest AF (e.g. the rarest one), so we can at least catch those.
Thoughts?
Thanks M
Hi Brent, Just a question. I'm getting this warning
WARNING: using op 'self' when with Number='1' for 'AC_ACR' from '/data/gent/vo/000/gvo00082/vsc41443/gnomad/variation/tmp/gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.vcf.gz' can result in out-of-order values when the query is multi-allelic
api.go:805: this is not an issue if the query has been decomposed.
Could you explain a bit please? Does this mean that vcfanno will extract the correct data from the "source" file (gnomad in this case) if the "query" (the file you're annotating) is decomposed? Or am I interpreting this wrong?
I just opened a PR #2673, which fixes some errors in the field names. I tried running the pipeline with several configs and found the following:
AN
and GC
fieldsIf you're okay with this, I'll add the modifications to my PR.
Cheers M
yeah. I found that the latest vcfs are already decomposed and normalized.
Hi Brent,
Thanks for chiming in. Could you specify what you mean by latest? For example, bcbio uses v2.0.1, lifted over by Ensembl. Is this one of the versions already preprocessed?
Thanks M
I don't know about 2.0.1, but 2.1 is preprocessed.
Sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 110 - probably unsupported type. [SQL: u'INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, ac, af, an, baseqranksum, clippingranksum, db, decomposed, dp, ds, excesshet, fs, len, lof, mleac, mleaf, mq, mq0, mqranksum, nmd, old_multiallelic, old_variant, qd, readposranksum, sor, ac_adj_exac_afr, ac_adj_exac_amr, ac_adj_exac_eas, ac_adj_exac_fin, ac_adj_exac_nfe, ac_adj_exac_oth, ac_adj_exac_sas, ac_exac_all, af_adj_exac_afr, af_adj_exac_amr, af_adj_exac_eas, af_adj_exac_fin, af_adj_exac_nfe, af_adj_exac_oth, af_adj_exac_sas, af_esp_aa, af_esp_all, af_esp_ea, af_exac_all, an_adj_exac_afr, an_adj_exac_amr, an_adj_exac_eas, an_adj_exac_fin, an_adj_exac_nfe, an_adj_exac_oth, an_adj_exac_sas, an_exac_all, clinvar_disease_name, clinvar_sig, common_pathogenic, gnomad_ac, gnomad_ac_amr, gnomad_ac_asj, gnomad_ac_eas, gnomad_ac_fin, gnomad_ac_nfe, gnomad_ac_oth, gnomad_ac_popmax, gnomad_ac_sas, gnomad_af, gnomad_af_afr, gnomad_af_amr, gnomad_af_asj, gnomad_af_eas, gnomad_af_fin, gnomad_af_nfe, gnomad_af_oth, gnomad_af_popmax, gnomad_af_sas, gnomad_an, gnomad_an_amr, gnomad_an_asj, gnomad_an_eas, gnomad_an_fin, gnomad_an_nfe, gnomad_an_oth, gnomad_an_popmax, gnomad_an_sas, gnomad_gc, gnomad_gc_female, gnomad_gc_male, gnomad_hom, gnomad_hom_female, gnomad_hom_male, gnomad_popmax, max_aaf_all, num_exac_het, num_exac_hom, rs_ids, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (441616, u'chr10', 50095647, 50095648, u'rs1273688830', u'A', u'G', 71.5999984741211, None, 'snp', 'ts', 1.0, 0, 1, 0, 0, 0.5, u'FAM21A', None, u'ENST00000282633.9', 1, 1, 0, 0, 0, u'15/31', u'c.1290A>G', u'p.Pro430Pro', 1341, u'protein_coding', u'synonymous_variant', u'synonymous_variant', 'LOW', None, None, None, None, 1, 0.5, 2, -0.6740000247955322, 0.0, 1, 0, 4, 0, 3.0102999210357666, 0.0, None, 'None', 1, 0.5, 57.65999984741211, 0, -0.3190000057220459, 'None', None, 'None', 17.899999618530273, 0.0, 0.9160000085830688, 408.0, 99.0, 120.0, 153.0, 1206.0, 10.0, 231.0, 2227.0, 0.11620000004768372, 0.02759999968111515, 0.03550000116229057, 0.08110000193119049, 0.05570000037550926, 0.03359999880194664, 0.03060000017285347, 0.06895402818918228, 0.04439403861761093, 0.032608695328235626, 0.053199999034404755, 3512.0, 3592.0, 3382.0, 1886.0, 21650.0, 298.0, 7554.0, 41874.0, None, None, 0, 6331, 537, 312, 304, 786, 2850, 133, 894, 515, 0.04010000079870224, 0.09629999846220016, 0.023099999874830246, 0.042399998754262924, 0.025100000202655792, 0.05209999904036522, 0.04259999841451645, 0.0364999994635582, 0.09629999846220016, 0.02539999969303608, (105416, 157826), (12922, 23204), (4264, 7354), (6672, 12128), (11754, 15082), (48900, 66842), (2440, 3646), 9284, (13374, 20286), 131621.0, 59890.0, 71731.0, 1493, 679, 814, u'AFR', 0.11620000004768372, 1479.0, 374.0, u'rs1273688830', <read-only buffer for 0x7fd82e3a91e0, size -1, offset 0 at 0x7fd8280e68b0>, <read-only buffer for 0x7fd82e3a91b0, size -1, offset 0 at 0x7fd8280e6870>, <read-only buffer for 0x7fd82e3a9210, size -1, offset 0 at 0x7fd8280e6830>, <read-only buffer for 0x7fd82e3a9240, size -1, offset 0 at 0x7fd8280e67f0>, <read-only buffer for 0x7fd82e3a9270, size -1, offset 0 at 0x7fd8280e67b0>, <read-only buffer for 0x7fd82e3a92a0, size -1, offset 0 at 0x7fd8280e6770>, <read-only buffer for 0x7fd82e3a92d0, size -1, offset 0 at 0x7fd8280e6670>, <read-only buffer for 0x7fd82e3a9300, size -1, offset 0 at 0x7fd8280e6730>)] (Background on this error at: http://sqlalche.me/e/rvf5)
I'm still seeing these errors. This one looks like gnomAD_AF_SAS
which is a tuple of values and not a single value.
The vcfanno config looks like it is set for self:
[[annotation]]
file="/n/app/bcbio/dev/genomes/Hsapiens/hg38/variation/gnomad_exome.vcf.gz"
fields=["AF","AF_AFR","AF_AMR","AF_ASJ","AF_EAS","AF_FIN","AF_NFE","AF_OTH","AF_SAS","AF_POPMAX","AC","AC_ACR","AC_AMR","AC_ASJ","AC_EAS","AC_FIN","AC_NFE","AC_OTH","AC_SAS","AC_POPMAX","AN","AN_ANR","AN_AMR","AN_ASJ","AN_EAS","AN_FIN","AN_NFE","AN_OTH","AN_SAS","AN_POPMAX","POPMAX","GC","GC_Male","GC_Female","Hom","Hom_Male","Hom_Female"]
names=["gnomAD_AF","gnomAD_AF_AFR","gnomAD_AF_AMR","gnomAD_AF_ASJ","gnomAD_AF_EAS","gnomAD_AF_FIN","gnomAD_AF_NFE","gnomAD_AF_OTH","gnomAD_AF_SAS","gnomAD_AF_POPMAX","gnomAD_AC","gnomAD_AC_ACR","gnomAD_AC_AMR","gnomAD_AC_ASJ","gnomAD_AC_EAS","gnomAD_AC_FIN","gnomAD_AC_NFE","gnomAD_AC_OTH","gnomAD_AC_SAS","gnomAD_AC_POPMAX","gnomAD_AN","gnomAD_AN_ANR","gnomAD_AN_AMR","gnomAD_AN_ASJ","gnomAD_AN_EAS","gnomAD_AN_FIN","gnomAD_AN_NFE","gnomAD_AN_OTH","gnomAD_AN_SAS","gnomAD_AN_POPMAX","gnomAD_POPMAX","gnomAD_GC","gnomAD_GC_Male","gnomAD_GC_Female","gnomAD_Hom","gnomAD_Hom_Male","gnomAD_Hom_Female"]
ops=["self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","self","sum","sum","sum","self","self","self"]
Could someone point me in the right direction for further debugging? Thanks!
Hi @roryk!
Probably, it is because gnomad_exome.vcf.gz has two identical records at chr10:50095648
tabix gnomad_exome.vcf.gz chr10:50095647-50095648 | cut -f 1-5
chr10 50095648 . A G
chr10 50095648 rs201672329 A G
I've added vt uniq to the ggd recipe of gnomad_exome.vcf.gz in hg38. https://github.com/chapmanb/cloudbiolinux/pull/286. Please let me know if that would fix the problem for you, I'll then fix other vcf recipes.
Sergey
Hi @naumenko-sa , @roryk
I'm not really a fan of this... seems like an even worse solution than using the "first" op in vcfanno. When using uniq
you're essentially removing variants based on the order they appear. When annotating this will cause us to miss all of those special cases..
Wouldn't it make more sense to awk
away all records with an empty ID
field? I've noticed all problematic variants have one record with an rsID and one without.
M
@roryk
As for you debug question, I don't know what you're looking for exactly, but I'd suggest changing the ops
for your problematic field to min
, in compliance with the AN_*
fields. That way, you're sure you're annotating with the rarest variant.
M
Matthias, Sergey and Rory; Apologies, I'd just merged Sergey's fix ahead of your comments. We can revert that and update the vcfanno configuration either if that makes more sense. I don't have a lot of experience with these special cases in practice so happy to defer to whatever the consensus is.
Hi Brad, I'm happy to with the consensus, but outright removing data seems a tad harsh to me. This also removes to possibility for users to tweak the vcfanno config to whatever use case they need. I'd rather keep the data as complete as possible and filter unneeded data, than deleting it.
Thanks M
@matthdsm you recently mentioned using the raw Ensembl gnomAD VCFs (GRCh37 and hg38) for annotating directly with vcfanno. Would this issue also show up when using those files?
Yes, the issue remains the same. I only mentioned that because when using the raw ensembl file, we'd be able to cut down on processing time during the installation. I didn't notice any differences between the "raw" and "processed" gnomad exome file. M
Hello everyone!
Thanks for the discussion, it helped me to learn a lot more about the issue!
# SN [2]id [3]key [4]value
SN 0 number of samples: 0
SN 0 number of records: 665745
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 619617
SN 0 number of MNPs: 0
SN 0 number of indels: 46128
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
There is no issue with duplicated chr10 variant in it:
tabix gnomad.exomes.r2.1.sites.chr10.vcf.bgz chr10:50095648-50095648
<EMPTY OUTPUT>
Ensembl's gnomad_exome is v2.0.1, it is not sorted, not decomposed, and not normalized, and it has an issue with duplicated variants (bcftools stats for chr10):
# SN [2]id [3]key [4]value
SN 0 number of samples: 0
SN 0 number of records: 583100
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 554086
SN 0 number of MNPs: 66
SN 0 number of indels: 40054
SN 0 number of others: 338
SN 0 number of multiallelic sites: 66250
SN 0 number of multiallelic SNP sites: 52822
Issue with duplicated variant:
tabix gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.sorted.chr10.vcf.gz chr10:50095648-50095648
chr10 50095648 . A G 7379.2 RF;AC0 AC=0;AF=0;AN=105416;BaseQRankSum=0;ClippingRankSum=0.196;DP=2665618;FS=74.3;InbreedingCoeff=0.0209;MQ=23.95;MQRankSum=0;QD=6.76;ReadPosRankSum=0.358;SOR=4.733;VQSLOD=-221.9;VQSR_culprit=FS;GQ_HIST_ALT=2|44|9|12|2|2|0|4|6|4|0|0|4|1|0|1|0|0|0|0;DP_HIST_ALT=65|22|0|0|0|1|0|2|1|0|0|0|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;GQ_HIST_ALL=5197|7001|4213|8918|7852|3162|5118|4039|1689|2849|2482|1151|3003|499|1655|845|2133|299|2102|15845;DP_HIST_ALL=10460|13434|12411|7664|6367|5948|6206|6267|3998|2901|2040|1316|633|275|85|29|9|4|1|1;AB_HIST_ALL=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=5090;AN_AMR=12922;AN_ASJ=4264;AN_EAS=6672;AN_FIN=11754;AN_NFE=48900;AN_OTH=2440;AN_SAS=13374;AN_Male=57722;AN_Female=47694;AF_AFR=0;AF_AMR=0;AF_ASJ=0;AF_EAS=0;AF_FIN=0;AF_NFE=0;AF_OTH=0;AF_SAS=0;AF_Male=0;AF_Female=0;GC_AFR=2545,0,0;GC_AMR=6461,0,0;GC_ASJ=2132,0,0;GC_EAS=3336,0,0;GC_FIN=5877,0,0;GC_NFE=24450,0,0;GC_OTH=1220,0,0;GC_SAS=6687,0,0;GC_Male=28861,0,0;GC_Female=23847,0,0;AC_raw=147;AN_raw=160104;AF_raw=0.000918153;GC_raw=79961,35,56;GC=52708,0,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=56;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=3;DREF_MEDIAN=6.34858e-07;GQ_MEDIAN=9;AB_MEDIAN=0.5;AS_RF=0.0766183;AS_FilterStatus=RF|AC0
chr10 50095648 rs201672329 A G 4.59591e+06 PASS AC=6331;AF=0.0401138;AN=157826;BaseQRankSum=-0.494;ClippingRankSum=0.024;DB;DP=4164344;FS=0.536;InbreedingCoeff=0.3791;MQ=26.03;MQRankSum=0;QD=14.26;ReadPosRankSum=0.267;SOR=0.634;VQSLOD=1.79;VQSR_culprit=MQ;GQ_HIST_ALT=41|165|136|275|231|113|194|163|96|131|118|105|137|122|86|130|146|82|106|3409;DP_HIST_ALT=236|828|885|775|583|415|320|232|208|205|181|197|162|143|100|73|66|46|30|24;AB_HIST_ALT=0|0|10|24|70|142|183|337|613|589|785|410|253|105|36|45|44|47|0|0;GQ_HIST_ALL=3343|6713|4663|10424|9906|4535|7763|6741|2764|4784|3731|1595|4284|781|2125|1264|2833|378|2571|25713;DP_HIST_ALL=7947|16027|16329|13266|10848|8104|6717|6691|5335|4890|3734|2685|1589|919|501|309|191|122|86|62;AB_HIST_ALL=0|0|10|24|70|142|183|337|613|589|784|410|254|105|36|45|44|47|0|0;AC_AFR=894;AC_AMR=537;AC_ASJ=312;AC_EAS=304;AC_FIN=786;AC_NFE=2850;AC_OTH=133;AC_SAS=515;AC_Male=3397;AC_Female=2934;AN_AFR=9284;AN_AMR=23204;AN_ASJ=7354;AN_EAS=12128;AN_FIN=15082;AN_NFE=66842;AN_OTH=3646;AN_SAS=20286;AN_Male=85740;AN_Female=72086;AF_AFR=0.0962947;AF_AMR=0.0231426;AF_ASJ=0.0424259;AF_EAS=0.025066;AF_FIN=0.0521151;AF_NFE=0.0426379;AF_OTH=0.0364783;AF_SAS=0.025387;AF_Male=0.0396198;AF_Female=0.0407014;GC_AFR=3952,486,204;GC_AMR=11178,311,113;GC_ASJ=3436,170,71;GC_EAS=5818,188,58;GC_FIN=6979,338,224;GC_NFE=31239,1514,668;GC_OTH=1716,81,26;GC_SAS=9757,257,129;GC_Male=40287,1769,814;GC_Female=33788,1576,679;AC_raw=8279;AN_raw=213822;AF_raw=0.0387191;GC_raw=100925,3693,2293;GC=74075,3345,1493;Hom_AFR=204;Hom_AMR=113;Hom_ASJ=71;Hom_EAS=58;Hom_FIN=224;Hom_NFE=668;Hom_OTH=26;Hom_SAS=129;Hom_Male=814;Hom_Female=679;Hom_raw=2293;Hom=1493;POPMAX=AFR;AC_POPMAX=894;AN_POPMAX=9284;AF_POPMAX=0.0962947;DP_MEDIAN=22;DREF_MEDIAN=6.30957e-33;GQ_MEDIAN=99;AB_MEDIAN=0.481481;AS_RF=0.865029;AS_FilterStatus=PASS
Hopefully, Ensembl will soon sync their vcf with gnomad2.1 and we will not need processing the vcf in cloudbiolinux/bcbio installation anymore. In the meanwhile, we need that, at least for users who are running bcbio for the first time. They just need a smooth pipeline run, i.e. to call, annotate and prioritize variants, not going into details about little issues of gnomad 2.0.1 reprocessed by ensembl.
One of the duplicated variants has PASS in the filter tag and the other has not. That gave me an idea to prioritize PASS variants rather than first variants when selecting unique ones.
The processing: chr name conversion - sort - PASS - decompose - normalize - uniq works:
bcftools stats for chr10
# SN [2]id [3]key [4]value
SN 0 number of samples: 0
SN 0 number of records: 538598
SN 0 number of no-ALTs: 537
SN 0 number of SNPs: 503850
SN 0 number of MNPs: 59
SN 0 number of indels: 33917
SN 0 number of others: 235
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
Duplicated variant in chr10 is gone:
tabix gnomad.exomes.r2.0.1.sites.GRCh38.noVEP.sorted.chr10.processed.vcf.gz chr10:50095648-50095648
chr10 50095648 rs201672329 A G 4.59591e+06 PASS AC=6331;AF=0.0401138;AN=157826;BaseQRankSum=-0.494;ClippingRankSum=0.024;DB;DP=4164344;FS=0.536;InbreedingCoeff=0.3791;MQ=26.03;MQRankSum=0;QD=14.26;ReadPosRankSum=0.267;SOR=0.634;VQSLOD=1.79;VQSR_culprit=MQ;GQ_HIST_ALT=41|165|136|275|231|113|194|163|96|131|118|105|137|122|86|130|146|82|106|3409;DP_HIST_ALT=236|828|885|775|583|415|320|232|208|205|181|197|162|143|100|73|66|46|30|24;AB_HIST_ALT=0|0|10|24|70|142|183|337|613|589|785|410|253|105|36|45|44|47|0|0;GQ_HIST_ALL=3343|6713|4663|10424|9906|4535|7763|6741|2764|4784|3731|1595|4284|781|2125|1264|2833|378|2571|25713;DP_HIST_ALL=7947|16027|16329|13266|10848|8104|6717|6691|5335|4890|3734|2685|1589|919|501|309|191|122|86|62;AB_HIST_ALL=0|0|10|24|70|142|183|337|613|589|784|410|254|105|36|45|44|47|0|0;AC_AFR=894;AC_AMR=537;AC_ASJ=312;AC_EAS=304;AC_FIN=786;AC_NFE=2850;AC_OTH=133;AC_SAS=515;AC_Male=3397;AC_Female=2934;AN_AFR=9284;AN_AMR=23204;AN_ASJ=7354;AN_EAS=12128;AN_FIN=15082;AN_NFE=66842;AN_OTH=3646;AN_SAS=20286;AN_Male=85740;AN_Female=72086;AF_AFR=0.0962947;AF_AMR=0.0231426;AF_ASJ=0.0424259;AF_EAS=0.025066;AF_FIN=0.0521151;AF_NFE=0.0426379;AF_OTH=0.0364783;AF_SAS=0.025387;AF_Male=0.0396198;AF_Female=0.0407014;GC_AFR=3952,486,204;GC_AMR=11178,311,113;GC_ASJ=3436,170,71;GC_EAS=5818,188,58;GC_FIN=6979,338,224;GC_NFE=31239,1514,668;GC_OTH=1716,81,26;GC_SAS=9757,257,129;GC_Male=40287,1769,814;GC_Female=33788,1576,679;AC_raw=8279;AN_raw=213822;AF_raw=0.0387191;GC_raw=100925,3693,2293;GC=74075,3345,1493;Hom_AFR=204;Hom_AMR=113;Hom_ASJ=71;Hom_EAS=58;Hom_FIN=224;Hom_NFE=668;Hom_OTH=26;Hom_SAS=129;Hom_Male=814;Hom_Female=679;Hom_raw=2293;Hom=1493;POPMAX=AFR;AC_POPMAX=894;AN_POPMAX=9284;AF_POPMAX=0.0962947;DP_MEDIAN=22;DREF_MEDIAN=6.30957e-33;GQ_MEDIAN=99;AB_MEDIAN=0.481481;AS_RF=0.865029;AS_FilterStatus=PASS
I've updated the GGD for gnomad_exome in hg38 according to this test. Please let me know if you have any other suggestions. https://github.com/chapmanb/cloudbiolinux/pull/287
Sergey
Hi Sergey,
Nice job getting into the nitty gritty of this issue! Ensembl plans to release their version of GnomAD 2.1 somewhere in March, for release 96 (http://www.ensembl.info/2019/02/07/whats-coming-in-ensembl-96-ensembl-genomes-43/#comment-94690). It's too bad Broad doesn't offer a lifted version of GnomAD themselves, otherwise this issue would be quickly solved.
I do have some further questions:
Did you test v2.0.1 files from Broad for duplicated variants? If not, could you? This way we'd be able to determine if the issue is present in the source, or some artifact from the liftover and processing Ensembl has done. Once we know this, we can tackle issues like this at the source, instead of trying to patch it up with postprocessing.
After processing using your updated method, do you have any stats on how many variants are filtered out? According to your stats, we lose about 44K records on chr10 alone.
When annotating using the newly formatted vcf, did you use the default bcbio vcfanno config? Or did you restore the original one (e.g. `ops=["self"] on all fields)
And then some questions for everyone:
Thanks again for al the work, nice job.
Cheers M
Update
bcftools stats
for GnomAD 2.0.1, chr10
$ bcftools stats https://storage.googleapis.com/gnomad-public/release/2.0.1/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz -r 10
...
# SN [2]id [3]key [4]value
SN 0 number of samples: 0
SN 0 number of records: 584309
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 555360
SN 0 number of MNPs: 0
SN 0 number of indels: 40395
SN 0 number of others: 0
SN 0 number of multiallelic sites: 66269
SN 0 number of multiallelic SNP sites: 52835
...
And when checking the same region on chr10, there's no output.
$tabix https://storage.googleapis.com/gnomad-public/release/2.0.1/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz 10:50095648-50095648
<EMPTY OUTPUT>
Which kind of makes sense, because this is 38 build region we're querying against a build 37 file...
Also, we have to keep in mind GnomAD doens't use the chr
prefix, so that's something to take into account when querying the original files.
BUT, when doing the same with lifted coordinates to 37, I get one record
$ tabix https://storage.googleapis.com/gnomad-public/release/2.0.1/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz 10:51855408-51855408
10 51855408 . A G 7379.20 RF;AC0 AC=0;AF=0.00000e+00;AN=105416;BaseQRankSum=0.00000e+00;ClippingRankSum=1.96000e-01;DP=2665618;FS=7.43000e+01;InbreedingCoeff=2.09000e-02;MQ=2.39500e+01;MQRankSum=0.00000e+00;QD=6.76000e+00;ReadPosRankSum=3.58000e-01;SOR=4.73300e+00;VQSLOD=-2.21900e+02;VQSR_culprit=FS;GQ_HIST_ALT=2|44|9|12|2|2|0|4|6|4|0|0|4|1|0|1|0|0|0|0;DP_HIST_ALT=65|22|0|0|0|1|0|2|1|0|0|0|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;GQ_HIST_ALL=5197|7001|4213|8918|7852|3162|5118|4039|1689|2849|2482|1151|3003|499|1655|845|2133|299|2102|15845;DP_HIST_ALL=10460|13434|12411|7664|6367|5948|6206|6267|3998|2901|2040|1316|633|275|85|29|9|4|1|1;AB_HIST_ALL=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=5090;AN_AMR=12922;AN_ASJ=4264;AN_EAS=6672;AN_FIN=11754;AN_NFE=48900;AN_OTH=2440;AN_SAS=13374;AN_Male=57722;AN_Female=47694;AF_AFR=0.00000e+00;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=0.00000e+00;AF_FIN=0.00000e+00;AF_NFE=0.00000e+00;AF_OTH=0.00000e+00;AF_SAS=0.00000e+00;AF_Male=0.00000e+00;AF_Female=0.00000e+00;GC_AFR=2545,0,0;GC_AMR=6461,0,0;GC_ASJ=2132,0,0;GC_EAS=3336,0,0;GC_FIN=5877,0,0;GC_NFE=24450,0,0;GC_OTH=1220,0,0;GC_SAS=6687,0,0;GC_Male=28861,0,0;GC_Female=23847,0,0;AC_raw=147;AN_raw=160104;AF_raw=9.18153e-04;GC_raw=79961,35,56;GC=52708,0,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=56;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=3;DREF_MEDIAN=6.34858e-07;GQ_MEDIAN=9;AB_MEDIAN=5.00000e-01;AS_RF=7.66183e-02;AS_FilterStatus=RF|AC0;CSQ=G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000282633|protein_coding|15/31||ENST00000282633.5:c.1290A>G|ENST00000282633.5:c.1290A>G(p.%3D)|1335|1290|430|P|ccA/ccG||1||1||SNV|1|HGNC|23416|YES|||CCDS41527.1|ENSP00000282633|Q641Q2|Q6P0Q7&Q5SNT8&B4E255|UPI000044FEAB||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000314664|protein_coding|15/29||ENST00000314664.7:c.1290A>G|ENST00000314664.7:c.1290A>G(p.%3D)|1408|1290|430|P|ccA/ccG||1||1||SNV|1|HGNC|23416|||||ENSP00000314417||Q6P0Q7&E7ESD2|UPI00020658FD||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000351071|protein_coding|15/30||ENST00000351071.6:c.1290A>G|ENST00000351071.6:c.1290A>G(p.%3D)|1408|1290|430|P|ccA/ccG||1||1||SNV|1|HGNC|23416|||||ENSP00000344037|Q641Q2|Q9Y4N4&Q6P0Q7|UPI00001C1ED7||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|synonymous_variant|LOW|FAM21A|ENSG00000099290|Transcript|ENST00000399339|protein_coding|13/29||ENST00000399339.2:c.1026A>G|ENST00000399339.2:c.1026A>G(p.%3D)|1026|1026|342|P|ccA/ccG||1||1||SNV|1|HGNC|23416|||||ENSP00000382277||Q6P0Q7&Q5SNT8&F8W7U3&B4E255|UPI0001AE6DA1||||hmmpanther:PTHR21669&hmmpanther:PTHR21669:SF4||||||||||||||||||||||||||||||,G|3_prime_UTR_variant&NMD_transcript_variant|MODIFIER|FAM21A|ENSG00000099290|Transcript|ENST00000434114|nonsense_mediated_decay|13/28||ENST00000434114.2:c.*656A>G||1059||||||1||1||SNV|1|HGNC|23416|||||ENSP00000403781||Q5T1D7|UPI0002B8336F||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|FAM21A|ENSG00000099290|Transcript|ENST00000492914|processed_transcript|||||||||||1|4143|1||SNV|1|HGNC|23416||||||||||||||||||||||||||||||||||||||||||
Doing the same on the Ensembl file (b37) yields the same record (minus the VEP annotations)
tabix http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh37/variation_genotype/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz 10:51855408-51855408 [8:47:46]
10 51855408 . A G 7379.20 RF;AC0 AC=0;AF=0.00000e+00;AN=105416;BaseQRankSum=0.00000e+00;ClippingRankSum=1.96000e-01;DP=2665618;FS=7.43000e+01;InbreedingCoeff=2.09000e-02;MQ=2.39500e+01;MQRankSum=0.00000e+00;QD=6.76000e+00;ReadPosRankSum=3.58000e-01;SOR=4.73300e+00;VQSLOD=-2.21900e+02;VQSR_culprit=FS;GQ_HIST_ALT=2|44|9|12|2|2|0|4|6|4|0|0|4|1|0|1|0|0|0|0;DP_HIST_ALT=65|22|0|0|0|1|0|2|1|0|0|0|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;GQ_HIST_ALL=5197|7001|4213|8918|7852|3162|5118|4039|1689|2849|2482|1151|3003|499|1655|845|2133|299|2102|15845;DP_HIST_ALL=10460|13434|12411|7664|6367|5948|6206|6267|3998|2901|2040|1316|633|275|85|29|9|4|1|1;AB_HIST_ALL=0|0|3|1|1|1|0|1|4|0|9|0|5|6|0|2|1|1|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=0;AC_OTH=0;AC_SAS=0;AC_Male=0;AC_Female=0;AN_AFR=5090;AN_AMR=12922;AN_ASJ=4264;AN_EAS=6672;AN_FIN=11754;AN_NFE=48900;AN_OTH=2440;AN_SAS=13374;AN_Male=57722;AN_Female=47694;AF_AFR=0.00000e+00;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=0.00000e+00;AF_FIN=0.00000e+00;AF_NFE=0.00000e+00;AF_OTH=0.00000e+00;AF_SAS=0.00000e+00;AF_Male=0.00000e+00;AF_Female=0.00000e+00;GC_AFR=2545,0,0;GC_AMR=6461,0,0;GC_ASJ=2132,0,0;GC_EAS=3336,0,0;GC_FIN=5877,0,0;GC_NFE=24450,0,0;GC_OTH=1220,0,0;GC_SAS=6687,0,0;GC_Male=28861,0,0;GC_Female=23847,0,0;AC_raw=147;AN_raw=160104;AF_raw=9.18153e-04;GC_raw=79961,35,56;GC=52708,0,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=56;Hom=0;POPMAX=.;AC_POPMAX=.;AN_POPMAX=.;AF_POPMAX=.;DP_MEDIAN=3;DREF_MEDIAN=6.34858e-07;GQ_MEDIAN=9;AB_MEDIAN=5.00000e-01;AS_RF=7.66183e-02;AS_FilterStatus=RF|AC0
Keeping all this in mind, I'd say the duplicated lines are artifacts from the liftover. Does anyone have any comments or insights on this?
M
Matthias and Sergey; Wow, thank you for this in-depth analysis. I'm agreed that the duplicates are likely liftover artifacts and like Sergey's practical way to avoid this. I'm not entirely sure 2.1 will avoid this since it will also be lifted over to 38, but I keep hearing that a native 38 build for gnomad is on the way so that will definitely avoid it. Fingers crossed that will be here soon.
A hg38 native GnomAD would be very nice indeed. In the mean time, I'm trying to lift the vcf's from broad myself using different chain files and executables. I'll let you know if something comes up.
Cheers M
Hi Matthias and Brad!
Thanks for the discussion - it is indeed an interesting story illustrating the importance of grch38 for variant calling.
Two sites in grch37 correspond to one site in grch38.
rs201672329 https://gnomad.broadinstitute.org/variant/10-47911549-A-G GRCh37: 10:47,911,549 (contig AL954360.3, gene FAM21B) → GRCh38: 10:50,095,648 (contig AL442003.8, gene WASHC2A)
NO_RS_ID GRCh37: 10:51,855,408 (contig AL442003.8, gene FAM21A) → GRCh38: 10:50,095,648 (contig AL442003.8, gene WASHC2A (=FAM21A=FAM21B))
In GRCh38 the gene is defined better: WASHC2A, two previous sequences (candidate genes FAM21A, FAM21B) were merged.
NCBI remap (grch38>grch37) https://www.ncbi.nlm.nih.gov/genome/tools/remap/ shows both sites in grch37 (and even the 3rd one on the PATCH).
If we thought of liftover as liftover_function(old_assembly, new_assembly), this example would become sooner an enhancement of the new assembly, rather than an artifact of the liftover function per se. It would be hard to blame liftover tools here. Of course, there might be another types of issues.
The story provides one more confirmation of grch38's superiority for variant calling. Misassembly in grch37 may lead to FP variants like NO_RS_ID. Also this story confirms the great work of GNOMAD team resulted in filtering out variants like NO_RS_ID.
Regarding the liftover improvement. I worked a bit with whole genome alignments (even for much smaller genomes), there might be a lot of technical difficulties even to apply liftover programs to large genomes. I see it as TCP/IP stack: there is assembly team, liftover team, gnomad team, Ensembl team, cloudbiolinux level, bcbio level, user level. I'm just happy that teams on the fundamental levels are doing such a great job. (Of course, not discouraging Matthias from doing a better liftover).
Regarding @matthdsm's questions:
Filter | Count | Percent |
---|---|---|
PASS | 474632 | 0.81 |
RF | 52906 | 0.09 |
RF;AC0 | 38161 | 0.06 |
AC0 | 7283 | 0.01 |
SEGDUP | 3018 | 0.005 |
LCR | 2947 | 0.005 |
RF;LCR | 1214 | 0.002 |
Very approximately (the number of records decreases by PASS filtration and increases after decomposition), -19% of records did not pass filters, and then decomposition gave +12% of variants.
I think it is fine to filter out these variants, I'd think I'd be in trouble from the clinical side if I included any of them in any type of analyses.
The main conclusion for me from this whole story is that gnomad frequencies in grch38, despite some minor issues, are usable.
Sergey
Hi Sergey,
Thanks for the very thorough analysis of the issue. Personally, I doubt I'd be able to do a "better" liftover than the team at Ensembl. The only thing better IMO would be a hg38 native version straight from Broad.
I'll throw some CPU time at the annotation with different vcfanno options and I'll let you know if the self
options work with the filtered gnomad file.
Thanks again M
Hi everyone,
I'd like to re-open this issue, as we are running into the same error when building a gemini
database with the latest bcbio-devel
. The gnomad_exome.vcf.gz
file contains duplicated entries
tabix gnomad_exome.vcf.gz chr1:146138417-146138417 | cut -f1-7
chr1 146138417 rs4068059 G T 33908.6 PASS
chr1 146138417 rs4068059 G A 33908.6 PASS
chr1 146138417 rs782126330 G A 4058.4 PASS
The process aborts with
sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 157 - probably unsupported type.
[SQL: INSERT INTO variants [...]
How should we go about this? Thank you!
Federico
Thanks, this has been fixed, see here for an explanation: https://github.com/bcbio/bcbio-nextgen/issues/2961#issuecomment-541210508
Thanks @roryk, I will immediately upgrade.
Upgrading to latest devel (-u development --data
) does not seem to download any new dataset. Shall I try to remove gnomad*
files from /path/to/data/genomes/Hsapiens/hg38/variation/
prior to that?
Hi @roryk and @naumenko-sa,
while trying the fix the issue above, I removed gnomad*
annotations from /path/to/data/genomes/Hsapiens/hg38/variation/
. However, upgrading with -u development --data
does not pull any gnomad
file anymore. Do you know why this is the case, and how could we possibly address it?
Thanks a lot. Federico
Hi Frederico
You need to remove the gnomad reference in the ‘versions.csv’ file to trigger a download.
Cheers M
Op 16 okt. 2019 om 16:04 heeft Federico Comoglio notifications@github.com<mailto:notifications@github.com> het volgende geschreven:
Hi @rorykhttps://github.com/roryk and @naumenko-sahttps://github.com/naumenko-sa,
while trying the fix the issue above, I removed gnomad* annotations from /path/to/data/genomes/Hsapiens/hg38/variation/. However, upgrading with -u development --data does not pull any gnomad file anymore. Do you know why this is the case, and how could we possibly address it?
Thanks a lot. Federico
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/bcbio/bcbio-nextgen/issues/2567?email_source=notifications&email_token=AC2NHECMO2Q3F57GEERU5WLQO4NOTA5CNFSM4GCDDMIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBMTLOY#issuecomment-542717371, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AC2NHEHS6UPKTONN3S2ZJBLQO4NOTANCNFSM4GCDDMIA.
Thanks @matthdsm. Downloading now.
I'd like to follow up on this issue. I pulled the latest gnomad
VCF and upgraded to bcbio v1.1.7 but the issue persists.
Hi everyone, we weren't able to fix the issue in the meantime. We'd appreciate any pointer that could help us here. Thank you!
@FedericoComoglio sorry for not getting back to you, it looks we've made some progress on that issue here: https://github.com/bcbio/bcbio-nextgen/issues/3087
Hello,
I was running bcbio v1.1.0 on a set of ~90 samples and I noticed 3 of those samples failed at the same step (vcf2db) with the same exact error. Upon further inspection I noticed that this error is caused by the same variant in all these 3 samples, and removing that variant from these vcf files seems to bypass the error (i.e. vcf2db finished successfully). I am attaching the error log, and a slice of the vcf file including the variant here (line 171 in the attached vcf file).
Command used:
vcf2db.py C3_4-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz C3_4-gatk-haplotype-nomultiallelic.ped C3_4-gatk-haplotype.db
BCBIO LOG: bcbio-nextgen.log
CONFIG FILE: C3_4.yaml.txt
VCF FILE: (variant causing the error is on line 171) C3_4-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.txt
PED FILE: C3_4-gatk-haplotype-nomultiallelic.ped.txt
Could you please help me with this error?
Thank you, Teja.