Closed komalsrathi closed 5 years ago
@kgururaj Can you take a look at this one.
These kind of errors are typically seen when the field description in the VCF header is incorrect. For example, describing the field length to be a fixed integer when the field is really variable length.
I would recommend closely scanning the VCF header for inconsistencies first.
Is there some sort of less manual shortcut, like trying to convert to BCF?
Hi,
Not sure if it is applicable but I just ran vcf-validator on the input gvcf file (also not sure if you can run vcf-validator on a gvcf):
vcf-validator Sample_1__CDL-164-04P-gatk-haplotype-annotated-rnaedit-annotated-gemini.vcf.gz
Here are the first few lines from the logs:
The header tag 'reference' not present. (Not required but highly recommended.)
INFO field at 1:14599 .. INFO tag [max_aaf_all=0.2096] expected different number of values (expected 2, found 1)
INFO field at 1:14604 .. INFO tag [max_aaf_all=0.2096] expected different number of values (expected 2, found 1)
INFO field at 1:14930 .. INFO tag [max_aaf_all=0.5231] expected different number of values (expected 2, found 1)
INFO field at 1:15211 .. INFO tag [max_aaf_all=0.7316] expected different number of values (expected 2, found 1)
INFO field at 1:15274 .. INFO tag [max_aaf_all=0.7205] expected different number of values (expected 3, found 1)
INFO field at 1:16949 .. INFO tag [max_aaf_all=0.0227] expected different number of values (expected 2, found 1)
INFO field at 1:17365 .. INFO tag [max_aaf_all=0.3841] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_afr=
0.1603] expected different number of values (expected 2, found 1),INFO tag [af_exac_all=0.2553] expected different number of values (expect
ed 2, found 1),INFO tag [af_adj_exac_nfe=0.2715] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_sas=0.2883
] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_oth=0.2581] expected different number of values (expected
2, found 1),INFO tag [af_adj_exac_eas=0.3841] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_amr=0.221] e
xpected different number of values (expected 2, found 1),INFO tag [af_adj_exac_fin=0.2245] expected different number of values (expected 2,
found 1)
INFO field at 1:17373 .. INFO tag [af_adj_exac_amr=0.028] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_fin=0.0342] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_eas=0.019324] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_sas=0.019802] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_oth=0.0625] expected different number of values (expected 2, found 1),INFO tag [af_exac_all=0.0425] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_nfe=0.0479] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_afr=0.049] expected different number of values (expected 2, found 1),INFO tag [max_aaf_all=0.0625] expected different number of values (expected 2, found 1)
I grepped the first and last positions from the above gist:
# pos 1:14599
1 14599 . T A,<NON_REF> 546.8 . BaseQRankSum=0.226;ClippingRankSum=0.000;DP=63;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQ0=0;MQRankSum=0.000;RAW_MQ=226800.00;ReadPosRankSum=2.821;rs_ids=rs707680,rs531646671;af_1kg_amr=0.1758;af_1kg_eas=0.0893;af_1kg_sas=0.2096;af_1kg_afr=0.121;af_1kg_eur=0.161;af_1kg_all=0.1476;fitcons=0.263;encode_consensus_gm12878=R;encode_consensus_h1hesc=R;encode_consensus_helas3=CTCF;encode_consensus_hepg2=R;encode_consensus_huvec=R;encode_consensus_k562=T;dgv=CopyNumber;hapmap1=2.9818;hapmap2=0;max_aaf_all=0.2096 GT:AD:DP:GQ:MMQ:PGT:PID:PL:SB 0/1:46,17,0:63:99:60,0:0|1:14574_A_G:575,0,1880,714,1932,2645:46,0,17,0
# pos 1:17373
1 17373 . A G,<NON_REF> 2.4 . BaseQRankSum=-1.763;ClippingRankSum=0.000;DP=21;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQ0=0;MQRankSum=0.000;RAW_MQ=75600.00;ReadPosRankSum=0.371;ac_exac_all=261;an_exac_all=6138;ac_adj_exac_afr=51;an_adj_exac_afr=1040;ac_adj_exac_amr=8;an_adj_exac_amr=286;ac_adj_exac_eas=8;an_adj_exac_eas=414;ac_adj_exac_fin=16;an_adj_exac_fin=468;ac_adj_exac_nfe=166;an_adj_exac_nfe=3462;ac_adj_exac_oth=4;an_adj_exac_oth=64;ac_adj_exac_sas=8;an_adj_exac_sas=404;num_exac_Het=261,.;num_exac_Hom=0,.;rs_ids=rs750111615;fitcons=0.2944;encode_consensus_gm12878=T;encode_consensus_h1hesc=CTCF;encode_consensus_helas3=CTCF;encode_consensus_hepg2=CTCF;encode_consensus_huvec=CTCF;encode_consensus_k562=CTCF;gerp_elements=3.4612e-09;dgv=CopyNumber;hapmap1=2.9818;hapmap2=0;af_exac_all=0.0425;af_adj_exac_afr=0.049;af_adj_exac_amr=0.028;af_adj_exac_eas=0.019324;af_adj_exac_fin=0.0342;af_adj_exac_nfe=0.0479;af_adj_exac_oth=0.0625;af_adj_exac_sas=0.019802;max_aaf_all=0.0625 GT:AD:DP:GQ:MMQ:PL:SB 0/1:19,2,0:21:27:60,0:27,0,1144,84,1150,1234:14,5,2,0
Here is the header of the gvcf file: header.txt
In the header, it says Number=A
for all these fields and because there is an additional non ref allele <NON_REF>
in the ALT
column, it is expecting two values for these fields whereas there is only 1 present.
That is definitely an issue, but I'm not sure if that's causing the crash. Would you take a look at fields similar to the following as well?
ac_adj_exac_amr,Number=1,Type=Float
@ldgauthier I wasn't aware of any - @komalsrathi 's suggestion is a good start.
Komal -- thanks for raising this issue and providing so much detail.
Karthik -- Thanks for the debugging and pointers on this, I hadn't thought from the core dump to be looking at the annotation fields so this is a really helpful lead. What we can do in bcbio is avoid adding any of these until after running the joint calling so they all get on the final joint VCF rather than the gVCFs. This should hopefully avoid needing to dig too deeply into this and we can take our lesson as: don't annotate gVCFs with too much information. Thank you again for helping with diagnosing the underlying issue.
@kgururaj
So I found out that there are other fields that are also throwing a similar error with vcf-validator
:
Below you will find this: INFO tag [an_adj_exac_oth=16,16] expected different number of values (1)
# from the vcf-validator
INFO field at 4:2044128 .. INFO tag [af_exac_all=0] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_fin=0] expected different number of values (expected 3, found 1),INFO tag [an_adj_exac_oth=16,16] expected different number of values (1),INFO tag [an_adj_exac_nfe=202,202] expected different number of values (1),INFO tag [an_adj_exac_afr=20,20] expected different number of values (1),INFO tag [af_adj_exac_amr=0] expected different number of values (expected 3, found 1),INFO tag [an_exac_all=1246,1246] expected different number of values (1),INFO tag [an_adj_exac_amr=10,10] expected different number of values (1),INFO tag [af_adj_exac_oth=0] expected different number of values (expected 3, found 1),INFO tag [an_adj_exac_eas=30,30] expected different number of values (1),INFO tag [an_adj_exac_fin=2,2] expected different number of values (1),INFO tag [an_adj_exac_sas=966,966] expected different number of values (1),INFO tag [af_adj_exac_sas=0] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_afr=0] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_nfe=0] expected different number of values (expected 3, found 1),INFO tag [max_aaf_all=1] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_eas=0] expected different number of values (expected 3, found 1)
In this case, an_adj_exac_oth
has > 1 values and only 1 is allowed:
grep
##INFO=<ID=an_adj_exac_oth,Number=1,Type=Integer,Description="Other Chromosome Count (from /mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/gemini_data/ExAC.r0.3.sites.vep.tidy.vcf.gz)">
# the corresponding variant in the vcf
4 2044128 . C T,CGCT,<NON_REF> 3476.7 . DP=97;ExcessHet=3.0103;MLEAC=2,0,0;MLEAF=1.00,0.00,0.00;MQ0=0;RAW_MQ=349200.00;ac_exac_all=1061;an_exac_all=1246,1246;ac_adj_exac_afr=17;an_adj_exac_afr=20,20;ac_adj_exac_amr=8;an_adj_exac_amr=10,10;ac_adj_exac_eas=25;an_adj_exac_eas=30,30;ac_adj_exac_fin=2;an_adj_exac_fin=2,2;ac_adj_exac_nfe=165;an_adj_exac_nfe=202,202;ac_adj_exac_oth=13;an_adj_exac_oth=16,16;ac_adj_exac_sas=831;an_adj_exac_sas=966,966;num_exac_Het=184,0,.;num_exac_Hom=438,0,.;rs_ids=rs570712,rs764547021;af_1kg_amr=1;af_1kg_eas=1;af_1kg_sas=1;af_1kg_afr=0.9728;af_1kg_eur=1;af_1kg_all=0.9928;fitcons=0.6576;encode_consensus_gm12878=E;encode_consensus_h1hesc=E;encode_consensus_helas3=unknown;encode_consensus_hepg2=E;encode_consensus_huvec=E;encode_consensus_k562=E;rmsk=Simple_repeat_Simple_repeat_(CTG)n;gerp_elements=6.7845e-2;cpg_island;dgv=CopyNumber;tfbs=HA-E2F1,E2F6_(H-50);hapmap1=4.1909;hapmap2=2.1784;stam_mean=136.536;stam_names=8988t,Adult_Th0,CACO2,Chorion,Cll,E_myoblast,Fibrobl,Fibrop,Gliobla,Gm12891,Gm12892,Gm18507,Gm19238,Gm19239,Gm19240,H9es,HESC,HL60,HMEC,HSMM,HSMM_D,HUVEC,Hela,Helas3Ifna4h,HepG2,Hepatocytes,Hpde6e6e7,Htr8,Huh75,Ips,Ishikawa_E,Ishikawa_T,K562,LNCap,LncapAndro,MCF7,Mcf7Hypoxlac,Medullo,Melano,Myometr,NHEK,Osteobl,Panisd,Panislets,Phte,Progfib,RWPE,Stellate,T47d,Urotsa,hTH1;af_exac_all=0;af_adj_exac_afr=0;af_adj_exac_amr=0;af_adj_exac_eas=0;af_adj_exac_fin=0;af_adj_exac_nfe=0;af_adj_exac_oth=0;af_adj_exac_sas=0;max_aaf_all=1 GT:AD:DP:GQ:MMQ:PL:SB 1/1:0,97,0,0:97:99:60,0,0:3514,291,0,3638,292,4365,3623,292,4049,3939:0,0,33,64
Yep, that will definitely cause a crash - variable length fields need a length value stored while fixed length fields don't.
I can add checks for this scenario in GenomicsDB. Please let me know if the following make sense: Header says that the field F is a fixed length field with length = N. In the data section, if
@chapmanb What we can do in bcbio is avoid adding any of these until after running the joint calling so they all get on the final joint VCF rather than the gVCFs.
I think that would be the right way to go (irrespective of the inconsistencies in the headers). My understanding as a layman (I'm not a bioinformatician) is that many of these annotation fields are (population/group) statistics. By storing these annotations in the individual gVCFs and then importing them into GenomicsDB, the size of the storage and the time to import are increased significantly (since the statistics are being stored in every sample for every position with a record in the gVCF).
Karthik; Thanks for this, I've done that in bcbio so hopefully will avoid the issue going forward. Feel free to close here unless you want to try and trace down further what is happening. Thanks again for the pointer that led us to the underlying issue.
Closing here as this was fixed: https://github.com/bcbio/bcbio-nextgen/issues/2453
@kgururaj @ldgauthier I'd propose that we add defensive code to detect this, as @kgururaj proposed above. Maybe throw with a message identifying the name of the field and the issue ?
Hi,
I am using GATK
version gatk4-4.0.6.0-0
as part of the bcbio-nextgen pipeline for RNA-seq variant calling. There is one step in the pipeline i.e.gatk GenomicsDBImport
that's been failing consistently no matter how less or many resources in terms of memory and cores I provide. I have tried to run the command as part of the pipeline and in stand-alone mode (like below) and both produce the same error:Then, I tried to enable core dumping:
And here is the
hs_err_pid68672.log
file: hs_err_pid68672.log