broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

GenomicsDBImport: A fatal error has been detected by the Java Runtime Environment #5045

Closed komalsrathi closed 5 years ago

komalsrathi commented 6 years ago

Hi,

I am using GATK version gatk4-4.0.6.0-0 as part of the bcbio-nextgen pipeline for RNA-seq variant calling. One step in the pipeline, gatk GenomicsDBImport, has been failing consistently no matter how few or how many resources (memory and cores) I give it. I have tried running the command as part of the pipeline and in stand-alone mode (as below), and both produce the same error:

[rathik@reslnrefo01 log]$ gatk --java-options '-Xms454m -Xmx3181m -XX:+UseSerialGC' GenomicsDBImport --reader-threads 1 --genomicsdb-workspace-path CDL-164-04P-1_0_249250621_genomicsdb -L 1:1-249250621 --variant /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/variation/rnaseq/gatk-haplotype/Sample_1__CDL-164-04P-gatk-haplotype-annotated-rnaedit-annotated-gemini.vcf.gz
Using GATK jar /mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/anaconda/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms454m -Xmx3181m -XX:+UseSerialGC -jar /mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/anaconda/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar GenomicsDBImport --reader-threads 1 --genomicsdb-workspace-path CDL-164-04P-1_0_249250621_genomicsdb -L 1:1-249250621 --variant /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/variation/rnaseq/gatk-haplotype/Sample_1__CDL-164-04P-gatk-haplotype-annotated-rnaedit-annotated-gemini.vcf.gz
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/mnt/isilon/cbmi/variome/tmp/rathik
11:49:24.784 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/anaconda/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:49:25.130 INFO  GenomicsDBImport - ------------------------------------------------------------
11:49:25.130 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.6.0
11:49:25.131 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
11:49:25.134 INFO  GenomicsDBImport - Executing as rathik@reslnrefo01.research.chop.edu on Linux v3.10.0-862.el7.x86_64 amd64
11:49:25.134 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-b10
11:49:25.134 INFO  GenomicsDBImport - Start Date/Time: July 20, 2018 11:49:24 AM EDT
11:49:25.134 INFO  GenomicsDBImport - ------------------------------------------------------------
11:49:25.134 INFO  GenomicsDBImport - ------------------------------------------------------------
11:49:25.135 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
11:49:25.135 INFO  GenomicsDBImport - Picard Version: 2.18.7
11:49:25.136 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:49:25.136 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:49:25.136 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:49:25.136 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:49:25.136 INFO  GenomicsDBImport - Deflater: IntelDeflater
11:49:25.136 INFO  GenomicsDBImport - Inflater: IntelInflater
11:49:25.136 INFO  GenomicsDBImport - GCS max retries/reopens: 20
11:49:25.136 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
11:49:25.137 INFO  GenomicsDBImport - Initializing engine
11:49:25.926 INFO  IntervalArgumentCollection - Processing 249250621 bp from intervals
11:49:25.931 INFO  GenomicsDBImport - Done initializing engine
Created workspace /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/log/CDL-164-04P-1_0_249250621_genomicsdb
11:49:26.339 INFO  GenomicsDBImport - Vid Map JSON file will be written to CDL-164-04P-1_0_249250621_genomicsdb/vidmap.json
11:49:26.339 INFO  GenomicsDBImport - Callset Map JSON file will be written to CDL-164-04P-1_0_249250621_genomicsdb/callset.json
11:49:26.339 INFO  GenomicsDBImport - Complete VCF Header will be written to CDL-164-04P-1_0_249250621_genomicsdb/vcfheader.vcf
11:49:26.339 INFO  GenomicsDBImport - Importing to array - CDL-164-04P-1_0_249250621_genomicsdb/genomicsdb_array
11:49:26.339 INFO  ProgressMeter - Starting traversal
11:49:26.340 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
11:49:26.697 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
Buffer resized from 22726bytes to 32529
Buffer resized from 32529bytes to 32693
Buffer resized from 32693bytes to 32738
Buffer resized from 32738bytes to 32741
Buffer resized from 32741bytes to 32756
Buffer resized from 32756bytes to 32768
Buffer resized from 32768bytes to 32769
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9935bac359, pid=8320, tid=0x00007f99881bf700
#
# JRE version: OpenJDK Runtime Environment (8.0_171-b10) (build 1.8.0_171-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.171-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtiledbgenomicsdb1661720680664773125.so+0x155359]  BufferVariantCell::set_cell(void const*)+0x99
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/log/hs_err_pid8320.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Then, I tried to enable core dumping:

[rathik@reslnrefo01 CDL-164-04P]$ ulimit -c unlimited

[rathik@reslnrefo01 CDL-164-04P]$ gatk --java-options '-Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/joint/gatk-haplotype-joint/CDL-164-04P/1/bcbiotx/tmp1SAfjS' GenomicsDBImport --reader-threads 1 --genomicsdb-workspace-path CDL-164-04P-1_0_249250621_genomicsdb -L 1:1-249250621 --variant /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/variation/rnaseq/gatk-haplotype/Sample_1__CDL-164-04P-gatk-haplotype-annotated-rnaedit-annotated-gemini.vcf.gz
Using GATK jar /mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/anaconda/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/joint/gatk-haplotype-joint/CDL-164-04P/1/bcbiotx/tmp1SAfjS -jar /mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/anaconda/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar GenomicsDBImport --reader-threads 1 --genomicsdb-workspace-path CDL-164-04P-1_0_249250621_genomicsdb -L 1:1-249250621 --variant /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/variation/rnaseq/gatk-haplotype/Sample_1__CDL-164-04P-gatk-haplotype-annotated-rnaedit-annotated-gemini.vcf.gz
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/mnt/isilon/cbmi/variome/tmp/rathik
10:24:55.378 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/anaconda/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
10:24:56.133 INFO  GenomicsDBImport - ------------------------------------------------------------
10:24:56.133 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.6.0
10:24:56.134 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
10:24:56.136 INFO  GenomicsDBImport - Executing as rathik@reslnrefo01.research.chop.edu on Linux v3.10.0-862.9.1.el7.x86_64 amd64
10:24:56.136 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-b10
10:24:56.136 INFO  GenomicsDBImport - Start Date/Time: July 23, 2018 10:24:55 AM EDT
10:24:56.136 INFO  GenomicsDBImport - ------------------------------------------------------------
10:24:56.136 INFO  GenomicsDBImport - ------------------------------------------------------------
10:24:56.137 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
10:24:56.138 INFO  GenomicsDBImport - Picard Version: 2.18.7
10:24:56.138 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
10:24:56.138 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:24:56.138 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:24:56.138 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:24:56.138 INFO  GenomicsDBImport - Deflater: IntelDeflater
10:24:56.138 INFO  GenomicsDBImport - Inflater: IntelInflater
10:24:56.139 INFO  GenomicsDBImport - GCS max retries/reopens: 20
10:24:56.139 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
10:24:56.139 INFO  GenomicsDBImport - Initializing engine
10:24:57.198 INFO  IntervalArgumentCollection - Processing 249250621 bp from intervals
10:24:57.205 INFO  GenomicsDBImport - Done initializing engine
Created workspace /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/CDL-164-04P-1_0_249250621_genomicsdb
10:24:57.553 INFO  GenomicsDBImport - Vid Map JSON file will be written to CDL-164-04P-1_0_249250621_genomicsdb/vidmap.json
10:24:57.554 INFO  GenomicsDBImport - Callset Map JSON file will be written to CDL-164-04P-1_0_249250621_genomicsdb/callset.json
10:24:57.554 INFO  GenomicsDBImport - Complete VCF Header will be written to CDL-164-04P-1_0_249250621_genomicsdb/vcfheader.vcf
10:24:57.554 INFO  GenomicsDBImport - Importing to array - CDL-164-04P-1_0_249250621_genomicsdb/genomicsdb_array
10:24:57.554 INFO  ProgressMeter - Starting traversal
10:24:57.554 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
10:24:57.971 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
Buffer resized from 22726bytes to 32529
Buffer resized from 32529bytes to 32693
Buffer resized from 32693bytes to 32738
Buffer resized from 32738bytes to 32741
Buffer resized from 32741bytes to 32756
Buffer resized from 32756bytes to 32768
Buffer resized from 32768bytes to 32769
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f7288295359, pid=68672, tid=0x00007f72dc187700
#
# JRE version: OpenJDK Runtime Environment (8.0_171-b10) (build 1.8.0_171-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.171-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtiledbgenomicsdb8064358042335455262.so+0x155359]  BufferVariantCell::set_cell(void const*)+0x99
#
# Core dump written. Default location: /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/core or core.68672
#
# An error report file with more information is saved as:
# /mnt/isilon/cbmi/variome/rathik/mendelian_rnaseq/gatk_output/CDL-164-04P/hs_err_pid68672.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

And here is the hs_err_pid68672.log file: hs_err_pid68672.log

cmnbroad commented 6 years ago

@kgururaj Can you take a look at this one?

kgururaj commented 6 years ago

These kinds of errors are typically seen when a field description in the VCF header is incorrect. For example, describing a field as having a fixed integer length when the field is really variable length.

I would recommend closely scanning the VCF header for inconsistencies first.
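
As a purely hypothetical illustration of that kind of mismatch (the field name EX_FIELD and the values are invented, not taken from this VCF), a header might declare a fixed-length field:

##INFO=<ID=EX_FIELD,Number=1,Type=Integer,Description="Example annotation">

while a record actually carries a variable number of values:

1       12345   .       A       G,<NON_REF>     .       .       EX_FIELD=3,5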

ldgauthier commented 6 years ago

Is there some sort of less manual shortcut, like trying to convert to BCF?
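
(For reference, a conversion attempt with bcftools might look like the command below; whether the BCF encoder actually rejects these particular count mismatches has not been verified here.)

bcftools view -O b -o test.bcf Sample_1__CDL-164-04P-gatk-haplotype-annotated-rnaedit-annotated-gemini.vcf.gz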

komalsrathi commented 6 years ago

Hi,

Not sure if it is applicable, but I just ran vcf-validator on the input gVCF file (I am also not sure whether vcf-validator can be run on a gVCF):

vcf-validator Sample_1__CDL-164-04P-gatk-haplotype-annotated-rnaedit-annotated-gemini.vcf.gz

Here are the first few lines from the logs:

The header tag 'reference' not present. (Not required but highly recommended.)
INFO field at 1:14599 .. INFO tag [max_aaf_all=0.2096] expected different number of values (expected 2, found 1)
INFO field at 1:14604 .. INFO tag [max_aaf_all=0.2096] expected different number of values (expected 2, found 1)
INFO field at 1:14930 .. INFO tag [max_aaf_all=0.5231] expected different number of values (expected 2, found 1)
INFO field at 1:15211 .. INFO tag [max_aaf_all=0.7316] expected different number of values (expected 2, found 1)
INFO field at 1:15274 .. INFO tag [max_aaf_all=0.7205] expected different number of values (expected 3, found 1)
INFO field at 1:16949 .. INFO tag [max_aaf_all=0.0227] expected different number of values (expected 2, found 1)
INFO field at 1:17365 .. INFO tag [max_aaf_all=0.3841] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_afr=0.1603] expected different number of values (expected 2, found 1),INFO tag [af_exac_all=0.2553] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_nfe=0.2715] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_sas=0.2883] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_oth=0.2581] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_eas=0.3841] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_amr=0.221] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_fin=0.2245] expected different number of values (expected 2, found 1)
INFO field at 1:17373 .. INFO tag [af_adj_exac_amr=0.028] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_fin=0.0342] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_eas=0.019324] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_sas=0.019802] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_oth=0.0625] expected different number of values (expected 2, found 1),INFO tag [af_exac_all=0.0425] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_nfe=0.0479] expected different number of values (expected 2, found 1),INFO tag [af_adj_exac_afr=0.049] expected different number of values (expected 2, found 1),INFO tag [max_aaf_all=0.0625] expected different number of values (expected 2, found 1)

I grepped the first and last positions from the above gist:

# pos 1:14599
1       14599   .       T       A,<NON_REF>     546.8   .       BaseQRankSum=0.226;ClippingRankSum=0.000;DP=63;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQ0=0;MQRankSum=0.000;RAW_MQ=226800.00;ReadPosRankSum=2.821;rs_ids=rs707680,rs531646671;af_1kg_amr=0.1758;af_1kg_eas=0.0893;af_1kg_sas=0.2096;af_1kg_afr=0.121;af_1kg_eur=0.161;af_1kg_all=0.1476;fitcons=0.263;encode_consensus_gm12878=R;encode_consensus_h1hesc=R;encode_consensus_helas3=CTCF;encode_consensus_hepg2=R;encode_consensus_huvec=R;encode_consensus_k562=T;dgv=CopyNumber;hapmap1=2.9818;hapmap2=0;max_aaf_all=0.2096       GT:AD:DP:GQ:MMQ:PGT:PID:PL:SB   0/1:46,17,0:63:99:60,0:0|1:14574_A_G:575,0,1880,714,1932,2645:46,0,17,0

# pos 1:17373
1       17373   .       A       G,<NON_REF>     2.4     .       BaseQRankSum=-1.763;ClippingRankSum=0.000;DP=21;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQ0=0;MQRankSum=0.000;RAW_MQ=75600.00;ReadPosRankSum=0.371;ac_exac_all=261;an_exac_all=6138;ac_adj_exac_afr=51;an_adj_exac_afr=1040;ac_adj_exac_amr=8;an_adj_exac_amr=286;ac_adj_exac_eas=8;an_adj_exac_eas=414;ac_adj_exac_fin=16;an_adj_exac_fin=468;ac_adj_exac_nfe=166;an_adj_exac_nfe=3462;ac_adj_exac_oth=4;an_adj_exac_oth=64;ac_adj_exac_sas=8;an_adj_exac_sas=404;num_exac_Het=261,.;num_exac_Hom=0,.;rs_ids=rs750111615;fitcons=0.2944;encode_consensus_gm12878=T;encode_consensus_h1hesc=CTCF;encode_consensus_helas3=CTCF;encode_consensus_hepg2=CTCF;encode_consensus_huvec=CTCF;encode_consensus_k562=CTCF;gerp_elements=3.4612e-09;dgv=CopyNumber;hapmap1=2.9818;hapmap2=0;af_exac_all=0.0425;af_adj_exac_afr=0.049;af_adj_exac_amr=0.028;af_adj_exac_eas=0.019324;af_adj_exac_fin=0.0342;af_adj_exac_nfe=0.0479;af_adj_exac_oth=0.0625;af_adj_exac_sas=0.019802;max_aaf_all=0.0625  GT:AD:DP:GQ:MMQ:PL:SB   0/1:19,2,0:21:27:60,0:27,0,1144,84,1150,1234:14,5,2,0

Here is the header of the gvcf file: header.txt

In the header, these fields are declared with Number=A. Because there is an additional non-ref allele (<NON_REF>) in the ALT column, two values are expected for each of these fields, whereas only one is present.
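
A quick way to find every field with this problem at once is a small standalone script; the following is a rough sketch (added here, not from this thread), written against the plain VCF text so it does not depend on any VCF library, and its output only loosely mirrors vcf-validator. The script name and paths are placeholders.

# check_info_counts.py - rough sketch: flag INFO values whose count disagrees
# with the Number declared in the ##INFO header lines (A, R, or a fixed integer).
import gzip
import sys

def expected_count(number, n_alt):
    if number == "A":            # one value per ALT allele (<NON_REF> counts as an ALT in a gVCF)
        return n_alt
    if number == "R":            # one value per allele, REF included
        return n_alt + 1
    if number in (".", "G"):     # unconstrained or genotype-dependent: not checked here
        return None
    return int(number)           # fixed-length field

def check(path):
    declared = {}                # INFO ID -> declared Number
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("##INFO=<"):
                # a naive split is enough to pick out ID and Number, which precede Description
                parts = dict(kv.split("=", 1) for kv in
                             line.strip()[8:].rstrip(">").split(",") if "=" in kv)
                declared[parts["ID"]] = parts.get("Number", ".")
            elif not line.startswith("#"):
                cols = line.rstrip("\n").split("\t")
                chrom, pos, alt, info = cols[0], cols[1], cols[4], cols[7]
                n_alt = len(alt.split(","))
                for entry in info.split(";"):
                    if "=" not in entry:
                        continue  # flag-type INFO field, no value to count
                    key, value = entry.split("=", 1)
                    want = expected_count(declared.get(key, "."), n_alt)
                    have = len(value.split(","))
                    if want is not None and want != have:
                        print(f"{chrom}:{pos} {key}={value} expected {want} value(s), found {have}")

if __name__ == "__main__":
    check(sys.argv[1])           # e.g. python check_info_counts.py sample.g.vcf.gz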

kgururaj commented 6 years ago

That is definitely an issue, but I'm not sure if that's causing the crash. Would you take a look at fields similar to the following as well?

ac_adj_exac_amr,Number=1,Type=Float

kgururaj commented 6 years ago

@ldgauthier I wasn't aware of any - @komalsrathi's suggestion is a good start.

chapmanb commented 6 years ago

Komal -- thanks for raising this issue and providing so much detail.

Karthik -- Thanks for the debugging and pointers on this. From the core dump alone I hadn't thought to look at the annotation fields, so this is a really helpful lead. What we can do in bcbio is avoid adding any of these annotations until after joint calling, so they all end up on the final joint VCF rather than on the gVCFs. This should hopefully avoid needing to dig too deeply into this, and we can take our lesson as: don't annotate gVCFs with too much information. Thank you again for helping diagnose the underlying issue.
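
For anyone hitting this outside of bcbio, one possible workaround (a suggestion added here, not something tested in this thread) is to strip the offending INFO annotations from the gVCF before import, for example with bcftools annotate; the field names below are just examples taken from the validator output above:

bcftools annotate -x INFO/max_aaf_all,INFO/af_exac_all -O z -o stripped.g.vcf.gz input.g.vcf.gz
tabix -p vcf stripped.g.vcf.gz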

komalsrathi commented 6 years ago

@kgururaj

So I found that other fields also trigger a similar error in vcf-validator:

Below you will find messages like this: INFO tag [an_adj_exac_oth=16,16] expected different number of values (1)

# from the vcf-validator
INFO field at 4:2044128 .. INFO tag [af_exac_all=0] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_fin=0] expected different number of values (expected 3, found 1),INFO tag [an_adj_exac_oth=16,16] expected different number of values (1),INFO tag [an_adj_exac_nfe=202,202] expected different number of values (1),INFO tag [an_adj_exac_afr=20,20] expected different number of values (1),INFO tag [af_adj_exac_amr=0] expected different number of values (expected 3, found 1),INFO tag [an_exac_all=1246,1246] expected different number of values (1),INFO tag [an_adj_exac_amr=10,10] expected different number of values (1),INFO tag [af_adj_exac_oth=0] expected different number of values (expected 3, found 1),INFO tag [an_adj_exac_eas=30,30] expected different number of values (1),INFO tag [an_adj_exac_fin=2,2] expected different number of values (1),INFO tag [an_adj_exac_sas=966,966] expected different number of values (1),INFO tag [af_adj_exac_sas=0] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_afr=0] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_nfe=0] expected different number of values (expected 3, found 1),INFO tag [max_aaf_all=1] expected different number of values (expected 3, found 1),INFO tag [af_adj_exac_eas=0] expected different number of values (expected 3, found 1)

In this case, an_adj_exac_oth has more than one value while only one is allowed:

# from the gvcf header (via grep)
##INFO=<ID=an_adj_exac_oth,Number=1,Type=Integer,Description="Other Chromosome Count (from /mnt/isilon/cbmi/variome/bin/bcbio-nextgen/bcbio/gemini_data/ExAC.r0.3.sites.vep.tidy.vcf.gz)">

# the corresponding variant in the vcf 
4   2044128 .   C   T,CGCT,<NON_REF>    3476.7  .   DP=97;ExcessHet=3.0103;MLEAC=2,0,0;MLEAF=1.00,0.00,0.00;MQ0=0;RAW_MQ=349200.00;ac_exac_all=1061;an_exac_all=1246,1246;ac_adj_exac_afr=17;an_adj_exac_afr=20,20;ac_adj_exac_amr=8;an_adj_exac_amr=10,10;ac_adj_exac_eas=25;an_adj_exac_eas=30,30;ac_adj_exac_fin=2;an_adj_exac_fin=2,2;ac_adj_exac_nfe=165;an_adj_exac_nfe=202,202;ac_adj_exac_oth=13;an_adj_exac_oth=16,16;ac_adj_exac_sas=831;an_adj_exac_sas=966,966;num_exac_Het=184,0,.;num_exac_Hom=438,0,.;rs_ids=rs570712,rs764547021;af_1kg_amr=1;af_1kg_eas=1;af_1kg_sas=1;af_1kg_afr=0.9728;af_1kg_eur=1;af_1kg_all=0.9928;fitcons=0.6576;encode_consensus_gm12878=E;encode_consensus_h1hesc=E;encode_consensus_helas3=unknown;encode_consensus_hepg2=E;encode_consensus_huvec=E;encode_consensus_k562=E;rmsk=Simple_repeat_Simple_repeat_(CTG)n;gerp_elements=6.7845e-2;cpg_island;dgv=CopyNumber;tfbs=HA-E2F1,E2F6_(H-50);hapmap1=4.1909;hapmap2=2.1784;stam_mean=136.536;stam_names=8988t,Adult_Th0,CACO2,Chorion,Cll,E_myoblast,Fibrobl,Fibrop,Gliobla,Gm12891,Gm12892,Gm18507,Gm19238,Gm19239,Gm19240,H9es,HESC,HL60,HMEC,HSMM,HSMM_D,HUVEC,Hela,Helas3Ifna4h,HepG2,Hepatocytes,Hpde6e6e7,Htr8,Huh75,Ips,Ishikawa_E,Ishikawa_T,K562,LNCap,LncapAndro,MCF7,Mcf7Hypoxlac,Medullo,Melano,Myometr,NHEK,Osteobl,Panisd,Panislets,Phte,Progfib,RWPE,Stellate,T47d,Urotsa,hTH1;af_exac_all=0;af_adj_exac_afr=0;af_adj_exac_amr=0;af_adj_exac_eas=0;af_adj_exac_fin=0;af_adj_exac_nfe=0;af_adj_exac_oth=0;af_adj_exac_sas=0;max_aaf_all=1    GT:AD:DP:GQ:MMQ:PL:SB   1/1:0,97,0,0:97:99:60,0,0:3514,291,0,3638,292,4365,3623,292,4049,3939:0,0,33,64

kgururaj commented 6 years ago

Yep, that will definitely cause a crash - variable-length fields need a length value stored, while fixed-length fields don't.

I can add checks for this scenario in GenomicsDB. Please let me know if the following makes sense: the header says that field F is a fixed-length field with length = N. In the data section, if

kgururaj commented 6 years ago

@chapmanb wrote:

> What we can do in bcbio is avoid adding any of these until after running the joint calling so they all get on the final joint VCF rather than the gVCFs.

I think that would be the right way to go (irrespective of the inconsistencies in the headers). My understanding as a layman (I'm not a bioinformatician) is that many of these annotation fields are population/group statistics. Storing these annotations in the individual gVCFs and then importing them into GenomicsDB significantly increases both the storage size and the import time, since the statistics are stored for every sample at every position that has a record in the gVCF.

chapmanb commented 6 years ago

Karthik -- Thanks for this. I've done that in bcbio, so hopefully it will avoid the issue going forward. Feel free to close this unless you want to trace down further what is happening. Thanks again for the pointer that led us to the underlying issue.

komalsrathi commented 6 years ago

Closing here as this was fixed: https://github.com/bcbio/bcbio-nextgen/issues/2453

cmnbroad commented 6 years ago

@kgururaj @ldgauthier I'd propose that we add defensive code to detect this, as @kgururaj proposed above. Maybe throw with a message identifying the name of the field and the issue?
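
As a rough sketch of that defensive check (illustrative only; this is not GenomicsDB or GATK code, and the function name and error type are invented), the idea is to compare the observed value count against the header declaration and fail with a message naming the field instead of crashing in native code:

def validate_info_field(field_name, declared_number, n_alt_alleles, observed_count):
    # Translate the declared Number into an expected count for this record.
    if declared_number == "A":
        expected = n_alt_alleles
    elif declared_number == "R":
        expected = n_alt_alleles + 1
    elif declared_number in (".", "G"):
        return  # unconstrained or genotype-dependent; nothing to enforce here
    else:
        expected = int(declared_number)
    if observed_count != expected:
        raise ValueError(
            f"INFO field '{field_name}' is declared with Number={declared_number} "
            f"(expected {expected} value(s) for this record) but {observed_count} "
            f"value(s) were found; fix the VCF header or the record before import")

# Example from this thread: an_adj_exac_oth is declared Number=1 but carries two values.
validate_info_field("an_adj_exac_oth", "1", n_alt_alleles=3, observed_count=2)  # raises ValueError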