broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.64k stars 580 forks source link

GenotypeGVCFs bcf codec error 'nps && nps*line->n_sample==n' failed #6742

Open GATKSupportTeam opened 3 years ago

GATKSupportTeam commented 3 years ago

Issue from the forum regarding GenotypeGVCFs. User is getting an error with the bcf codec. They are running 4.1.7.0, so I recommended updating to 4.1.8.1 for now to see if it is fixed.

This request was created from a contribution made by Brynjar Sigurðsson on August 04, 2020 09:49 UTC.

Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/360072049511-Assertion-nps-nps-line-n-sample-n-failed

--

Hello,

I am running a variant calling using GenotypeGVCFs. The process first creates a GenomicsDB on 50Kbase regions from 150K GVCFs and then runs GenotypeGVCFs wrapped in GNU parallel (after splitting the region into as many threads as are available).

Most regions complete without a problem, but some fail on GenotypeGVCFs with the assertion error

java: /home/vagrant/GenomicsDB/dependencies/htslib/vcf.c:4225: bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.

Some of the failing regions I have run with up to 1.5 TB memory (18 threads) but they still fail.

a) GATK version used

version 4.1.7.0

b) Exact GATK commands used

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx1290240M -Xms1290240M -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /nfs/fs1/bioinfo/apps-x86_64/GATK/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar GenomicsDBImport --genomicsdb-workspace-path /tmp/tmp.ceRdvv/GDB --intervals chr1:5149001-5201000 --tmp-dir /tmp/tmp.ceRdvv/GDB_tmp --sample-name-map /tmp/tmp.ceRdvv/snmap --batch-size 100 --reader-threads 17

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Djava.io.tmpdir=/tmp/tmp.ceRdvv -Xmx71680M -Xms71680M -jar /nfs/fs1/bioinfo/apps-x86_64/GATK/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar GenotypeGVCFs --genomicsdb-use-vcf-codec -R /odinn/data/extdata/1000genomes/2019-06-21_GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -V gendb:///tmp/tmp.ceRdvv/GDB --tmp-dir=/tmp/tmp.ceRdvv --interval-padding 1000 --only-output-calls-starting-in-intervals -L chr1:5161113-5163890 -O /tmp/tmp.ceRdvv/splitdir/reg_5.padded.vcf.gz

c) The entire error log if applicable.

Running:

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx1290240M -Xms1290240M -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /nfs/fs1/bioinfo/apps-x86_64/GATK/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar GenomicsDBImport --genomicsdb-workspace-path /tmp/tmp.ceRdvv/GDB --intervals chr1:5149001-5201000 --tmp-dir /tmp/tmp.ceRdvv/GDB_tmp --sample-name-map /tmp/tmp.ceRdvv/snmap --batch-size 100 --reader-threads 17

20:05:36.112 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/fs1/bioinfo/apps-x86_64/GATK/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

Jul 27, 2020 8:05:40 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine

INFO: Failed to detect whether we are running on Google Compute Engine.

20:05:40.627 INFO GenomicsDBImport - ------------------------------------------------------------

20:05:40.628 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.7.0

20:05:40.628 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/

20:05:40.628 INFO GenomicsDBImport - Executing as brynjars@lhpc-1403.decode.is on Linux v3.10.0-957.5.1.el7.x86_64 amd64

20:05:40.628 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_151-b12

20:05:40.628 INFO GenomicsDBImport - Start Date/Time: July 27, 2020 8:05:36 PM GMT

20:05:40.628 INFO GenomicsDBImport - ------------------------------------------------------------

20:05:40.628 INFO GenomicsDBImport - ------------------------------------------------------------

20:05:40.629 INFO GenomicsDBImport - HTSJDK Version: 2.21.2

20:05:40.629 INFO GenomicsDBImport - Picard Version: 2.21.9

20:05:40.629 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2

20:05:40.629 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

20:05:40.629 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

20:05:40.629 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

20:05:40.629 INFO GenomicsDBImport - Deflater: IntelDeflater

20:05:40.629 INFO GenomicsDBImport - Inflater: IntelInflater

20:05:40.629 INFO GenomicsDBImport - GCS max retries/reopens: 20

20:05:40.629 INFO GenomicsDBImport - Requester pays: disabled

20:05:40.629 INFO GenomicsDBImport - Initializing engine

20:05:41.128 INFO IntervalArgumentCollection - Processing 52000 bp from intervals

20:05:41.131 INFO GenomicsDBImport - Done initializing engine

20:05:41.342 INFO GenomicsDBImport - Vid Map JSON file will be written to /tmp/tmp.ceRdvv/GDB/vidmap.json

20:05:41.342 INFO GenomicsDBImport - Callset Map JSON file will be written to /tmp/tmp.ceRdvv/GDB/callset.json

20:05:41.342 INFO GenomicsDBImport - Complete VCF Header will be written to /tmp/tmp.ceRdvv/GDB/vcfheader.vcf

20:05:41.342 INFO GenomicsDBImport - Importing to array - /tmp/tmp.ceRdvv/GDB/genomicsdb_array

20:05:41.342 INFO ProgressMeter - Starting traversal

20:05:41.342 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute

20:05:41.890 INFO GenomicsDBImport - Starting batch input file preload

20:05:42.320 INFO GenomicsDBImport - Finished batch preload

20:05:42.320 INFO GenomicsDBImport - Importing batch 1 with 100 samples

20:06:03.127 INFO ProgressMeter - chr1:5149001 0.4 1 2.8

....

03:37:31.740 INFO GenomicsDBImport - Importing batch 1502 with 19 samples

03:37:35.318 INFO GenomicsDBImport - Done importing batch 1502/1502

03:37:35.318 INFO ProgressMeter - chr1:5149001 451.9 1502 3.3

03:37:35.318 INFO ProgressMeter - Traversal complete. Processed 1502 total batches in 451.9 minutes.

03:37:35.318 INFO GenomicsDBImport - Import of all batches to GenomicsDB completed!

03:37:35.318 INFO GenomicsDBImport - Shutting down engine

[July 28, 2020 3:37:35 AM GMT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 451.99 minutes.

Runtime.totalMemory()=1351453507584

Running:

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Djava.io.tmpdir=/tmp/tmp.ceRdvv -Xmx71680M -Xms71680M -jar /nfs/fs1/bioinfo/apps-x86_64/GATK/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar GenotypeGVCFs --genomicsdb-use-vcf-codec -R /odinn/data/extdata/1000genomes/2019-06-21_GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -V gendb:///tmp/tmp.ceRdvv/GDB --tmp-dir=/tmp/tmp.ceRdvv --interval-padding 1000 --only-output-calls-starting-in-intervals -L chr1:5161113-5163890 -O /tmp/tmp.ceRdvv/splitdir/reg_5.padded.vcf.gz

03:37:53.320 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/fs1/bioinfo/apps-x86_64/GATK/gatk-4.1.7.0/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

Jul 28, 2020 3:37:57 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine

INFO: Failed to detect whether we are running on Google Compute Engine.

03:37:57.487 INFO GenotypeGVCFs - ------------------------------------------------------------

03:37:57.488 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.7.0

03:37:57.488 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/

03:37:57.524 INFO GenotypeGVCFs - Executing as brynjars@lhpc-1403.decode.is on Linux v3.10.0-957.5.1.el7.x86_64 amd64

03:37:57.524 INFO GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_151-b12

03:37:57.524 INFO GenotypeGVCFs - Start Date/Time: July 28, 2020 3:37:53 AM GMT

03:37:57.524 INFO GenotypeGVCFs - ------------------------------------------------------------

03:37:57.524 INFO GenotypeGVCFs - ------------------------------------------------------------

03:37:57.524 INFO GenotypeGVCFs - HTSJDK Version: 2.21.2

03:37:57.524 INFO GenotypeGVCFs - Picard Version: 2.21.9

03:37:57.524 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2

03:37:57.524 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

03:37:57.524 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

03:37:57.525 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

03:37:57.525 INFO GenotypeGVCFs - Deflater: IntelDeflater

03:37:57.525 INFO GenotypeGVCFs - Inflater: IntelInflater

03:37:57.525 INFO GenotypeGVCFs - GCS max retries/reopens: 20

03:37:57.525 INFO GenotypeGVCFs - Requester pays: disabled

03:37:57.525 INFO GenotypeGVCFs - Initializing engine

WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records

03:38:09.777 INFO IntervalArgumentCollection - Processing 4778 bp from intervals

03:38:09.805 INFO GenotypeGVCFs - Done initializing engine

03:38:10.173 INFO ProgressMeter - Starting traversal

03:38:10.174 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute

WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records

WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records

Chromosome chr1 position 5160169 (TileDB column 5160168) has too many alleles in the combined VCF record : 60 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

03:54:17.977 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160169

Chromosome chr1 position 5160190 (TileDB column 5160189) has too many alleles in the combined VCF record : 83 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:03:15.442 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160190

Chromosome chr1 position 5160195 (TileDB column 5160194) has too many alleles in the combined VCF record : 195 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

Chromosome chr1 position 5160196 (TileDB column 5160195) has too many alleles in the combined VCF record : 1290 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:05:31.609 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160195

04:05:35.779 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160196

Chromosome chr1 position 5160208 (TileDB column 5160207) has too many alleles in the combined VCF record : 146 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

Chromosome chr1 position 5160210 (TileDB column 5160209) has too many alleles in the combined VCF record : 52 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:09:35.688 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160208

04:10:06.100 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160210

Chromosome chr1 position 5160222 (TileDB column 5160221) has too many alleles in the combined VCF record : 206 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

Chromosome chr1 position 5160224 (TileDB column 5160223) has too many alleles in the combined VCF record : 8659 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:17:24.497 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160222

Chromosome chr1 position 5160226 (TileDB column 5160225) has too many alleles in the combined VCF record : 87 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:18:09.549 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160224

04:18:42.716 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160226

Chromosome chr1 position 5160230 (TileDB column 5160229) has too many alleles in the combined VCF record : 954 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:19:50.347 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160230

Chromosome chr1 position 5160234 (TileDB column 5160233) has too many alleles in the combined VCF record : 59 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:21:35.069 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160234

Chromosome chr1 position 5160240 (TileDB column 5160239) has too many alleles in the combined VCF record : 164 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

Chromosome chr1 position 5160242 (TileDB column 5160241) has too many alleles in the combined VCF record : 56 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:22:51.090 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160240

04:23:21.211 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160242

Chromosome chr1 position 5160246 (TileDB column 5160245) has too many alleles in the combined VCF record : 61 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:24:19.590 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160246

Chromosome chr1 position 5160250 (TileDB column 5160249) has too many alleles in the combined VCF record : 59 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:25:47.389 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160250

Chromosome chr1 position 5160256 (TileDB column 5160255) has too many alleles in the combined VCF record : 317 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:27:41.170 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160256

Chromosome chr1 position 5160262 (TileDB column 5160261) has too many alleles in the combined VCF record : 209 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:30:07.108 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160262

Chromosome chr1 position 5160266 (TileDB column 5160265) has too many alleles in the combined VCF record : 1601 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

04:33:01.269 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location chr1:5160266

Chromosome chr1 position 5160272 (TileDB column 5160271) has too many alleles in the combined VCF record : 183 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

Chromosome chr1 position 5160273 (TileDB column 5160272) has too many alleles in the combined VCF record : 25795 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

java: /home/vagrant/GenomicsDB/dependencies/htslib/vcf.c:4225: bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.

parallel: This job failed:

/nfs/fs1/bioinfo/apps-x86_64/GATK/gatk-4.1.7.0/gatk --java-options "-Djava.io.tmpdir=/tmp/tmp.ceRdvv -Xmx71680M -Xms71680M" GenotypeGVCFs --genomicsdb-use-vcf-codec -R /odinn/data/extdata/1000genomes/2019-06-21_GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -V gendb:///tmp/tmp.ceRdvv/GDB --tmp-dir=/tmp/tmp.ceRdvv --interval-padding 1000 --only-output-calls-starting-in-intervals -L chr1:5161113-5163890 -O /tmp/tmp.ceRdvv/splitdir/reg_5.padded.vcf.gz

(created from Zendesk ticket #6566)
gz#6566

kgururaj commented 3 years ago

I suspect the issue is that some lines have a very large number of genotypes (due to some combination of a large number of alleles and ploidy). This causes all sorts of failures in the htslib library because of integer (32-bit) overflow of length fields. A quick check would be to re-run GenotypeGVCFs with the argument --max-genotype-count=10 (small number that should succeed for 150K samples). If that succeeds, we can try to figure out what's a good number for the argument (the default value is 1000).