broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 3433: The VCF specification does not allow for whitespace in the INFO field . #6021

Open imneuro opened 5 years ago

imneuro commented 5 years ago

Hi GATK team,

I got the following error message with GATK 4.1.0.0 on our local cluster:

Using GATK jar /dsg_cent/packages/GATK/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java1.8 -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx5 g -jar /dsg_cent/packages/GATK/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants -R /dsgmnt/llfs2/masterdata/geno/hg38/resources_broad_hg38_v0_Homo_sapiens_assembl y38.fasta -L chr1 -V /dsgmnt/seq5_llfs/work/xhong/v4100/ApplyVQSR//ExcessHet_joint525_c1_22.SNP.VQSR.g.vcf.gz -O /dsgmnt/seq5_llfs/work/xhong/v4100/ApplyVQSR//ExcessHet_joi nt525_c1.SNP.VQSR.g.vcf.gz
09:15:49.372 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/dsg_cent/packages/GATK/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/nati ve/libgkl_compression.so
09:15:51.131 INFO SelectVariants - ------------------------------------------------------------
09:15:51.132 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.0.0
09:15:51.132 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
09:15:51.132 INFO SelectVariants - Executing as xhong@blade5-4-11.dsg.wustl.edu on Linux v2.6.32-573.12.1.el6.x86_64 amd64
09:15:51.133 INFO SelectVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_31-b13
09:15:51.133 INFO SelectVariants - Start Date/Time: June 27, 2019 9:15:49 AM CDT
09:15:51.133 INFO SelectVariants - ------------------------------------------------------------
09:15:51.133 INFO SelectVariants - ------------------------------------------------------------
09:15:51.134 INFO SelectVariants - HTSJDK Version: 2.18.2
09:15:51.134 INFO SelectVariants - Picard Version: 2.18.25
09:15:51.134 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:15:51.135 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:15:51.135 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:15:51.135 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:15:51.135 INFO SelectVariants - Deflater: IntelDeflater
09:15:51.135 INFO SelectVariants - Inflater: IntelInflater
09:15:51.135 INFO SelectVariants - GCS max retries/reopens: 20
09:15:51.135 INFO SelectVariants - Requester pays: disabled
09:15:51.136 INFO SelectVariants - Initializing engine
09:15:52.547 INFO FeatureManager - Using codec VCFCodec to read file file:///dsgmnt/seq5_llfs/work/xhong/v4100/ApplyVQSR/ExcessHet_joint525_c1_22.SNP.VQSR.g.vcf.gz
09:15:53.171 INFO IntervalArgumentCollection - Processing 248956422 bp from intervals
09:15:53.221 INFO SelectVariants - Done initializing engine
09:15:53.390 INFO ProgressMeter - Starting traversal
09:15:53.390 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
09:15:53.479 INFO SelectVariants - Shutting down engine
[June 27, 2019 9:15:53 AM CDT] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=2131755008
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 3433: The VCF specification does not allow for whitespace in the INFO field . Offending field value was "AC=1;AF=9.671e-04;AN=1034;AS_BaseQRankSum=-1.550;AS_FS=8.334;AS_InbreedingCoeff=-0.3147;AS_MQ=31.69;AS_MQRankSum=-0.200;AS_QD=28.73;AS_ReadPosR ankSum=nul;AS_SOR=2.235;BaseQRankSum=-1.381e+00;DP=40368;ExcessHet=160.0000;FS=8.334;InbreedingCoeff=-0.3147;MLEAC=7;MLEAF=6.770e-03;MQ=37.13;MQRankSum=0.126;QD=2.46;SOR=2. 235 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:75,0:75:0:.:.:0,0,1525

However, from the error message I cannot see any whitespace in the INFO field.

The file /dsgmnt/seq5_llfs/work/xhong/v4100/ApplyVQSR/ExcessHet_joint525_c1_22.SNP.VQSR.g.vcf.gz is the output of the following command:

gatk4.1.0.0 --java-options '-Xmx100g -Xmx100g' ApplyVQSR \
    -R /dsgmnt/llfs2/masterdata/geno/hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta \
    -V ${SNPPath}/joint525_chr1_ExcessHet_filter.SNP.g.vcf.gz \
    -V ${SNPPath}/joint525_chr2_ExcessHet_filter.SNP.g.vcf.gz \
    ....
    -V ${SNPPath}/joint525_chr22_ExcessHet_filter.SNP.g.vcf.gz \
    -O /dsgmnt/seq5_llfs/work/xhong/v4100/ApplyVQSR//ExcessHet_joint525_c1_22.SNP.VQSR.g.vcf.g z \
    --truth-sensitivity-filter-level 97 \
    --tranches-file /dsgmnt/seq5_llfs/work/xhong/v4100/VQSR//ExcessHet_joint525_c1_22.snp.tranches \
    --recal-file /dsgmnt/seq5_llfs/work/xho ng/v4100/VQSR//ExcessHet_joint525_c1_22.snp.recal \
    -mode SNP

This step produced no errors or warnings on standard error or standard output.

I also tried applying the VQSR SNP model to ${SNPPath}/joint525_chr1_ExcessHet_filter.SNP.g.vcf.gz on its own. That works well: when I select BISNPs from its output, I cannot reproduce the error.

I would appreciate suggestions on how to narrow down the problem. Any input is welcome.

cmnbroad commented 5 years ago

@imneuro Not sure why, but there is an embedded space in that field: SOR=2. 235
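To find every record with this kind of embedded space, rather than only the first one the parser stops on, the INFO column can be scanned directly. The following is a minimal Python sketch, not part of GATK; the helper names `open_vcf` and `find_info_whitespace` are made up for illustration, and it assumes a tab-delimited VCF whose INFO field is the eighth column.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_info_whitespace(lines):
    """Yield (line_number, info_value) for records whose INFO column contains a space.

    `lines` is any iterable of VCF lines, for example an open file handle.
    Line numbers are 1-based and include header lines.
    """
    for line_number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # header and column-name lines carry no INFO value
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # too few columns to have an INFO field; not checked here
        info = fields[7]
        if " " in info:
            yield line_number, info
```

A possible use is `with open_vcf(path) as handle: hits = list(find_info_whitespace(handle))`, run once on the merged ApplyVQSR output and once on each per-chromosome input, to see at which step the space first appears.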

Anne-oxford commented 5 years ago

I've also had a similar issue with a VCF generated by ApplyVQSR in gatk-4.1.3.0. Did you ever discover the reason for the whitespace?

ldgauthier commented 4 years ago

This is going to be nearly impossible to debug without being able to reproduce it. If I could get an input VCF, a command line, and an example bad output VCF, that would go a long way: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671

ManavalanG commented 3 years ago

Just wanted to note that, unlike in VCF spec 4.2, "Space characters are allowed in values" as per spec 4.3.

ldgauthier commented 3 years ago

While 4.3 support is on our roadmap, GATK doesn't currently support anything more recent than 4.2.
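Since the two spec versions are said above to differ on spaces, it can help to check which version a file declares before deciding whether a space in INFO is an error for a given reader. Below is a small Python sketch with hypothetical helper names (`vcf_version`, `declares_at_least`); it assumes the `##fileformat=VCFvX.Y` line appears among the `##` meta-information lines at the top of the file.

```python
import re

# Matches the meta-information line that declares the VCF version.
FILEFORMAT = re.compile(r"^##fileformat=VCFv(\d+)\.(\d+)")


def vcf_version(header_lines):
    """Return (major, minor) from the ##fileformat line, or None if absent."""
    for line in header_lines:
        if not line.startswith("##"):
            break  # past the meta-information lines
        match = FILEFORMAT.match(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def declares_at_least(header_lines, minimum):
    """True if the file declares a VCF version >= `minimum`, e.g. (4, 3)."""
    version = vcf_version(header_lines)
    return version is not None and version >= minimum
```

Per the comment above, a file that declares 4.3 would still be read by GATK under 4.2 rules, so `declares_at_least(lines, (4, 3))` only describes the file, not what the tool accepts.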