jamesdalg closed this issue 1 year ago
I've replicated this issue with GRIDSS 2.12.2 and 2.13.2. I've also seen this issue (notice the first line in the ALT field, there's a period there and GRIPSS then calls the file malformed):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 10000 gridss0b_1b N .AACCCTAACCN 4500.73 NO_SR AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=2259.04;BASRP=89;BASSR=0;BEID=asm0-27782;BEIDH=-1;BEIDL=10;BMQ=26.25;BMQN=20.00;BMQX=42.00;BQ=4500.73;BSC=0;BSCQ=0.00;BUM=86;BUMQ=2241.69;BVF=92;CAS=0;CASQ=0.00;CQ=4832.73;EVENT=gridss0b_1;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=23;REFPAIR=0;RP=0;RPQ=0.00;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.800:0.00:0:0:0:0.00:0:0.00:2259.04:89:0:4500.73:0:0.00:86:2241.69:92:0.00:0:0.00:0.00:0.00:23:0:0:0.00:0:0.00:0
chr1 10151 gridss0f_3b T TTAACCCTAACCC. 422.48 ASSEMBLY_BIAS;LOW_QUAL;NO_RP AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=390.48;BASRP=16;BASSR=0;BEID=asm0-27779;BEIDH=-1;BEIDL=0;BMQ=35.00;BMQN=32.00;BMQX=38.00;BQ=422.48;BSC=1;BSCQ=32.00;BUM=0;BUMQ=0.00;BVF=17;CAS=0;CASQ=0.00;CQ=2040.61;EVENT=gridss0f_3;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=396;REFPAIR=1513;RP=0;RPQ=0.00;SB=1.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:8.827e-03:0.00:0:0:0:0.00:0:0.00:390.48:16:0:422.48:1:32.00:0:0.00:17:0.00:0:0.00:0.00:0.00:396:1513:0:0.00:0:0.00:0
chr1 10347 gridss0fb_6o A A[chr1:10359[ 122.80 LOW_QUAL;NO_ASSEMBLY AS=0;ASC=1X11N1X;ASQ=0.00;ASRP=0;ASSR=0;BA=0;BANRP=3;BANRPQ=43.78;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BMQ=38.07;BMQN=20.00;BMQX=60.00;BQ=1857.63;BSC=1;BSCQ=22.49;BUM=54;BUMQ=1835.14;BVF=55;CAS=0;CASQ=0.00;CIPOS=-6,6;CIRPOS=-6,6;CQ=122.80;EVENT=gridss0fb_6;HOMLEN=12;HOMSEQ=ACCCTAACCCTA;IC=2;IHOMPOS=-6,6;IQ=64.60;MATEID=gridss0fb_6h;MQ=32.17;MQN=22.00;MQX=40.00;RAS=0;RASQ=0.00;REF=327;REFPAIR=1314;RP=4;RPQ=58.20;SB=0.33333334;SC=104M125D41M17D44M1X11N1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=6 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.018:0.00:0:0:3:43.78:0:0.00:0.00:0:0:1857.63:1:22.49:54:1835.14:55:0.00:2:64.60:122.80:0.00:327:1314:4:58.20:0:0.00:6
chr1 10358 gridss0b_14b A .AACCCTAACCA 235.76 ASSEMBLY_BIAS;LOW_QUAL;NO_RP;NO_SR AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=235.76;BASRP=7;BASSR=1;BEID=asm0-2;BEIDH=-1;BEIDL=10;BMQ=40.00;BMQN=40.00;BMQX=40.00;BQ=235.76;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;BVF=8;CAS=0;CASQ=0.00;CQ=415.90;EVENT=gridss0b_14;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=695;REFPAIR=1120;RP=0;RPQ=0.00;SB=1.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:4.388e-03:0.00:0:0:0:0.00:0:0.00:235.76:7:1:235.76:0:0.00:0:0.00:8:0.00:0:0.00:0.00:0.00:695:1120:0:0.00:0:0.00:0
chr1 10359 gridss0fb_6h A ]chr1:10347]A 122.80 LOW_QUAL;NO_ASSEMBLY AS=0;ASC=1X5N1X;ASQ=0.00;ASRP=0;ASSR=0;BA=0;BANRP=3;BANRPQ=43.78;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BMQ=32.36;BMQN=21.00;BMQX=40.00;BQ=407.21;BSC=4;BSCQ=100.21;BUM=10;BUMQ=307.00;BVF=14;CAS=0;CASQ=0.00;CIPOS=-6,6;CIRPOS=-6,6;CQ=122.80;EVENT=gridss0fb_6;HOMLEN=12;HOMSEQ=ACCCTAACCCTA;IC=2;IHOMPOS=-6,6;IQ=64.60;MATEID=gridss0fb_6o;MQ=32.17;MQN=22.00;MQX=40.00;RAS=0;RASQ=0.00;REF=450;REFPAIR=981;RP=4;RPQ=58.20;SB=0.16666667;SC=1X5N1X102M;SR=0;SRQ=0.00;SVTYPE=BND;VF=6 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.013:0.00:0:0:3:43.78:0:0.00:0.00:0:0:407.21:4:100.21:10:307.00:14:0.00:2:64.60:122.80:0.00:450:981:4:58.20:0:0.00:6
chr1 10385 gridss0fb_8o C CCTAACCCT[chr1:10394[ 54.52 LOW_QUAL;SINGLE_ASSEMBLY AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=2;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm0-27781;BEIDH=0;BEIDL=0;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;BVF=0;CAS=0;CASQ=0.00;CQ=54.52;EVENT=gridss0fb_8;IC=0;IHOMPOS=0,0;IQ=0.00;MATEID=gridss0fb_8h;MQ=39.00;MQN=39.00;MQX=39.00;RAS=1;RASQ=54.52;REF=856;REFPAIR=840;RP=0;RPQ=0.00;SB=0.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=2 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:2.331e-03:0.00:0:2:0:0.00:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:0:0.00:54.52:54.52:856:840:0:0.00:0:0.00:2
chr1 10394 gridss0fb_8h T ]chr1:10385]CTAACCCTT 54.52 LOW_QUAL;SINGLE_ASSEMBLY AS=1;ASC=1X;ASQ=54.52;ASRP=0;ASSR=2;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm0-27781;BEIDH=0;BEIDL=0;BMQ=30.00;BMQN=30.00;BMQX=30.00;BQ=22.26;BSC=1;BSCQ=22.26;BUM=0;BUMQ=0.00;BVF=1;CAS=0;CASQ=0.00;CQ=54.52;EVENT=gridss0fb_8;IC=0;IHOMPOS=0,0;IQ=0.00;MATEID=gridss0fb_8o;MQ=39.00;MQN=39.00;MQX=39.00;RAS=0;RASQ=0.00;REF=731;REFPAIR=725;RP=0;RPQ=0.00;SB=0.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=2 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:2.729e-03:54.52:0:2:0:0.00:0:0.00:0.00:0:0:22.26:1:22.26:0:0.00:1:0.00:0:0.00:54.52:0.00:731:725:0:0.00:0:0.00:2
chr1 10548 gridss0ff_3o C C]chr15:101980741] 104.01 LOW_QUAL;SINGLE_ASSEMBLY AS=1;ASC=1X174N1X;ASQ=59.54;ASRP=4;ASSR=0;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm0-24;BEIDH=2;BEIDL=0;BMQ=30.21;BMQN=21.00;BMQX=49.00;BQ=354.49;BSC=7;BSCQ=169.49;BUM=7;BUMQ=185.00;BVF=14;CAS=0;CASQ=0.00;CIPOS=-87,88;CIRPOS=-87,88;CQ=104.01;EVENT=gridss0ff_3;IC=0;IMPRECISE;IQ=0.00;MATEID=gridss0ff_3h;MQ=26.25;MQN=20.00;MQX=38.00;RAS=0;RASQ=0.00;REF=1;REFPAIR=0;RP=3;RPQ=44.47;SB=0.0;SC=115M1X174N1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=4 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.800:59.54:4:0:0:0.00:0:0.00:0.00:0:0:354.49:7:169.49:7:185.00:14:0.00:0:0.00:104.01:0.00:1:0:3:44.47:0:0.00:4
-bash-4.2$
Code used to generate the file:
rm -f /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.*;
rm -r /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/;
mkdir -p /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/;
mkdir -p /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/;
module load gridss/2.12.2 samtools R java/17.0.2;
mkdir -p /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/;
gridss \
  -r /data/CCRBioinfo/dalgleishjl/sv_mapping/hg38_ref/hg38.fa \
  -w /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/ \
  -a /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/PALZGU_T_N_paired_assembly_hg38.bam \
  -o /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz \
  -t 32 \
  /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_N.bam \
  /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_T.bam;
[+] Loading gridss 2.12.2 on cn3160
[+] Loading singularity 3.8.5-1 on cn3160
[-] Unloading samtools 1.15 ...
[+] Loading samtools 1.15 ...
[+] Loading gcc 9.2.0 ...
[+] Loading GSL 2.6 for GCC 9.2.0 ...
[-] Unloading gcc 9.2.0 ...
[+] Loading gcc 9.2.0 ...
[+] Loading openmpi 4.0.5 for GCC 9.2.0
[+] Loading ImageMagick 7.0.8 on cn3160
[+] Loading HDF5 1.10.4
[-] Unloading gcc 9.2.0 ...
[+] Loading gcc 9.2.0 ...
[+] Loading NetCDF 4.7.4_gcc9.2.0
[+] Loading pandoc 2.17.1.1 on cn3160
[+] Loading pcre2 10.21 ...
[+] Loading R 4.1.3
[+] Loading java 17.0.2 ...
Using working directory "/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/"
Fri May 6 06:33:51 EDT 2022: Full log file is: /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/gridss.full.20220506_063351.cn3160.2696.log
Fri May 6 06:33:51 EDT 2022: Found /usr/bin/time
Fri May 6 06:33:51 EDT 2022: Using GRIDSS jar /opt/gridss/gridss-2.12.2-gridss-jar-with-dependencies.jar
Fri May 6 06:33:51 EDT 2022: Using reference genome "/data/CCRBioinfo/dalgleishjl/sv_mapping/hg38_ref/hg38.fa"
Fri May 6 06:33:51 EDT 2022: Using output VCF /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz
Fri May 6 06:33:51 EDT 2022: Using assembly bam /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/PALZGU_T_N_paired_assembly_hg38.bam
Fri May 6 06:33:51 EDT 2022: WARNING: GRIDSS scales sub-linearly at high thread count. Up to 8 threads is the recommended level of parallelism.
Fri May 6 06:33:51 EDT 2022: Using 32 worker threads.
Fri May 6 06:33:51 EDT 2022: Using no blacklist bed. The encode DAC blacklist is recommended for hg19.
Fri May 6 06:33:51 EDT 2022: Using JVM maximum heap size of 30g for assembly and variant calling.
Fri May 6 06:33:51 EDT 2022: Using input file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_N.bam
Fri May 6 06:33:51 EDT 2022: Using input file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_T.bam
Fri May 6 06:33:51 EDT 2022: Found /usr/bin/Rscript
Fri May 6 06:33:51 EDT 2022: Found /usr/bin/samtools
Fri May 6 06:33:51 EDT 2022: Found /usr/bin/java
Fri May 6 06:33:51 EDT 2022: Found /usr/bin/bwa
Fri May 6 06:33:51 EDT 2022: samtools version: 1.10+htslib-1.10.2-3
Fri May 6 06:33:51 EDT 2022: R version: R scripting front-end version 4.1.0 (2021-05-18)
Fri May 6 06:33:51 EDT 2022: bwa Version: 0.7.17-r1188
Fri May 6 06:33:51 EDT 2022: time version: GNU time 1.7
Fri May 6 06:33:51 EDT 2022: bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Fri May 6 06:33:52 EDT 2022: java version: openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04) OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
Fri May 6 06:33:52 EDT 2022: Max file handles: 131072
Fri May 6 06:33:52 EDT 2022: Running GRIDSS steps: setupreference, preprocess, assemble, call,
Fri May 6 06:33:52 EDT 2022: Start pre-processing /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_N.bam
Fri May 6 06:33:52 EDT 2022: Running CollectGridssMetrics /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_N.bam first 10000000 records
Fri May 6 06:34:48 EDT 2022: Running CollectGridssMetricsAndExtractSVReads|samtools /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_N.bam
Fri May 6 06:55:17 EDT 2022: Running PreprocessForBreakendAssembly|samtools /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_N.bam
Fri May 6 07:07:36 EDT 2022: Complete pre-processing /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_N.bam
Fri May 6 07:07:36 EDT 2022: Start pre-processing /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_T.bam
Fri May 6 07:07:36 EDT 2022: Running CollectGridssMetrics /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_T.bam first 10000000 records
Fri May 6 07:08:43 EDT 2022: Running CollectGridssMetricsAndExtractSVReads|samtools /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_T.bam
Fri May 6 07:28:58 EDT 2022: Running PreprocessForBreakendAssembly|samtools /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_T.bam
Fri May 6 07:40:38 EDT 2022: Complete pre-processing /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/bam_hg38/PALZGU_T.bam
Fri May 6 07:40:38 EDT 2022: Start assembly /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/PALZGU_T_N_paired_assembly_hg38.bam
Fri May 6 07:40:38 EDT 2022: Running AssembleBreakends /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/PALZGU_T_N_paired_assembly_hg38.bam job 0 total jobs 1
Fri May 6 08:17:15 EDT 2022: Running CollectGridssMetrics /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/PALZGU_T_N_paired_assembly_hg38.bam
Fri May 6 08:17:30 EDT 2022: Running SoftClipsToSplitReads|samtools /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/PALZGU_T_N_paired_assembly_hg38.bam
Fri May 6 08:21:58 EDT 2022: Complete assembly /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired_hg38/PALZGU_T_N_paired_assembly_hg38.bam
Fri May 6 08:21:58 EDT 2022: Start calling /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz
Fri May 6 08:21:58 EDT 2022: Running IdentifyVariants /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz
Fri May 6 08:26:58 EDT 2022: Running AnnotateVariants /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz
Fri May 6 09:03:24 EDT 2022: Running AnnotateInsertedSequence /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz
Fri May 6 09:05:28 EDT 2022: Complete calling /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz
Fri May 6 09:05:28 EDT 2022: Run complete with 80 warnings and 0 errors.
> I've replicated this issue with GRIDSS 2.12.2 and 2.13.2
Is the output VCF actually malformed, or is GRIPSS just complaining that it is? If you decompress the .vcf.gz, a) do you get any EOF errors when decompressing, and b) does GRIPSS still complain about a malformed input file if you feed it the uncompressed .vcf?
The GRIDSS log files give no indication that there was any issue on the GRIDSS side. Given that all the GRIDSS intermediate steps read the intermediate VCFs without issue, it's likely that the cause is either in the final GRIDSS annotation step, with GRIPSS, or somehow with the pipeline structure/execution environment. Running GRIDSS with --keepTempFiles can be helpful with this sort of root cause analysis, as it allows you to inspect all the intermediate files that GRIDSS uses and verify at which point something has gone wrong.
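The truncation question above can be checked from the shell before GRIPSS is involved at all. A minimal sketch, assuming the output is ordinary gzip or block-compressed data and that `gzip` is on the PATH; `check_vcf_gz` is a hypothetical helper name, not part of GRIDSS or GRIPSS:

```shell
# check_vcf_gz FILE
# Prints OK if FILE decompresses cleanly, TRUNCATED_OR_CORRUPT otherwise.
check_vcf_gz() {
    # gzip -t reads the whole compressed stream without writing any output
    # and exits non-zero on errors such as "unexpected end of file".
    if gzip -t "$1" 2>/dev/null; then
        echo "OK"
    else
        echo "TRUNCATED_OR_CORRUPT"
    fi
}
```

A block-compressed file that was cut exactly at a block boundary can still pass this test, so a clean result is necessary rather than sufficient.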
> notice the first line in the ALT field, there's a period there and GRIPSS then calls the file malformed
That's perfectly valid VCF (see section 5.4.9 of https://samtools.github.io/hts-specs/VCFv4.3.pdf), and GRIPSS is designed to handle VCFs that include single breakend variants. Any suggestions @charlesshale?
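To make the notation concrete, the sketch below sorts ALT strings into the two breakend forms that appear in the records earlier in this thread: a leading or trailing period marks a single breakend, and square brackets around a mate position mark a paired breakpoint. It is a simplified illustration, not how GRIPSS or htsjdk parse ALT; `classify_alt` is a hypothetical helper.

```shell
# classify_alt ALT
# Prints breakpoint, single_breakend or other for one VCF ALT string.
classify_alt() {
    case "$1" in
        # Brackets are tested first because a mate contig name may itself
        # contain a period.
        *\[*|*\]*) echo "breakpoint" ;;
        # A period with sequence after it (".ACGT") or before it ("ACGT.").
        .?*|?*.)   echo "single_breakend" ;;
        *)         echo "other" ;;
    esac
}

classify_alt '.AACCCTAACCN'    # single_breakend
classify_alt 'A[chr1:10359['   # breakpoint
```

A lone `.` (missing ALT) falls through to `other`, since both single breakend patterns require at least one base next to the period.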
The only unusual thing that I can see on the GRIDSS side is WORKER_THREADS=32 instead of the recommended --threads 8 --jvmheap 31g, so there's potentially a hidden race condition that only shows up at high levels of parallelism, but if the output .vcf.gz isn't truncated then that's not going to be the cause. The usual symptom when there are too many threads is progress stalling, followed by OutOfMemory: GC overhead limit exceeded. If it has run to completion without error, it's not going to be that.
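As an illustration of the recommended settings, the sketch below only prints a GRIDSS command line that uses --threads 8 --jvmheap 31g, with the remaining options (-r, -w, -a, -o) taken from the command posted earlier in this thread. `print_gridss_cmd` and the file names in the usage line are hypothetical.

```shell
# print_gridss_cmd REF WORKDIR ASSEMBLY_BAM OUTPUT_VCF BAM...
# Prints the command instead of running it, so it can be reviewed first.
print_gridss_cmd() {
    local ref="$1"
    local workdir="$2"
    local assembly="$3"
    local output="$4"
    shift 4
    # "$@" now holds the input BAMs, kept in the order they were given.
    echo gridss --threads 8 --jvmheap 31g \
        -r "$ref" -w "$workdir" -a "$assembly" -o "$output" "$@"
}

print_gridss_cmd hg38.fa workdir/ assembly.bam output.vcf.gz normal.bam tumor.bam
```

Printing rather than executing keeps the example free of side effects; the printed line can be pasted into a job script once the paths are correct.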
At this point, I'm no longer experiencing truncated VCFs. It could have been a file system issue at the time; it's really hard to say. Maybe it was too many threads or not enough RAM. Thanks for the insight that you gave earlier. Should I consider using a larger heap size (and more RAM allocated at job submission) as well as only 8 threads? This is what I experienced at the time ("unexpected end of file"):
(base) [dalgleishjl@cn0904 snakemake-gridss]$ zcat /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALHRL_T_N_gridss_paired/gridss_PALHRL_T_N_paired_output_hg38.vcf.gz | tail
gzip: /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALHRL_T_N_gridss_paired/gridss_PALHRL_T_N_paired_output_hg38.vcf.gz: unexpected end of file
chr1 790807 gridss0b_152b G .GAATGGACTCAAATGGAATAGAATTGACTCGAGTGGAAAG 223.84 ASSEMBLY_BIAS;LOW_QUAL;NO_RP AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm0-16488;BEIDH=-1;BEIDL=103;BMQ=28.73;BMQN=20.00;BMQX=40.00;BQ=223.84;BSC=10;BSCQ=223.84;BUM=0;BUMQ=0.00;BVF=10;CAS=0;CASQ=0.00;CQ=286.84;EVENT=gridss0b_152;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=345;REFPAIR=224;RP=0;RPQ=0.00;SB=1.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:5.814e-03:0.00:0:0:0:0.00:0:0.00:0.00:0:0:40.79:2:40.79:0:0.00:2:0.00:0:0.00:0.00:0.00:213:129:0:0.00:0:0.00:0 .:0.034:0.00:0:0:0:0.00:0:0.00:0.00:0:0:183.05:8:183.05:0:0.00:8:0.00:0:0.00:0.00:0.00:132:95:0:0.00:0:0.00:0
chr1 790853 gridss0f_147b G GTTTGGAAAGGACAAAAATGGAATGGAATAGAATGGAATGGAATGGAATG. 500.27 LOW_QUAL AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=159.08;BASRP=8;BASSR=0;BEALN=chr14_GL000225v1_random:193015|+|8S41M|0;BEID=asm0-283;BEIDH=-1;BEIDL=0;BMQ=29.43;BMQN=20.00;BMQX=51.00;BQ=500.27;BSC=11;BSCQ=291.19;BUM=2;BUMQ=50.00;BVF=11;CAS=0;CASQ=0.00;CQ=758.48;EVENT=gridss0f_147;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=182;REFPAIR=210;RP=0;RPQ=0.00;SB=0.90909094;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:8.584e-03:0.00:0:0:0:0.00:0:0.00:36.93:2:0:87.87:2:50.93:0:0.00:2:0.00:0:0.00:0.00:0.00:118:113:0:0.00:0:0.00:0 .:0.053:0.00:0:0:0:0.00:0:0.00:122.14:6:0:412.40:9:240.25:2:50.00:9:0.00:0:0.00:0.00:0.00:64:97:0:0.00:0:0.00:0
chr1 790955 gridss0fb_128o A A[chr1:790966[ 1726.77 NO_ASSEMBLY AS=0;ASC=1X44N1X;ASQ=0.00;ASRP=0;ASSR=0;BA=0;BANRP=9;BANRPQ=163.55;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BMQ=25.00;BMQN=22.00;BMQX=27.00;BQ=67.09;BSC=1;BSCQ=18.09;BUM=2;BUMQ=49.00;BVF=3;CAS=0;CASQ=0.00;CIPOS=-22,23;CIRPOS=-22,23;CQ=1726.77;EVENT=gridss0fb_128;HOMLEN=45;HOMSEQ=GAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATG;IC=46;IHOMPOS=-10,10;IQ=1508.63;MATEID=gridss0fb_128h;MQ=52.48;MQN=20.00;MQX=60.00;RAS=0;RASQ=0.00;REF=275;REFPAIR=164;RP=12;RPQ=218.15;SB=0.34042552;SC=68M59D95M37D42M1X44N1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=58 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.135:0.00:0:0:2:36.93:0:0.00:0.00:0:0:67.09:1:18.09:2:49.00:3:0.00:23:734.44:771.38:0.00:160:96:2:36.93:0:0.00:25 .:0.223:0.00:0:0:7:126.62:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:23:774.18:955.40:0.00:115:68:10:181.21:0:0.00:33
chr1 790966 gridss0fb_128h A ]chr1:790955]A 1726.77 NO_ASSEMBLY AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BMQ=40.00;BMQN=40.00;BMQX=40.00;BQ=39.74;BSC=0;BSCQ=0.00;BUM=1;BUMQ=39.74;BVF=1;CAS=0;CASQ=0.00;CIPOS=-22,23;CIRPOS=-22,23;CQ=1726.77;EVENT=gridss0fb_128;HOMLEN=45;HOMSEQ=GAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATG;IC=46;IHOMPOS=-10,10;IQ=1508.63;MATEID=gridss0fb_128o;MQ=52.48;MQN=20.00;MQX=60.00;RAS=0;RASQ=0.00;REF=179;REFPAIR=164;RP=12;RPQ=218.15;SB=0.3478261;SC=1X66M138D102M17D27M;SR=0;SRQ=0.00;SVTYPE=BND;VF=58 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.185:0.00:0:0:0:0.00:0:0.00:0.00:0:0:39.74:0:0.00:1:39.74:1:0.00:23:734.44:771.38:0.00:110:96:2:36.93:0:0.00:25 .:0.324:0.00:0:0:0:0.00:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:23:774.18:955.40:0.00:69:68:10:181.21:0:0.00:33
chr1 790985 gridss0f_152b A ACACAAATTGAATGGAATGAAATGGAAC. 437.29 LOW_QUAL;NO_RP AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=170.02;BASRP=0;BASSR=8;BEID=asm0-212;BEIDH=-1;BEIDL=0;BMQ=52.43;BMQN=24.00;BMQX=60.00;BQ=437.29;BSC=13;BSCQ=267.27;BUM=0;BUMQ=0.00;BVF=13;CAS=0;CASQ=0.00;CQ=493.24;EVENT=gridss0f_152;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=262;REFPAIR=244;RP=0;RPQ=0.00;SB=0.2857143;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.023:0.00:0:0:0:0.00:0:0.00:107.53:0:5:252.39:7:144.87:0:0.00:7:0.00:0:0.00:0.00:0.00:151:140:0:0.00:0:0.00:0 .:0.027:0.00:0:0:0:0.00:0:0.00:62.49:0:3:184.90:6:122.41:0:0.00:6:0.00:0:0.00:0.00:0.00:111:104:0:0.00:0:0.00:0
chr1 791240 gridss0fb_135o G GCACGT[chr1:791246[ 143.57 LOW_QUAL;SINGLE_ASSEMBLY AS=1;ASC=1X;ASQ=118.57;ASRP=0;ASSR=5;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm0-159;BEIDH=0;BEIDL=0;BMQ=35.33;BMQN=33.00;BMQX=40.00;BQ=76.38;BSC=3;BSCQ=76.38;BUM=0;BUMQ=0.00;BVF=1;CAS=0;CASQ=0.00;CQ=143.57;EVENT=gridss0fb_135;IC=1;IHOMPOS=0,0;IQ=25.00;MATEID=gridss0fb_135h;MQ=32.50;MQN=25.00;MQX=40.00;RAS=0;RASQ=0.00;REF=768;REFPAIR=229;RP=0;RPQ=0.00;SB=1.0;SC=35M5D9M1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=5 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:4.566e-03:49.82:0:2:0:0.00:0:0.00:0.00:0:0:32.63:1:32.63:0:0.00:1:0.00:0:0.00:49.82:0.00:436:119:0:0.00:0:0.00:2 .:8.955e-03:68.75:0:3:0:0.00:0:0.00:0.00:0:0:43.75:2:43.75:0:0.00:0:0.00:1:25.00:93.75:0.00:332:110:0:0.00:0:0.00:3
chr1 791246 gridss0fb_135h G ]chr1:791240]CACGTG 143.57 LOW_QUAL;SINGLE_ASSEMBLY AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=5;BA=0;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm0-159;BEIDH=0;BEIDL=0;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;BVF=0;CAS=0;CASQ=0.00;CQ=143.57;EVENT=gridss0fb_135;IC=1;IHOMPOS=0,0;IQ=25.00;MATEID=gridss0fb_135o;MQ=32.50;MQN=25.00;MQX=40.00;RAS=1;RASQ=118.57;REF=650;REFPAIR=233;RP=0;RPQ=0.00;SB=1.0;SC=1X23M;SR=0;SRQ=0.00;SVTYPE=BND;VF=5 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:5.376e-03:0.00:0:2:0:0.00:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:0:0.00:49.82:49.82:370:118:0:0.00:0:0.00:2 .:0.011:0.00:0:3:0:0.00:0:0.00:0.00:0:0:0.00:0:0.00:0:0.00:0:0.00:1:25.00:93.75:68.75:280:115:0:0.00:0:0.00:3
chr1 791320 gridss0b_171b G .ATGGAAAGGAATGGACCCGAATATCATGGAATAGAATGCAAAGG 668.99 ASSEMBLY_BIAS;LOW_QUAL AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm0-16480;BEIDH=-1;BEIDL=248;BMQ=36.54;BMQN=21.00;BMQX=60.00;BQ=668.99;BSC=11;BSCQ=291.99;BUM=12;BUMQ=377.00;BVF=23;CAS=0;CASQ=0.00;CQ=3324.09;EVENT=gridss0b_171;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=111;REFPAIR=431;RP=0;RPQ=0.00;SB=1.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.043:0.00:0:0:0:0.00:0:0.00:0.00:0:0:341.24:10:262.24:3:79.00:13:0.00:0:0.00:0.00:0.00:68:220:0:0.00:0:0.00:0 .:0.038:0.00:0:0:0:0.00:0:0.00:0.00:0:0:327.76:1:29.76:9:298.00:10:0.00:0:0.00:0.00:0.00:43:211:0:0.00:0:0.00:0
chr1 791361 gridss0f_168b A ATTTCAATGGACTTGAAAACAATGGAATGGAAGACAATGGAATG. 585.92 LOW_QUAL AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=188.62;BASRP=7;BASSR=0;BEID=asm0-279;BEIDH=-1;BEIDL=0;BMQ=38.80;BMQN=26.00;BMQX=45.00;BQ=585.92;BSC=10;BSCQ=264.20;BUM=4;BUMQ=133.09;BVF=17;CAS=0;CASQ=0.00;CQ=1263.92;EVENT=gridss0f_168;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=245;REFPAIR=344;RP=0;RPQ=0.00;SB=1.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.042:0.00:0:0:0:0.00:0:0.00:121.53:5:0:426.23:9:238.69:2:66.00:14:0.00:0:0.00:0.00:0.00:137:181:0:0.00:0:0.00:0 .:0.011:0.00:0:0:0:0.00:0:0.00:67.09:2:0:159.69:1:25.51:2:67.09:3:0.00:0:0.00:0.00:0.00:108:163:0:0.00:0:0.00:0
chr1 791543 gridss0b_173b T .GGACACAAATGGAATGGAAT 1357.74 LOW_QUAL AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=5;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=845.55;BASRP=14;BASSR=25;BEID=asm0-16483,asm0-16487,asm0-16517,asm0-16524,asm0-167;BEIDH=-1,-1,-1,-1,-1;BEIDL=120,159,19,18,9;BMQ=36.41;BMQN=22.00;BMQX=60.00;BQ=1357.74;BSC=17;BSCQ=361.19;BUM=5;BUMQ=151.00;BVF=44;CAS=0;CASQ=0.00;CQ=3699.29;EVENT=gridss0b_173;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=386;REFPAIR=325;RP=0;RPQ=0.00;SB=0.1904762;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.078:0.00:0:0:0:0.00:0:0.00:614.89:13:16:1004.52:11:238.62:5:151.00:34:0.00:0:0.00:0.00:0.00:232:170:0:0.00:0:0.00:0 .:0.031:0.00:0:0:0:0.00:0:0.00:230.65:1:9:353.22:6:122.57:0:0.00:1.00:1.00:1.00:124;RP=0;RPQ=0.00;11111111116;1117P=7;BASSR=0;BEID=as1=32.5Q=845.55;1111111111111111Q:211:0:0111111111116;1117Q=585.92;BSC=10;BSC100;SBasm0-167;1111111111111111QTGGAAGAC111111111116;1117.92;EVENT=gridss0f_1ANRPQX=60.00;B1111111111111111QANRP=0;B111111111116;1117;RPQ=0.00;SB=1.0;SC1RASQ:00;CQ=3691111111111111111Q279;BEID111111111116;1117RP:BANRPQ:BANSR:BAN10:0.0R=325;RP=1111111111111111Q64.20;BU111111111116;1117UAL:RASQ:REF:REFPAI111:0.SQ:ASRP:A11111111111111
What I am now dealing with is this malformed-file error. The GRIPSS output below calls the file malformed and gives specifics:
15:24:07.951 [WARN ] SV PON not ordered: last(157419-157419) vs this(157394-157424)
15:24:07.951 [WARN ] SV PON not ordered: last(165331-165331) vs this(165306-165346)
15:24:07.951 [INFO ] loaded 3103381 germline SV PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_breakpoint.38.bedpe)
15:24:08.857 [INFO ] loaded 1520513 germline SGL PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_single_breakend.38.bed)
15:24:08.872 [INFO ] loaded 446 known hotspot records from file
15:24:09.031 [INFO ] sample(PALZGU_T) processing VCF(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.repeat.vcf.gz)
15:24:09.033 [INFO ] genetype info: ref(0: PALZGU_N) tumor(1: PALZGU_T)
Exception in thread "main" htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 2894: there are 1 genotypes while the header requires that 2 genotypes be present for all records at chr1:10000
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:759)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:121)
at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
at htsjdk.variant.variantcontext.GenotypesContext.get(GenotypesContext.java:417)
at htsjdk.variant.variantcontext.VariantContext.getGenotype(VariantContext.java:1102)
at com.hartwig.hmftools.gripss.filters.HardFilters.belowMinQual(HardFilters.java:46)
at com.hartwig.hmftools.gripss.filters.HardFilters.isFiltered(HardFilters.java:35)
at com.hartwig.hmftools.gripss.VariantBuilder.checkCreateVariant(VariantBuilder.java:59)
at com.hartwig.hmftools.gripss.GripssApplication.processVariant(GripssApplication.java:307)
at com.hartwig.hmftools.gripss.GripssApplication.lambda$processVcf$0(GripssApplication.java:141)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at com.hartwig.hmftools.gripss.GripssApplication.processVcf(GripssApplication.java:141)
at com.hartwig.hmftools.gripss.GripssApplication.run(GripssApplication.java:108)
at com.hartwig.hmftools.gripss.GripssApplication.main(GripssApplication.java:336)
(base) [dalgleishjl@cn3160 snakemake-gridss]$ gzcat /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.repeat.vcf.gz | head -n 2894 | tail -n 10
##contig=<ID=HPV-mKN1,length=7300>
##contig=<ID=HPV-mKN2,length=7299>
##contig=<ID=HPV-mKN3,length=7251>
##contig=<ID=HPV-mL55,length=7177>
##contig=<ID=HPV-mRTRX7,length=7731>
##contig=<ID=HPV-mSD2,length=7300>
##gridssVersion=2.12.2-gridss
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PALZGU_N PALZGU_T
chr1 10000 gridss0b_1b N .AACCCTAACCN 4500.73 NO_SR AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=2259.04;BASRP=89;BASSR=0;BEID=asm0-27782;BEIDH=-1;BEIDL=10;BMQ=26.25;BMQN=20.00;BMQX=42.00;BQ=4500.73;BSC=0;BSCQ=0.00;BUM=86;BUMQ=2241.69;BVF=92;CAS=0;CASQ=0.00;CQ=4832.73;EVENT=gridss0b_1;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=23;REFPAIR=0;RP=0;RPQ=0.00;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.800:0.00:0:0:0:0.00:0:0.00:2259.04:89:0:4500.73:0:0.00:86:2241.69:92:0.00:0:0.00:0.00:0.00:23:0:0:0.00:0:0.00:0
chr1 10151 gridss0f_3b T TTAACCCTAACCC. 422.48 ASSEMBLY_BIAS;LOW_QUAL;NO_RP AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=390.48;BASRP=16;BASSR=0;BEID=asm0-27779;BEIDH=-1;BEIDL=0;BMQ=35.00;BMQN=32.00;BMQX=38.00;BQ=422.48;BSC=1;BSCQ=32.00;BUM=0;BUMQ=0.00;BVF=17;CAS=0;CASQ=0.00;CQ=2040.61;EVENT=gridss0f_3;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=396;REFPAIR=1513;RP=0;RPQ=0.00;SB=1.0;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:8.827e-03:0.00:0:0:0:0.00:0:0.00:390.48:16:0:422.48:1:32.00:0:0.00:17:0.00:0:0.00:0.00:0.00:396:1513:0:0.00:0:0.00:0
(base) [dalgleishjl@cn3160 snakemake-gridss]$
This is the code that generated it:
module load gridss samtools R java/17.0.2 bcftools repeatmasker;
chmod -w /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz;
echo 'PALZGU_N' > PALZGU_sample_names.txt;
echo 'PALZGU_T' >> PALZGU_sample_names.txt;
#REHEADER with correct samples
bcftools reheader /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz -s PALZGU_sample_names.txt -o /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.reheadered.vcf.gz;
cp /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.reheadered.vcf.gz /lscratch/$SLURM_JOB_ID/;
mkdir -p /lscratch/$SLURM_JOBID/repeatmasker/;
/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/scripts/gridss_annotate_vcf_repeatmasker -w /lscratch/$SLURM_JOBID/repeatmasker -t 32 -j /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_pipeline/gridss-2.13.2-gridss-jar-with-dependencies.jar -o /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.repeat.vcf.gz /lscratch/$SLURM_JOB_ID/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.reheadered.vcf.gz;
module load gridss samtools R java/17.0.2 bcftools repeatmasker;
java -jar /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/gripss/gripss_v2.1.jar \
  -sample PALZGU_T \
  -reference PALZGU_N \
  -ref_genome /data/CCRBioinfo/dalgleishjl/sv_mapping/hg38_ref/hg38.fa \
  -pon_sv_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_breakpoint.38.bedpe \
  -pon_sgl_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_single_breakend.38.bed \
  -known_hotspot_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/external_resources/HMFTools-Resources/Known-Fusions/38/known_fusions.38.bedpe \
  -vcf /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.repeat.vcf.gz \
  -output_dir /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/;
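The htsjdk message (1 genotype where the header requires 2) matches what is visible in the output above: the #CHROM line lists PALZGU_N and PALZGU_T, while the chr1:10000 record carries a single genotype column. A minimal sketch for locating the first such record in any VCF, assuming tab-separated input on stdin; `check_sample_columns` is a hypothetical helper and makes no claim about which pipeline step introduced the mismatch:

```shell
# check_sample_columns < file.vcf
# Reports the first record whose column count differs from the #CHROM line.
# Exit status is 0 when every record matches and 1 on the first mismatch.
check_sample_columns() {
    awk -F'\t' '
        /^##/     { next }                  # meta-information lines
        /^#CHROM/ { expected = NF; next }   # 9 fixed columns plus one per sample
        NF != expected {
            printf "%s:%s has %d sample column(s), header declares %d\n", $1, $2, NF - 9, expected - 9
            bad = 1
            exit
        }
        END { exit bad + 0 }
    '
}
```

For a compressed file the input can be piped in, for example `gzip -cd output.vcf.gz | check_sample_columns`. Running it on the VCF before and after each step (reheader, RepeatMasker annotation) would show where the header and the records stop agreeing.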
Trying the uncompressed version gives the same error:
15:37:35.550 [WARN ] SV PON not ordered: last(165331-165331) vs this(165306-165346)
15:37:35.551 [INFO ] loaded 3103381 germline SV PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_breakpoint.38.bedpe)
15:37:36.435 [INFO ] loaded 1520513 germline SGL PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_single_breakend.38.bed)
15:37:36.437 [INFO ] loaded 446 known hotspot records from file
15:37:36.498 [INFO ] sample(PALZGU_T) processing VCF(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.repeat.vcf)
15:37:36.501 [INFO ] genetype info: ref(0: PALZGU_N) tumor(1: PALZGU_T)
Exception in thread "main" htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 2894: there are 1 genotypes while the header requires that 2 genotypes be present for all records at chr1:10000
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:759)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:121)
at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
at htsjdk.variant.variantcontext.GenotypesContext.get(GenotypesContext.java:417)
at htsjdk.variant.variantcontext.VariantContext.getGenotype(VariantContext.java:1102)
at com.hartwig.hmftools.gripss.filters.HardFilters.belowMinQual(HardFilters.java:46)
at com.hartwig.hmftools.gripss.filters.HardFilters.isFiltered(HardFilters.java:35)
at com.hartwig.hmftools.gripss.VariantBuilder.checkCreateVariant(VariantBuilder.java:59)
at com.hartwig.hmftools.gripss.GripssApplication.processVariant(GripssApplication.java:307)
at com.hartwig.hmftools.gripss.GripssApplication.lambda$processVcf$0(GripssApplication.java:141)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at com.hartwig.hmftools.gripss.GripssApplication.processVcf(GripssApplication.java:141)
at com.hartwig.hmftools.gripss.GripssApplication.run(GripssApplication.java:108)
at com.hartwig.hmftools.gripss.GripssApplication.main(GripssApplication.java:336)
Trying the version without the reheadering results in an error saying the expected sample names are missing from the VCF:
(base) [dalgleishjl@cn3160 snakemake-gridss]$ module load gridss samtools R java/17.0.2 bcftools repeatmasker; java -jar /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/gripss/gripss_v2.1.jar -sample PALZGU_T -reference PALZGU_N -ref_genome /data/CCRBioinfo/dalgleishjl/sv_mapping/hg38_ref/hg38.fa -pon_sv_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_breakpoint.38.bedpe -pon_sgl_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_single_breakend.38.bed -known_hotspot_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/external_resources/HMFTools-Resources/Known-Fusions/38/known_fusions.38.bedpe -vcf /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz -output_dir /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/;
15:38:50.205 [INFO ] loaded 3103381 germline SV PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_breakpoint.38.bedpe)
15:38:51.090 [INFO ] loaded 1520513 germline SGL PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_single_breakend.38.bed)
15:38:51.092 [INFO ] loaded 446 known hotspot records from file
15:38:51.340 [INFO ] sample(PALZGU_T) processing VCF(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz)
**15:38:51.341 [ERROR] missing sample names in VCF: [sample]**
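The `[sample]` in that error suggests the un-reheadered VCF declares a single sample column literally named `sample`, which matches the `#CHROM` line quoted at the top of this issue. A minimal sketch for confirming what a VCF header declares is below. The function name `vcf_header_samples` is made up for illustration, and it assumes a tab-delimited VCF on stdin; `bcftools query -l` should report the same list.

```shell
# Print the sample column names declared on the #CHROM header line of a VCF
# read from stdin, one per line. Columns 1-9 are the fixed VCF columns
# (CHROM through FORMAT), so sample names start at column 10.
# NOTE: vcf_header_samples is a hypothetical helper, not part of any tool.
vcf_header_samples() {
  awk -F'\t' '
    /^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }
  '
}

# Example (the path is a placeholder):
#   zcat gridss_output.vcf.gz | vcf_header_samples
```

If this prints one name while two were passed to `bcftools reheader -s`, the header and the records may end up disagreeing on the number of genotype columns.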
Trying the reheadered version that has not had the repeats annotated results in the same genotype-count error as the first run:
(base) [dalgleishjl@cn3160 snakemake-gridss]$ module load gridss samtools R java/17.0.2 bcftools repeatmasker; java -jar /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/gripss/gripss_v2.1.jar -sample PALZGU_T -reference PALZGU_N -ref_genome /data/CCRBioinfo/dalgleishjl/sv_mapping/hg38_ref/hg38.fa -pon_sv_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_breakpoint.38.bedpe -pon_sgl_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_single_breakend.38.bed -known_hotspot_file /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/external_resources/HMFTools-Resources/Known-Fusions/38/known_fusions.38.bedpe -vcf /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.reheadered.vcf.gz -output_dir /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/;
...
15:40:50.833 [WARN ] SV PON not ordered: last(157419-157419) vs this(157394-157424)
15:40:50.833 [WARN ] SV PON not ordered: last(165331-165331) vs this(165306-165346)
15:40:50.833 [INFO ] loaded 3103381 germline SV PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_breakpoint.38.bedpe)
15:40:51.741 [INFO ] loaded 1520513 germline SGL PON records from file(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss/ponhg38/gridss_pon_single_breakend.38.bed)
15:40:51.743 [INFO ] loaded 446 known hotspot records from file
15:40:51.845 [INFO ] sample(PALZGU_T) processing VCF(/data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.reheadered.vcf.gz)
15:40:51.847 [INFO ] genetype info: ref(0: PALZGU_N) tumor(1: PALZGU_T)
Exception in thread "main" htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 2894: there are 1 genotypes while the header requires that 2 genotypes be present for all records at chr1:10000
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:759)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:121)
at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
at htsjdk.variant.variantcontext.GenotypesContext.get(GenotypesContext.java:417)
at htsjdk.variant.variantcontext.VariantContext.getGenotype(VariantContext.java:1102)
at com.hartwig.hmftools.gripss.filters.HardFilters.belowMinQual(HardFilters.java:46)
at com.hartwig.hmftools.gripss.filters.HardFilters.isFiltered(HardFilters.java:35)
at com.hartwig.hmftools.gripss.VariantBuilder.checkCreateVariant(VariantBuilder.java:59)
at com.hartwig.hmftools.gripss.GripssApplication.processVariant(GripssApplication.java:307)
at com.hartwig.hmftools.gripss.GripssApplication.lambda$processVcf$0(GripssApplication.java:141)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at com.hartwig.hmftools.gripss.GripssApplication.processVcf(GripssApplication.java:141)
at com.hartwig.hmftools.gripss.GripssApplication.run(GripssApplication.java:108)
at com.hartwig.hmftools.gripss.GripssApplication.main(GripssApplication.java:336)
I can try looking at temp files if you like. What specifically should I look at?
I also stumbled on this. Maybe downstream tools can handle it, but vcf-validator flags a single line (2893) for having the wrong number of fields: it expected 11 and got 10. Maybe this is a good lead for finding what's wrong. I hope so!
(base) [dalgleishjl@cn3160 snakemake-gridss]$ vcf-validator /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.reheadered.vcf.gz
The header tag 'reference' not present. (Not required but highly recommended.)
Wrong number of fieldsin /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.reheadered.vcf.gz; expected 11, got 10. The offending line was:
[chr1 10000 gridss0b_1b N .AACCCTAACCN 4500.73 NO_SR AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=2259.04;BASRP=89;BASSR=0;BEID=asm0-27782;BEIDH=-1;BEIDL=10;BMQ=26.25;BMQN=20.00;BMQX=42.00;BQ=4500.73;BSC=0;BSCQ=0.00;BUM=86;BUMQ=2241.69;BVF=92;CAS=0;CASQ=0.00;CQ=4832.73;EVENT=gridss0b_1;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=23;REFPAIR=0;RP=0;RPQ=0.00;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.800:0.00:0:0:0:0.00:0:0.00:2259.04:89:0:4500.73:0:0.00:86:2241.69:92:0.00:0:0.00:0.00:0.00:23:0:0:0.00:0:0.00:0]
at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 172, <__ANONIO__> line 2893.
Vcf::throw(Vcf4_2=HASH(0x813f88), "Wrong number of fieldsin /data/CCRBioinfo/dalgleishjl/sv_mapp"...) called at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 507
VcfReader::next_data_hash(Vcf4_2=HASH(0x813f88), ARRAY(0xaaec88)) called at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 3479
Vcf4_1::next_data_hash(Vcf4_2=HASH(0x813f88), ARRAY(0xaaec88)) called at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 2586
VcfReader::run_validation(Vcf4_2=HASH(0x813f88)) called at /usr/local/apps/vcftools/0.1.16/bin/vcf-validator line 60
main::do_validation(HASH(0x7d3e18)) called at /usr/local/apps/vcftools/0.1.16/bin/vcf-validator line 14
(base) [dalgleishjl@cn3160 snakemake-gridss]$ vcf-validator /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.repeat.vcf.gz
The header tag 'reference' not present. (Not required but highly recommended.)
Wrong number of fieldsin /data/CCRBioinfo/dalgleishjl/sv_mapping/gridss_purple_linx/PALZGU_T_N_gridss_paired/gridss_PALZGU_T_N_paired_output_hg38.vcf.gz.repeat.vcf.gz; expected 11, got 10. The offending line was:
[chr1 10000 gridss0b_1b N .AACCCTAACCN 4500.73 NO_SR AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=1;BANRP=0;BANRPQ=0.00;BANSR=0;BANSRQ=0.00;BAQ=2259.04;BASRP=89;BASSR=0;BEID=asm0-27782;BEIDH=-1;BEIDL=10;BMQ=26.25;BMQN=20.00;BMQX=42.00;BQ=4500.73;BSC=0;BSCQ=0.00;BUM=86;BUMQ=2241.69;BVF=92;CAS=0;CASQ=0.00;CQ=4832.73;EVENT=gridss0b_1;IC=0;IQ=0.00;RAS=0;RASQ=0.00;REF=23;REFPAIR=0;RP=0;RPQ=0.00;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=0 GT:AF:ASQ:ASRP:ASSR:BANRP:BANRPQ:BANSR:BANSRQ:BAQ:BASRP:BASSR:BQ:BSC:BSCQ:BUM:BUMQ:BVF:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ:VF .:0.800:0.00:0:0:0:0.00:0:0.00:2259.04:89:0:4500.73:0:0.00:86:2241.69:92:0.00:0:0.00:0.00:0.00:23:0:0:0.00:0:0.00:0]
at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 172, <__ANONIO__> line 2893.
Vcf::throw(Vcf4_2=HASH(0x813f88), "Wrong number of fieldsin /data/CCRBioinfo/dalgleishjl/sv_mapp"...) called at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 507
VcfReader::next_data_hash(Vcf4_2=HASH(0x813f88), ARRAY(0xaaef18)) called at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 3479
Vcf4_1::next_data_hash(Vcf4_2=HASH(0x813f88), ARRAY(0xaaef18)) called at /usr/local/apps/vcftools/0.1.16/lib/perl5/site_perl/5.24.3/Vcf.pm line 2586
VcfReader::run_validation(Vcf4_2=HASH(0x813f88)) called at /usr/local/apps/vcftools/0.1.16/bin/vcf-validator line 60
main::do_validation(HASH(0x7d3e18)) called at /usr/local/apps/vcftools/0.1.16/bin/vcf-validator line 14
(base) [dalgleishjl@cn3160 snakemake-gridss]$
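vcf-validator appears to stop at the first offending record, so the output above does not show whether line 2893 is the only short record or just the first one. A rough sketch for listing every record whose field count disagrees with the `#CHROM` line follows. The function name `vcf_field_mismatches` is hypothetical, and it assumes an uncompressed, tab-delimited VCF on stdin.

```shell
# Report data lines whose tab-separated field count differs from the
# #CHROM header line. Reads an uncompressed VCF on stdin.
# Output per mismatch: "<line number> <fields found> <fields expected>".
# NOTE: vcf_field_mismatches is a hypothetical helper, not part of any tool.
vcf_field_mismatches() {
  awk -F'\t' '
    /^##/     { next }                  # skip meta-information lines
    /^#CHROM/ { expected = NF; next }   # the header fixes the column count
    NF != expected { print NR, NF, expected }
  '
}

# Example (the path is a placeholder):
#   zcat reheadered.vcf.gz | vcf_field_mismatches | head
```

If every data line is reported with 10 fields against an expected 11, the records carry one genotype column while the reheadered header declares two samples.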
James, see the updated GRIPSS README and https://github.com/hartwigmedical/hmftools/issues/238. From GRIPSS 2.0, the PON needs to be sorted by ChromosomeStart and PositionStart. If you have processed GRIDSS VCFs using GRIPSS with an unordered PON, then even if it finishes, it is probably not annotating correctly.
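The `SV PON not ordered: last(...) vs this(...)` warnings in the logs above are consistent with this. As a sketch, the check below lists PON records whose start position goes backwards within a chromosome. It assumes a tab-delimited file with no header line, the chromosome in column 1 and the start position in column 2; the function name `pon_out_of_order` is invented for illustration.

```shell
# Print records whose start (column 2) is smaller than the start of the
# previous record on the same chromosome (column 1), prefixed with the
# line number. Reads a tab-delimited BED/BEDPE file on stdin.
# NOTE: pon_out_of_order is a hypothetical helper, and the column layout
# is an assumption about the PON files.
pon_out_of_order() {
  awk -F'\t' '
    $1 == chrom && $2 + 0 < prev { print NR ": " $0 }
    { chrom = $1; prev = $2 + 0 }
  '
}

# Example (the path is a placeholder):
#   pon_out_of_order < gridss_pon_breakpoint.38.bedpe | head
```

Something like `sort -k1,1 -k2,2n` might be enough to reorder such a file, but whether GRIPSS expects lexical or karyotypic chromosome order is not confirmed here, so the README and the linked issue remain the reference.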
I've had this happen several times on several samples recently. Maybe it's an easy/small thing, but I don't understand why GRIDSS is producing truncated files. I've included the GRIDSS log; it looks like everything ran fine, yet the VCF is still truncated. I'm not sure why this is or why it keeps happening.
logs attached: gridss.full.20220502_151012.cn2414.67676.log gridss.timing.20220502_151012.cn2414.67676.log gripss_log.txt