When saving a blocked gVCF whose non-reference blocks do not have genotype likelihoods attached, we get the following error:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at htsjdk.variant.vcf.VCFEncoder.addGenotypeData(VCFEncoder.java:286)
at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:136)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:222)
at org.seqdoop.hadoop_bam.VCFRecordWriter.writeRecord(VCFRecordWriter.java:140)
at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:61)
at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:38)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This traces back to code in htsjdk that indexes directly into the 0th element of an array-type FORMAT field without first checking whether the array is empty, which isn't great. That said, the real problem is on our side: the conditional that checks whether we are at a gVCF record when the genotypeLikelihoods field is unset is wrong, and it sets PL on the genotype builder to an empty array instead of leaving PL unset.
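A minimal sketch of the fix described above, under the assumption that the correct behavior is to leave PL unset whenever no likelihoods are present. The `plOrNull` helper is hypothetical, not ADAM's or htsjdk's actual code; it only illustrates the empty-array guard:

```java
import java.util.Arrays;

public class PlGuard {
    /**
     * Returns a PL array only when likelihoods are actually present;
     * returns null otherwise, so the caller can skip setting PL on the
     * genotype builder instead of passing an empty array (which htsjdk's
     * encoder would then index into and crash on).
     */
    public static int[] plOrNull(double[] genotypeLikelihoods) {
        if (genotypeLikelihoods == null || genotypeLikelihoods.length == 0) {
            return null; // leave PL unset for likelihood-free gVCF blocks
        }
        // Round each log-scaled likelihood to an int PL value.
        int[] pl = new int[genotypeLikelihoods.length];
        for (int i = 0; i < pl.length; i++) {
            pl[i] = (int) Math.round(-10.0 * genotypeLikelihoods[i]);
        }
        return pl;
    }

    public static void main(String[] args) {
        System.out.println(plOrNull(new double[0]));          // null
        System.out.println(plOrNull(null));                   // null
        System.out.println(Arrays.toString(plOrNull(new double[] {0.0, -1.0})));
    }
}
```

With this guard in place, a non-reference block without likelihoods never reaches `VCFEncoder.addGenotypeData` with a zero-length PL array, avoiding the `ArrayIndexOutOfBoundsException`.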