bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

IndexOutOfBounds thrown when saving gVCF with no likelihoods #1673

Closed by fnothaft 7 years ago

fnothaft commented 7 years ago

If saving a blocked gVCF where the non-reference blocks do not have likelihoods attached, we get the following error:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at htsjdk.variant.vcf.VCFEncoder.addGenotypeData(VCFEncoder.java:286)
    at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:136)
    at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:222)
    at org.seqdoop.hadoop_bam.VCFRecordWriter.writeRecord(VCFRecordWriter.java:140)
    at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:61)
    at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:38)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This traces back to code in htsjdk that doesn't check whether an array-type FORMAT field is empty before writing it out and instead indexes directly into the 0th element, which isn't great. That said, the bug on our side is that the conditional that checks whether we are at a gVCF record when the genotypeLikelihood field is unset is wrong, so it sets the PL on the genotype builder to an empty array.
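
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the htsjdk or ADAM code: `EmptyPlSketch`, `encodeUnchecked`, `plToAttach`, and `encodeGuarded` are hypothetical names invented for illustration. It shows an encoder that indexes element 0 of an array field without an emptiness check, and a caller-side guard that leaves the field unset instead of passing an empty array.

```java
import java.util.Optional;

// Illustrative only: hypothetical names, not the htsjdk or ADAM APIs.
final class EmptyPlSketch {

    // Stand-in for an encoder that assumes an array-type field is non-empty.
    static String encodeUnchecked(int[] pl) {
        StringBuilder sb = new StringBuilder();
        // Throws ArrayIndexOutOfBoundsException when pl has length 0.
        sb.append(pl[0]);
        for (int i = 1; i < pl.length; i++) {
            sb.append(',').append(pl[i]);
        }
        return sb.toString();
    }

    // Caller-side guard: treat null or empty likelihoods as "field not set".
    static Optional<int[]> plToAttach(int[] likelihoods) {
        if (likelihoods == null || likelihoods.length == 0) {
            return Optional.empty();
        }
        return Optional.of(likelihoods);
    }

    // Encodes the field only when present; "." is the VCF missing value.
    static String encodeGuarded(int[] likelihoods) {
        return plToAttach(likelihoods)
            .map(EmptyPlSketch::encodeUnchecked)
            .orElse(".");
    }

    public static void main(String[] args) {
        System.out.println(encodeGuarded(new int[] {0, 30, 255}));
        System.out.println(encodeGuarded(new int[0]));
        try {
            encodeUnchecked(new int[0]);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked encoder failed on an empty array");
        }
    }
}
```

The guard sits on the caller's side because that is the half of the problem described above that we control: if the empty array never reaches the writer, the unchecked index is never hit.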

heuermh commented 7 years ago

Fixed by #1674