VCFHeaderReader BCF encoded file exception thrown for unrelated VCF header errors

heuermh commented 7 years ago

VCFHeaderReader uses a try/catch to fall back to BCF when reading the header as VCF fails. If the file is actually in VCF format but its header has an unrelated error, this produces a misleading error message and stack trace.

For example, in the session below the first exception (Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS) should have been thrown rather than logged as a warning, and the second exception (Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0) should not have occurred at all.

scala> val variants = sc.loadVariants("truth_small_variants.variants.adam")
warning: while trying to read VCF header from file received exception: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS
htsjdk.tribble.TribbleException: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:
  at htsjdk.variant.bcf2.BCF2Codec.error(BCF2Codec.java:478)
  at htsjdk.variant.bcf2.BCF2Codec.readHeader(BCF2Codec.java:149)
  at org.seqdoop.hadoop_bam.util.VCFHeaderReader.readHeaderFrom(VCFHeaderReader.java:67)
  at org.bdgenomics.adam.rdd.ADAMContext.org$bdgenomics$adam$rdd$ADAMContext$$readVcfHeader(ADAMContext.scala:228)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.bdgenomics.adam.rdd.ADAMContext.loadHeaderLines(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext.loadParquetVariants(ADAMContext.scala:1175)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1733)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1728)
  at scala.Option.fold(Option.scala:158)
  at org.apache.spark.rdd.Timer.time(Timer.scala:48)
  at org.bdgenomics.adam.rdd.ADAMContext.loadVariants(ADAMContext.scala:1726)
  ... 50 elided
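
The failure mode in the trace above can be reproduced in miniature. The following is only a sketch of the catch-all fallback pattern, not the actual Hadoop-BAM or htsjdk code: the class, the two reader methods, and the exception type are hypothetical stand-ins, and the "parsers" just inspect a string.

```java
// Minimal sketch of a catch-all fallback that masks the real error.
// All names here are hypothetical; this is not the Hadoop-BAM implementation.
final class FallbackSketch {

    /** Hypothetical stand-in for a malformed-header error from a VCF parser. */
    static final class InvalidHeaderException extends RuntimeException {
        InvalidHeaderException(String message) {
            super(message);
        }
    }

    /** Pretend VCF parser: accepts VCF text but rejects a bad header field. */
    static String readVcfHeader(String input) {
        if (!input.startsWith("##fileformat=VCF")) {
            throw new InvalidHeaderException("not a VCF header");
        }
        if (input.contains("BAD_PS")) {
            throw new InvalidHeaderException(
                "Count < 0 for fixed size VCF header field BAD_PS");
        }
        return "vcf-header";
    }

    /** Pretend BCF parser: requires the input to start with the BCF magic. */
    static String readBcfHeader(String input) {
        if (!input.startsWith("BCF")) {
            throw new IllegalStateException("BCF magic header info not found");
        }
        return "bcf-header";
    }

    /** Any VCF failure, including a genuine header error, triggers the BCF attempt. */
    static String readHeaderWithFallback(String input) {
        try {
            return readVcfHeader(input);
        } catch (RuntimeException e) {
            // The informative exception is only logged, then discarded.
            System.err.println("warning: while trying to read VCF header: " + e);
            return readBcfHeader(input);
        }
    }

    public static void main(String[] args) {
        String vcfWithBadField = "##fileformat=VCFv4.2\n##INFO=<ID=BAD_PS>\n";
        try {
            readHeaderWithFallback(vcfWithBadField);
        } catch (IllegalStateException e) {
            // The caller only ever sees the unrelated BCF error.
            System.out.println("caller sees: " + e.getMessage());
        }
    }
}
```

In this sketch the caller receives the BCF error even though the input is clearly VCF, which mirrors the behaviour reported above.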

heuermh commented 7 years ago

@fnothaft Ping for feedback on this issue.

cmnbroad commented 7 years ago

Once https://github.com/samtools/htsjdk/pull/837 is merged into htsjdk, we should be able to use it to fix this issue and eliminate the try/catch fallback.
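
One way to avoid a try/catch fallback in general is to decide the format up front from the first few bytes and then call only the matching reader, so a genuine VCF header error propagates unchanged. The sketch below illustrates that idea only; it is not taken from the linked pull request, and the class, enum, and method names are hypothetical. It assumes the stream is already decompressed and supports mark/reset, that BCF data begins with the ASCII bytes `BCF`, and that VCF text begins with `##`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical format sniffer; assumes an uncompressed, mark-capable stream.
final class FormatSniffSketch {

    enum Format { VCF, BCF, UNKNOWN }

    /** Peeks at the first three bytes, then rewinds the stream to where it was. */
    static Format sniff(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] magic = new byte[3];
        int n = 0;
        try {
            in.mark(magic.length);
            // read() may return fewer bytes than requested, so loop until done.
            while (n < magic.length) {
                int r = in.read(magic, n, magic.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            in.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (n == 3 && magic[0] == 'B' && magic[1] == 'C' && magic[2] == 'F') {
            return Format.BCF;
        }
        if (n >= 2 && magic[0] == '#' && magic[1] == '#') {
            return Format.VCF;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] vcf = "##fileformat=VCFv4.2\n".getBytes(StandardCharsets.US_ASCII);
        byte[] bcf = {'B', 'C', 'F', 2, 2};
        System.out.println(sniff(new ByteArrayInputStream(vcf)));
        System.out.println(sniff(new ByteArrayInputStream(bcf)));
    }
}
```

With a check like this, the caller would dispatch to a single reader based on the result and let that reader's exceptions surface directly, instead of catching them and retrying with the other format.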

HadoopGenomics / Hadoop-BAM

VCFHeaderReader BCF encoded file exception thrown for unrelated VCF header errors #132