HadoopGenomics / Hadoop-BAM

Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework

VCFHeaderReader BCF encoded file exception thrown for unrelated VCF header errors #132

Open heuermh opened 7 years ago

heuermh commented 7 years ago

VCFHeaderReader uses a try/catch to fall back to BCF decoding, which produces a misleading error message and stack trace when the header is actually in VCF format but contains an unrelated error.

For example, in the session below the first exception (Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS) should have been thrown rather than logged as a warning, and the second exception (Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0) should never have occurred.

scala> val variants = sc.loadVariants("truth_small_variants.variants.adam")
warning: while trying to read VCF header from file received exception: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS
htsjdk.tribble.TribbleException: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:
  at htsjdk.variant.bcf2.BCF2Codec.error(BCF2Codec.java:478)
  at htsjdk.variant.bcf2.BCF2Codec.readHeader(BCF2Codec.java:149)
  at org.seqdoop.hadoop_bam.util.VCFHeaderReader.readHeaderFrom(VCFHeaderReader.java:67)
  at org.bdgenomics.adam.rdd.ADAMContext.org$bdgenomics$adam$rdd$ADAMContext$$readVcfHeader(ADAMContext.scala:228)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.bdgenomics.adam.rdd.ADAMContext.loadHeaderLines(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext.loadParquetVariants(ADAMContext.scala:1175)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1733)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1728)
  at scala.Option.fold(Option.scala:158)
  at org.apache.spark.rdd.Timer.time(Timer.scala:48)
  at org.bdgenomics.adam.rdd.ADAMContext.loadVariants(ADAMContext.scala:1726)
  ... 50 elided
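
The pattern behind this trace can be illustrated with a minimal sketch. This is not the actual Hadoop-BAM source: the Parser interface and both method names are hypothetical stand-ins for the VCF and BCF codecs. The first method logs the original failure and falls back blindly, so the caller only sees the fallback's error. The second is one possible mitigation, assuming the first error is the more useful one to report: it rethrows the original exception and attaches the fallback failure as a suppressed exception.

```java
final class FallbackSketch {

    /** Hypothetical stand-in for a header codec; not an htsjdk type. */
    interface Parser {
        String parse(String input);
    }

    /**
     * The pattern described in this issue: any failure of the first parser
     * is only logged, and the fallback's unrelated error reaches the caller.
     */
    static String parseWithBlindFallback(Parser vcf, Parser bcf, String input) {
        try {
            return vcf.parse(input);
        } catch (RuntimeException e) {
            System.err.println("warning: while trying to read VCF header received exception: " + e);
            return bcf.parse(input); // for a malformed VCF header this throws "not BCF"
        }
    }

    /**
     * One possible mitigation (an assumption, not the project's fix):
     * if both parsers fail, rethrow the first error and keep the second
     * as a suppressed exception so neither is lost.
     */
    static String parseKeepingFirstError(Parser vcf, Parser bcf, String input) {
        try {
            return vcf.parse(input);
        } catch (RuntimeException first) {
            try {
                return bcf.parse(input);
            } catch (RuntimeException second) {
                first.addSuppressed(second);
                throw first;
            }
        }
    }
}
```

Which exception to surface is a judgment call: for a genuinely corrupt BCF file the first (VCF) error would be the misleading one, which is why deciding the format before parsing is the more robust approach.
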
heuermh commented 7 years ago

@fnothaft Ping for feedback on this issue.

cmnbroad commented 7 years ago

Once https://github.com/samtools/htsjdk/pull/837 is merged into htsjdk, we should be able to use it to fix this issue and eliminate the try/catch fallback.
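
As a rough illustration of what removing the fallback could look like, the sketch below decides the format by peeking at the leading bytes before any codec is chosen. It does not reflect the API in the linked pull request; the VariantFormatSniffer class, its Format enum, and the sniff method are hypothetical names. It assumes the stream is already decompressed (BCF files on disk are normally BGZF-compressed), that BCF data starts with the ASCII bytes "BCF", and that VCF text starts with "##fileformat=VCF".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class VariantFormatSniffer {

    enum Format { VCF, BCF, UNKNOWN }

    private static final String VCF_PREFIX = "##fileformat=VCF";
    private static final String BCF_PREFIX = "BCF";
    private static final int PEEK = VCF_PREFIX.length();

    /**
     * Peeks at the first bytes of a mark-capable, already-decompressed
     * stream and rewinds it, so the caller can hand the untouched stream
     * to whichever codec matches.
     */
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK);
        byte[] head = new byte[PEEK];
        int n = 0;
        while (n < PEEK) {
            int r = in.read(head, n, PEEK - n);
            if (r < 0) {
                break; // end of stream before PEEK bytes were available
            }
            n += r;
        }
        in.reset();

        String prefix = new String(head, 0, n, StandardCharsets.US_ASCII);
        if (prefix.startsWith(BCF_PREFIX)) {
            return Format.BCF;
        }
        if (prefix.startsWith(VCF_PREFIX)) {
            return Format.VCF;
        }
        return Format.UNKNOWN;
    }
}
```

With the format known up front, a malformed VCF header would be reported by the VCF codec alone, and the BCF codec would never be invoked on VCF input. Wrapping the input in a java.io.BufferedInputStream is one way to obtain mark/reset support.
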