bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Transient bad GZIP header bug when loading BGZF FASTQ #1658

Closed fnothaft closed 7 years ago

fnothaft commented 7 years ago
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 27, 10.126.251.149, executor 0): htsjdk.samtools.SAMFormatException: Invalid GZIP header
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:121)
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:96)
    at htsjdk.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:533)
    at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:515)
    at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:451)
    at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:441)
    at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:194)
    at org.seqdoop.hadoop_bam.util.BGZFSplitCompressionInputStream.readWithinBlock(BGZFSplitCompressionInputStream.java:81)
    at org.seqdoop.hadoop_bam.util.BGZFSplitCompressionInputStream.read(BGZFSplitCompressionInputStream.java:48)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:370)
    at org.bdgenomics.adam.io.FastqRecordReader.positionAtFirstRecord(FastqRecordReader.java:244)
    at org.bdgenomics.adam.io.FastqRecordReader.<init>(FastqRecordReader.java:175)
    at org.bdgenomics.adam.io.SingleFastqInputFormat$SingleFastqRecordReader.<init>(SingleFastqInputFormat.java:53)
    at org.bdgenomics.adam.io.SingleFastqInputFormat.createRecordReader(SingleFastqInputFormat.java:112)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:178)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
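
For context, here is a minimal sketch (the file name and offsets are made up for illustration, and this is not ADAM or htsjdk code) of why a FASTQ split that does not begin on a BGZF block boundary trips this check: every BGZF block opens with the gzip magic bytes 0x1f 0x8b, so a reader that seeks to an arbitrary byte offset inside a block and tries to inflate from there hands BlockGunzipper data that fails exactly the header check seen in the trace.

    import java.io.RandomAccessFile

    object BgzfHeaderCheck {
      // Returns true when the two bytes at `offset` are the gzip magic bytes
      // (0x1f, 0x8b) that open every BGZF block. A split starting anywhere else
      // yields data that cannot pass a GZIP header check.
      def looksLikeBgzfBlockStart(path: String, offset: Long): Boolean = {
        val raf = new RandomAccessFile(path, "r")
        try {
          raf.seek(offset)
          val b1 = raf.read()
          val b2 = raf.read()
          b1 == 0x1f && b2 == 0x8b
        } finally {
          raf.close()
        }
      }

      def main(args: Array[String]): Unit = {
        // Hypothetical file and offsets, purely for illustration.
        val path = "reads.fastq.bgz"
        println(looksLikeBgzfBlockStart(path, 0L))     // a valid BGZF file starts with a block header
        println(looksLikeBgzfBlockStart(path, 12345L)) // an arbitrary mid-block offset almost never does
      }
    }

Because split boundaries are computed by byte size, whether a given task starts on a block boundary depends on file layout and split placement, which is consistent with the transient nature of the failure.
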
fnothaft commented 7 years ago

An additional ask is to make sure we have a flag to disable input splitting for these files.
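
For reference, a minimal sketch of one way such a flag could be wired up; the subclass and the configuration key below are hypothetical and not ADAM's actual implementation. Hadoop's FileInputFormat consults isSplitable when computing splits, so returning false forces each file into a single split and the record reader always starts at byte 0, never mid-block.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.bdgenomics.adam.io.SingleFastqInputFormat

    // Sketch only: "adam.fastq.disable.splitting" is an assumed key, not a real ADAM option.
    class NonSplittingSingleFastqInputFormat extends SingleFastqInputFormat {
      override protected def isSplitable(context: JobContext, filename: Path): Boolean = {
        // When the flag is set, refuse to split so no task can land inside a BGZF block.
        !context.getConfiguration.getBoolean("adam.fastq.disable.splitting", false)
      }
    }

The trade-off is parallelism: with splitting disabled, each file is read by a single task, which is why a flag (rather than unconditionally disabling splits) is the ask here.
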

fnothaft commented 7 years ago

This was resolved upstream in Hadoop-BAM.