broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Catching when SAM with .bam extension is loaded in Spark #3488

Open mwalker174 opened 7 years ago

mwalker174 commented 7 years ago

I'm pretty sure this is a hadoop-bam issue, but I'm finding that any BAM produced by bwa (VN 0.7.16a-r1181) will not load in Spark. The same file loads successfully with ValidateSamFile (although it reports errors because there are no read groups). Running it through AddOrReplaceReadGroups makes the error go away.

Attempting to load from local disk gives the following error:

htsjdk.samtools.SAMFormatException: Does not seem like a BAM file
    at org.seqdoop.hadoop_bam.BAMSplitGuesser.<init>(BAMSplitGuesser.java:88)
    at org.seqdoop.hadoop_bam.BAMInputFormat.addProbabilisticSplits(BAMInputFormat.java:228)
    at org.seqdoop.hadoop_bam.BAMInputFormat.getSplits(BAMInputFormat.java:155)
    at org.seqdoop.hadoop_bam.AnySAMInputFormat.getSplits(AnySAMInputFormat.java:252)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1134)
    at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:454)
    at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:45)
    at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqDiscoverySpark.runTool(PathSeqDiscoverySpark.java:593)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:233)

lbergelson commented 7 years ago

@mwalker174 Many GATK tools require BAMs to have read groups. We should probably update our BWA tools to add read groups, although fixing hadoop-bam to give a more useful error message would be good as well.
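
For context, adding a read group with htsjdk is only a few lines. The sketch below is illustrative only; the class name and the read-group values are made up and this is not GATK tool code:

    import htsjdk.samtools.SAMFileHeader;
    import htsjdk.samtools.SAMReadGroupRecord;
    import htsjdk.samtools.SAMRecord;
    import htsjdk.samtools.SAMTag;

    public final class ReadGroupExample {
        // Adds a single (made-up) read group to the header and tags one record with it.
        static void tagWithReadGroup(final SAMFileHeader header, final SAMRecord read) {
            final SAMReadGroupRecord rg = new SAMReadGroupRecord("RG1"); // hypothetical ID
            rg.setSample("sample1");     // SM
            rg.setLibrary("lib1");       // LB
            rg.setPlatform("ILLUMINA");  // PL
            rg.setPlatformUnit("unit1"); // PU
            header.addReadGroup(rg);
            read.setAttribute(SAMTag.RG.name(), rg.getReadGroupId());
        }
    }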

lbergelson commented 7 years ago

@tomwhite Any thoughts on this?

tomwhite commented 7 years ago

Looking at the error it seems to be failing because the file doesn't have a BGZF magic number. Can you post the first few bytes of the file (via hexdump or similar)?
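For reference, every BGZF block starts with the gzip magic bytes plus the deflate method and FEXTRA flag (0x1f 0x8b 0x08 0x04), which a plain-text file cannot match. A minimal check for that prefix, shown only as a sketch and not the actual hadoop-bam code, could look like this:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public final class BgzfMagicCheck {
        // First four bytes of every BGZF block per the SAM/BAM spec:
        // gzip magic (0x1f 0x8b), deflate compression method (0x08), FEXTRA flag set (0x04).
        private static final int[] BGZF_MAGIC = {0x1f, 0x8b, 0x08, 0x04};

        // Returns true if the stream starts with the BGZF magic bytes.
        static boolean startsWithBgzfMagic(final InputStream in) throws IOException {
            for (final int expected : BGZF_MAGIC) {
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        }

        public static void main(final String[] args) throws IOException {
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                System.out.println(startsWithBgzfMagic(in) ? "looks like BGZF (BAM)" : "not BGZF");
            }
        }
    }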

mwalker174 commented 7 years ago

Looks like there's just the header:

    $ hexdump -c -n 16 COAD-WGS-Nonhost-TP.trinityunmapped.bam
    0000000   @   S   Q  \t   S   N   :   T   R   I   N   I   T   Y   D
    0000010

tomwhite commented 7 years ago

Not sure that's a valid BAM file...

mwalker174 commented 7 years ago

Looks like this is my fault... I didn't realize bwa produces SAM output, and the non-Spark tool was correcting my mistake automatically (by checking for the magic number rather than trusting the extension). Can we make the error message more informative, e.g. "BAM file must start with BGZF magic number"?

It would be great to detect whether a file is SAM or BAM by checking its contents rather than its extension, as the non-Spark tools that use htsjdk do. Is this easily done?
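
As an illustration of what content-based detection could look like, here is a minimal sniffer that inspects the first four bytes instead of the file name. It is only a sketch, assuming a BGZF/gzip prefix means BAM, an ASCII "CRAM" prefix means CRAM, and a leading '@' means a text SAM header; htsjdk's SamReaderFactory already performs this kind of sniffing for local files:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public final class SamFormatSniffer {
        public enum Format { BAM, CRAM, SAM_TEXT, UNKNOWN }

        // Guesses the format from the first four bytes of the stream instead of the file name.
        static Format sniff(final InputStream rawIn) throws IOException {
            final BufferedInputStream in = new BufferedInputStream(rawIn);
            in.mark(4);
            final byte[] head = new byte[4];
            final int n = in.read(head);
            in.reset();
            if (n < 4) {
                return Format.UNKNOWN;
            }
            // BGZF-compressed BAM: gzip magic with the FEXTRA flag set.
            if ((head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b
                    && head[2] == 0x08 && head[3] == 0x04) {
                return Format.BAM;
            }
            // CRAM files begin with the ASCII bytes "CRAM".
            if (head[0] == 'C' && head[1] == 'R' && head[2] == 'A' && head[3] == 'M') {
                return Format.CRAM;
            }
            // Plain-text SAM typically starts with an '@' header line.
            if (head[0] == '@') {
                return Format.SAM_TEXT;
            }
            return Format.UNKNOWN;
        }
    }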

@lbergelson To clarify, I was using the regular BWA binaries, not the GATK BWA tool.