Open mwalker174 opened 7 years ago
@mwalker174 Many GATK tools require BAMs to have read groups. We should probably update our BWA tools to add read groups, although fixing hadoop-bam to give a more useful error message would be good as well.
@tomwhite Any thoughts on this?
Looking at the error it seems to be failing because the file doesn't have a BGZF magic number. Can you post the first few bytes of the file (via hexdump or similar)?
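For anyone who wants to check this without hexdump, here is a rough sketch of the same test in Python. It is not what hadoop-bam does internally; it only follows the BGZF block layout from the SAM/BAM spec (gzip magic with the FEXTRA flag set, plus a `BC` extra subfield of length 2). The function name `looks_like_bgzf` is made up for illustration.

```python
import struct

# gzip ID1, ID2, CM=8 (deflate), FLG=4 (FEXTRA), as required for a BGZF block
BGZF_MAGIC = b"\x1f\x8b\x08\x04"


def looks_like_bgzf(head: bytes) -> bool:
    """Return True if `head` (the first bytes of a file) starts a BGZF block.

    Hypothetical helper: checks the 4-byte magic, then walks the gzip
    extra subfields looking for the 'BC' subfield that BGZF requires.
    """
    if len(head) < 12 or head[:4] != BGZF_MAGIC:
        return False
    # XLEN (length of the extra field) is a little-endian uint16 at offset 10
    xlen = struct.unpack_from("<H", head, 10)[0]
    extra = head[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# A file that begins with SAM header text is not a BGZF block
print(looks_like_bgzf(b"@SQ\tSN:"))  # False
```

Passing it the first 16 or so bytes of the file (`open(path, "rb").read(16)`) should be enough for the common case where `BC` is the first extra subfield.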
Looks like there's just the header:
$ hexdump -c -n 16 COAD-WGS-Nonhost-TP.trinityunmapped.bam
0000000 @ S Q \t S N : T R I N I T Y D
0000010
Not sure that's a valid BAM file...
Looks like this is my fault... I didn't realize BWA produces SAM output and the non-spark tool was correcting my mistake automatically (by checking for a magic number). Can we make the error message more informative like: "BAM file must start with BGZF magic number"?
It would be great to detect whether it's SAM or BAM by checking the file contents, as in non-spark tools that use htsjdk, rather than the extension. Is this easily done?
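To make the idea concrete, here is a minimal sketch of content-based detection in Python. It is only an illustration of the approach, not htsjdk's or hadoop-bam's actual logic, and `sniff_alignment_format` is a hypothetical name. It assumes a BAM is a gzip/BGZF stream whose decompressed payload starts with the magic `BAM\1`, and that a SAM file starts with an `@` header line (a headerless SAM would come back as "unknown").

```python
import gzip
import zlib


def sniff_alignment_format(path: str) -> str:
    """Classify a file as 'BAM', 'SAM', or 'unknown' from its contents.

    Hypothetical helper: the file extension is ignored entirely.
    """
    with open(path, "rb") as fh:
        head = fh.read(4)

    if head[:2] == b"\x1f\x8b":
        # gzip/BGZF container: a BAM payload begins with b"BAM\x01"
        try:
            with gzip.open(path, "rb") as gz:
                return "BAM" if gz.read(4) == b"BAM\x01" else "unknown"
        except (OSError, EOFError, zlib.error):
            # Truncated or corrupt compressed data
            return "unknown"

    if head[:1] == b"@":
        # SAM header lines start with '@' (for example @HD, @SQ, @RG)
        return "SAM"

    return "unknown"
```

With something like this, a SAM file that happens to be named `.bam` would be reported as SAM, which could either be loaded through the SAM path or turned into a clearer error message.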
@lbergelson To clarify, I was using the regular BWA binaries, not the GATK BWA tool.
I'm pretty sure this is a hadoop-bam issue, but I'm finding that any BAM produced by bwa (VN 0.7.16a-r1181) will not load in Spark. The BAM loads successfully in ValidateSamFile (although it throws errors because there are no RGs). Running it through AddOrReplaceReadGroups makes the error go away.
Attempting to load from local disk gives the following error:
htsjdk.samtools.SAMFormatException: Does not seem like a BAM file
    at org.seqdoop.hadoop_bam.BAMSplitGuesser.<init>(BAMSplitGuesser.java:88)
    at org.seqdoop.hadoop_bam.BAMInputFormat.addProbabilisticSplits(BAMInputFormat.java:228)
    at org.seqdoop.hadoop_bam.BAMInputFormat.getSplits(BAMInputFormat.java:155)
    at org.seqdoop.hadoop_bam.AnySAMInputFormat.getSplits(AnySAMInputFormat.java:252)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1134)
    at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:454)
    at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:45)
    at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqDiscoverySpark.runTool(PathSeqDiscoverySpark.java:593)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:233)