bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Apache License 2.0
1k stars 308 forks source link

SAMFileHeader "sort order" attribute being un-set during file-save job #800

Closed ryan-williams closed 8 years ago

ryan-williams commented 9 years ago

I am currently seeing the SAMFileHeader header in ADAMSAMOutputFormat get mutated during the saveAsNewAPIHadoopFile job; specifically, the "sorted order" (attribute "SO") is being dropped, resulting in sorter SAMs that had SO:coordinate set on the header being written with SO:unsorted, and the header having unsorted in this attribute the file-write is done.

This branch repros it with some printlns; run:

mvn clean package -DskipTests
bin/adam-submit transform -single -sort_reads adam-cli/src/test/resources/bqsr1.sam sorted.sam

Here's a gist fo the full output of the latter command.

Section of note (L424-439 of the gist):

org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$@274bcf9 return header: Some(SAMFileHeader{VN=1.4, SO=coordinate})
got header: SAMFileHeader{VN=1.4, SO=coordinate} from org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$@274bcf9
15/08/23 22:33:56 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/08/23 22:33:56 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
15/08/23 22:33:57 INFO BlockManagerInfo: Removed broadcast_7_piece0 on localhost:60426 in memory (size: 2.6 KB, free: 265.4 MB)
15/08/23 22:33:57 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:60426 in memory (size: 1763.0 B, free: 265.4 MB)
15/08/23 22:33:57 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:60426 in memory (size: 2.1 KB, free: 265.4 MB)
15/08/23 22:33:57 INFO ContextCleaner: Cleaned shuffle 1
15/08/23 22:33:57 INFO FileOutputCommitter: Saved output of task 'attempt_201508232233_0012_r_000000_0' to file:/Users/ryan/c/adam/sorted.sam/_temporary/0/task_201508232233_0012_r_000000
15/08/23 22:33:57 INFO Executor: Finished task 0.0 in stage 8.0 (TID 5). 844 bytes result sent to driver
15/08/23 22:33:57 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 5) in 605 ms on localhost (1/1)
15/08/23 22:33:57 INFO DAGScheduler: ResultStage 8 (saveAsNewAPIHadoopFile at AlignmentRecordRDDFunctions.scala:188) finished in 0.605 s
15/08/23 22:33:57 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/08/23 22:33:57 INFO DAGScheduler: Job 3 finished: saveAsNewAPIHadoopFile at AlignmentRecordRDDFunctions.scala:188, took 0.668444 s
org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$@274bcf9 return header: Some(SAMFileHeader{VN=1.4, SO=unsorted})
middle: SAMFileHeader{VN=1.4, SO=unsorted}

This is some output from the saveAsNewAPIHadoopFile job.

The result, as the gist shows, is a SAM file with SO:unsorted in its header:

$ head sorted.sam
@HD VN:1.4  SO:unsorted
…

I've been reproducing this deterministically for hours, originally starting from within a test case I was writing while working on #799.

ryan-williams commented 9 years ago

After a long slog, I think I've found the culprit: a bug in hadoop-bam's use of HTSJDK.

Here I've patched a hadoop-bam fork with the fix, off of 7.0.0. Here I have ADAM picking up the patched version of hadoop-bam above.

The bug is that SAMTextWriter.setSortOrder is not called before SAMTextWriter.setHeader, despite this comment stating that it must (link is from htsjdk 1.118, which hadoop-bam 7.0.0 uses a lightly-patched fork of, but this API and comment still exist in htsjdk 1.138). Why htsjdk is set up that way, I have no idea.

To test the bug and my fix for yourself:

git clone git@github.com:ryan-williams/Hadoop-BAM.git
cd hadoop-bam
# writer-fix branch should already be checked out
mvn install -DskipTests

Then, in any ADAM repo:

bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam
head -n 1 sorted.sam

You'll (incorrectly) get SO:unsorted, even though the header and reads will have been sorted:

@HD VN:1.4  SO:unsorted

To test with my fix:

git remote add ryan git@github.com:ryan-williams/adam.git
git fetch ryan
git checkout ryan/hadoop-bam-fix
mvn package -DskipTests
mv sorted.sam{,.bak}
bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam

This run will output a sorted.sam with the correct SO:coordinate tag in the header line:

diff sorted.sam{,.bak}
1c1
< @HD   VN:1.4  SO:coordinate
---
> @HD   VN:1.4  SO:unsorted

So, I'll file this against hadoop-bam; we could move to a fork if it is bothering us enough, but I think this is just relegated to the SO attribute on the @HD line, so we don't really need to get it fixed, I suppose.

ryan-williams commented 9 years ago

Filed https://github.com/HadoopGenomics/Hadoop-BAM/issues/21 and https://github.com/HadoopGenomics/Hadoop-BAM/pull/22.

fnothaft commented 9 years ago

Nice! Fun to read through the thorough debug trace here. I assume this fix will be holding in place until the next release of Hadoop-BAM, then?

ryan-williams commented 9 years ago

Yea, I guess so. We could probably push them to release a 7.1.1; do you have thoughts on upgrading to e.g. 7.1.0 in general? Any idea whether it might be hard or easy?

fnothaft commented 8 years ago

I think this was resolved by #917.