Closed ryan-williams closed 8 years ago
After a long slog, I think I've found the culprit: a bug in hadoop-bam's use of HTSJDK.
Here I've patched a hadoop-bam fork with the fix, off of 7.0.0. Here I have ADAM picking up the patched version of hadoop-bam above.
The bug is that SAMTextWriter.setSortOrder is not called before SAMTextWriter.setHeader, despite this comment stating that it must (link is from htsjdk 1.118, which hadoop-bam 7.0.0 uses a lightly-patched fork of, but this API and comment still exist in htsjdk 1.138). Why htsjdk is set up that way, I have no idea.
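For illustration, here is a minimal, self-contained sketch of how an ordering requirement like this can produce the symptom. It is a toy model, not htsjdk or hadoop-bam code: ToyHeader, ToyWriter, writeBuggy and writeFixed are hypothetical names, and the assumption that setHeader stamps the writer's own sort order (defaulting to unsorted) onto the header it is handed is inferred from the comment mentioned above and from the observed output.

```java
public class SortOrderDemo {
    enum SortOrder { unsorted, queryname, coordinate }

    /** Toy stand-in for a SAM header: only tracks a mutable sort order. */
    static final class ToyHeader {
        private SortOrder sortOrder;
        ToyHeader(SortOrder sortOrder) { this.sortOrder = sortOrder; }
        SortOrder getSortOrder() { return sortOrder; }
        void setSortOrder(SortOrder sortOrder) { this.sortOrder = sortOrder; }
        String hdLine() { return "@HD\tVN:1.4\tSO:" + sortOrder; }
    }

    /** Toy writer; ASSUMPTION: setHeader overwrites the header's sort order. */
    static final class ToyWriter {
        private SortOrder sortOrder; // stays null unless setSortOrder is called
        private ToyHeader header;

        void setSortOrder(SortOrder sortOrder) {
            if (header != null) {
                throw new IllegalStateException("setSortOrder must be called before setHeader");
            }
            this.sortOrder = sortOrder;
        }

        void setHeader(ToyHeader header) {
            if (sortOrder == null) {
                sortOrder = SortOrder.unsorted; // default when never told otherwise
            }
            header.setSortOrder(sortOrder); // mutates the caller's header object
            this.header = header;
        }
    }

    /** Mirrors the bug: the writer is never told the sort order. */
    static String writeBuggy(ToyHeader header) {
        ToyWriter writer = new ToyWriter();
        writer.setHeader(header);
        return header.hdLine();
    }

    /** Mirrors the described fix: pass the sort order along before the header. */
    static String writeFixed(ToyHeader header) {
        ToyWriter writer = new ToyWriter();
        writer.setSortOrder(header.getSortOrder());
        writer.setHeader(header);
        return header.hdLine();
    }

    public static void main(String[] args) {
        // Sort order is lost: the line ends in SO:unsorted
        System.out.println(writeBuggy(new ToyHeader(SortOrder.coordinate)));
        // Sort order is preserved: the line ends in SO:coordinate
        System.out.println(writeFixed(new ToyHeader(SortOrder.coordinate)));
    }
}
```

In this model the caller's header object is changed in place, so a header that was coordinate-sorted before the write reports unsorted afterwards even though nothing reassigned the field that holds it.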
To test the bug and my fix for yourself:
git clone git@github.com:ryan-williams/Hadoop-BAM.git
cd Hadoop-BAM
# writer-fix branch should already be checked out
mvn install -DskipTests
Then, in any ADAM repo:
bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam
head -n 1 sorted.sam
You'll (incorrectly) get SO:unsorted, even though the header and reads will have been sorted:
@HD VN:1.4 SO:unsorted
To test with my fix:
git remote add ryan git@github.com:ryan-williams/adam.git
git fetch ryan
git checkout ryan/hadoop-bam-fix
mvn package -DskipTests
mv sorted.sam{,.bak}
bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam
This run will output a sorted.sam with the correct SO:coordinate tag in the header line:
diff sorted.sam{,.bak}
1c1
< @HD VN:1.4 SO:coordinate
---
> @HD VN:1.4 SO:unsorted
So, I'll file this against hadoop-bam; we could move to a fork if it is bothering us enough, but I think this is confined to the SO attribute on the @HD line, so we don't really need to get it fixed, I suppose.
Nice! Fun to read through the thorough debug trace here. I assume this fix will be holding in place until the next release of Hadoop-BAM, then?
Yea, I guess so. We could probably push them to release a 7.1.1; do you have thoughts on upgrading to e.g. 7.1.0 in general? Any idea whether it might be hard or easy?
I think this was resolved by #917.
I am currently seeing the SAMFileHeader header in ADAMSAMOutputFormat get mutated during the saveAsNewAPIHadoopFile job; specifically, the "sorted order" (attribute "SO") is being dropped, resulting in sorted SAMs that had SO:coordinate set on the header being written with SO:unsorted, and the header having unsorted in this attribute after the file-write is done.
This branch repros it with some printlns; run:
Here's a gist of the full output of the latter command.
Section of note (L424-439 of the gist):
This is some output from the saveAsNewAPIHadoopFile job.
At the beginning, we see:
from this println that I added. The header has coordinate sort.
At the end:
So the header now has unsorted sort order; note that both the above printlns show the same pointer address for ADAMSAMOutputFormat, @274bcf9; its sort order value has been mutated by… something.
header is private in this branch. How is it being modified?
The result, as the gist shows, is a SAM file with SO:unsorted in its header:
I've been reproducing this deterministically for hours, originally starting from within a test case I was writing while working on #799.