dotnetbio / bio

Bioinformatics library for .NET
Apache License 2.0
143 stars 49 forks source link

BamParser.Parse() returns null objects in rare instances #41

Open RoboStephen opened 4 years ago

RoboStephen commented 4 years ago

Today I used dotnetbio to parse a bam file. In rare instances (for a total of 3 reads out of 1 million), BamParser.Parse() returned a null object rather than a SAMAlignedSequence object.

I stepped through in the debugger and I think these lines, in BamParser.cs, are the cause of my issue: if (alignedSeq.RefEndPos - 1 < start && alignedSeq.RName!="*") { return null; }

Here I'm reading the whole bam, so start == 0. This is a read which is unmapped, but it still has its Pos set to 1 and RName set because its mate was mapped with Pos==1. For the affected read, alignedSeq.RefEndPos is 0. By subtracting 1 from alignedSeq.RefEndPos we get -1, which is less than start == 0, so we return null.

This one-line change fixes the bug, and I believe it's correct in general - I confirmed the unit tests still pass: if (alignedSeq.RefEndPos - 1 < start && alignedSeq.CIGAR != "*") { return null; }

evolvedmicrobe commented 4 years ago

Hi,

Thanks for commenting on this!

This strikes me as a bit tricky, I think the method you're looking at: GetAlignedSequence(int start, int end), is designed to return all queries that overlap a range, if the read is associated with a chromosome (e.g. RNAME!=*). For reads that are unmapped but associated with a chromosome for sorting purposes, I think there are circumstances both where one would, and would not, like to consider that read to overlap with the region queried. Do you happen to know what samtools view does for these reads if one specifies the start/end coordinates on the chromosome? I think when coordinates are specified the library should probably match that behavior.

When coordinates are not specified these reads should be returned, so that's a bug as you point out. Right now it looks like this case is indicated with a 0/MaxInt for start/end, perhaps with a different sentinel value to indicate all reads regardless of overlap should be returned (start = -1? ) we could get the correct behavior?. Right now if a read is unmapped (e.g. Bit 0x4) we shouldn't make any assumptions about RNAME/CIGAR/POS according the SAM spec, yet we'd want to return reads like this in all cases when we're parsing everything. Any idea if changing:

return GetAlignedSequence(0, int.MaxValue);

to return GetAlignedSequence(-1, int.MaxValue); in some form is also compatible with the current API?

RoboStephen commented 4 years ago

I ran a couple of samtools view tests. It looks like samtools treats the unmapped read as though it overlaps the first base of the chromosome and no others, when it comes to including or excluding it. That seems appropriate to me:

I confirmed that if we change the start and end parameters to GetAlignedSequence to -1 and int.MaxValue (instead of 0 and int.MaxValue), that resolves this issue too. (There are several APIs which use those sentinel values; if we change one, we'd want to change them all for consistency)

Yet another option: If GetAlignedSequence accepted nullable ints, I think that would make the logic clearer than having special explicit sentinel values like 0 or -1 and int.MaxValue. GetAlignmentMapIterator already has one nullable-int argument (refSeq) so making start and end nullable seems natural.