Open JoeVieira opened 12 months ago
Bumping this. As it pertains to a stealthy bug that can produce incorrect output, rather than a new feature, I hope it can be prioritized.
Attention: Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.
Project coverage is 95.63%. Comparing base (ab8959d) to head (a2a37da). Report is 25 commits behind head on main.
Files | Patch % | Lines |
---|---|---|
...ala/com/fulcrumgenomics/bam/SamRecordClipper.scala | 91.66% | 1 Missing :warning: |
@nh13 @tfenne
When several clipping operations are applied in sequence to SamRecords, clipping past the mate's end relies on the mate CIGAR (MC tag) information, but that information is not kept in sync with the corresponding SamRecord object's data between operations.
This results in an inconsistent data model for clipping.
With the previous code, this test case would fail.
It would erroneously calculate `r2.start == 100` and `r1.start != r2.start`, because the mate start data from the MC tag was used, and that data had not been updated to reflect the 5' clipping from the first operation.
https://github.com/fulcrumgenomics/fgbio/blob/2af51acea8cd55fbc393ce435b5b13d7a32fc9ae/src/main/scala/com/fulcrumgenomics/bam/SamRecordClipper.scala#L358
This reliance on a possibly stale MC tag is the cause of the inconsistent behavior reported in #878. I believe this fix makes that issue irrelevant, and it may also fix some other oddities that people have pointed out.
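To make the failure mode concrete, here is a minimal sketch using a toy `Read` case class rather than fgbio's or htsjdk's actual types. The field `cachedMateStart` stands in for mate information stored on the record itself, and all names and coordinates are hypothetical, chosen only to mirror the `r2.start == 100` symptom described above.

```scala
object StaleMateTagSketch {
  /** Toy stand-in for an aligned read. `cachedMateStart` plays the role of mate
    * information stored on the record itself. This is NOT fgbio's API. */
  final case class Read(start: Int, end: Int, cachedMateStart: Int)

  /** Clips `bases` from the left end; only this read's own start moves. */
  def clipLeft(read: Read, bases: Int): Read =
    read.copy(start = read.start + bases)

  /** Clips the part of `read` lying before the mate's start, trusting the cached value. */
  def clipBeforeMateStartFromCache(read: Read): Read =
    read.copy(start = math.max(read.start, read.cachedMateStart))

  /** Same clipping, but reads the mate's start from the live mate object. */
  def clipBeforeMateStart(read: Read, mate: Read): Read =
    read.copy(start = math.max(read.start, mate.start))

  // Illustrative pair: r2 starts before r1, so r2 extends past its mate's start.
  val r1: Read = Read(start = 100, end = 149, cachedMateStart = 90)
  val r2: Read = Read(start = 90, end = 139, cachedMateStart = 100)

  // First operation: clip 10 bases from r1. r2.cachedMateStart is not updated.
  val r1Clipped: Read = clipLeft(r1, 10)

  // Second operation, done two ways.
  val stale: Read = clipBeforeMateStartFromCache(r2)   // uses the outdated mate start
  val fresh: Read = clipBeforeMateStart(r2, r1Clipped) // uses the live mate
}
```

In this toy model the cached path leaves `r2` starting before the clipped `r1`, while passing the mate explicitly makes the two starts agree.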
I've removed the convenience functions that allowed this to happen, and now require callers to pass either the mate explicitly or the start/end.
I do this rather than updating the MC tag, because that extra step between each operation doesn't seem worthwhile when we already have the object loaded in memory. The tags should be updated a single time, after all operations are complete.
The exception is a single method used in consensus calling, where the mate isn't easily available. This pattern still requires reading from the mate CIGAR, so the bug might still occur there: each mate could have clipping applied, but again the data isn't synchronized to the MC tag.
I'll also work on that bug, which I believe should be handled by operating on mates together, so that all relevant data is loaded in memory explicitly rather than read from the tag.
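One way to picture "operate on mates together, then update the tags once" is the sketch below. It is only an illustration under assumed names: `Read`, `PairOp`, `syncMateInfo`, and `clipPair` are hypothetical and do not correspond to anything in fgbio or htsjdk.

```scala
object SyncTagsOnceSketch {
  /** Toy read with cached mate coordinates standing in for tag-derived data. */
  final case class Read(start: Int, end: Int, cachedMateStart: Int, cachedMateEnd: Int)

  /** A clipping step that sees both live reads and returns the updated pair. */
  type PairOp = (Read, Read) => (Read, Read)

  /** Refreshes the cached mate coordinates on both reads from the live objects. */
  def syncMateInfo(r1: Read, r2: Read): (Read, Read) =
    (r1.copy(cachedMateStart = r2.start, cachedMateEnd = r2.end),
     r2.copy(cachedMateStart = r1.start, cachedMateEnd = r1.end))

  /** Runs every operation on the in-memory pair, then syncs the cached data once. */
  def clipPair(r1: Read, r2: Read, ops: Seq[PairOp]): (Read, Read) = {
    val (a, b) = ops.foldLeft((r1, r2)) { case ((x, y), op) => op(x, y) }
    syncMateInfo(a, b)
  }

  // Two illustrative steps: clip 10 bases from r1's left end, then clip the
  // part of r2 that lies before r1's (now updated) start.
  val ops: Seq[PairOp] = Seq(
    (a: Read, b: Read) => (a.copy(start = a.start + 10), b),
    (a: Read, b: Read) => (a, b.copy(start = math.max(b.start, a.start)))
  )

  val (out1, out2) = clipPair(
    Read(start = 100, end = 149, cachedMateStart = 90, cachedMateEnd = 139),
    Read(start = 90, end = 139, cachedMateStart = 100, cachedMateEnd = 149),
    ops
  )
}
```

Because each step receives the live pair, no step depends on the cached fields, and those fields are rewritten only once at the end.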
As a side note: I don't understand the whitespace formatting rules for the project. Do you happen to have an IntelliJ profile I could use to keep my changes consistent with your formatting?