fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
311 stars 67 forks source link

Bug Fix - SamRecordClipper.numBasesExtendingPastMate #937

Open JoeVieira opened 12 months ago

JoeVieira commented 12 months ago

@nh13 @tfenne

When operating on SamRecords with multiple serialized operations, use of mate cigar record information is relied upon for clipping past mate end, but it is not synchronized between operations with it's corresponding SamRecord object's data.

This results in an inconsistent data model for clipping.

With previous code this test case would fail.

it should "clip FR reads that extend past mate in asymmetrical read" in {
    val builder = new SamBuilder(readLength = 200, sort = Some(Queryname))
    val clipper = new ClipBam(input = dummyBam, output = dummyBam, ref = ref, clipBasesPastMate = true,
      readTwoFivePrime = 50, readOneFivePrime = 10)
    val Seq(r1, r2) = builder.addPair(start1 = 100, start2 = 90)

    r1.end shouldBe 299
    r2.end shouldBe 289
    clipper.clipPair(r1, r2)
    r2.end shouldBe r2.end
    r1.start shouldBe r2.start
  }

It would erroneously calculate r2.start == 100 & r1.start != r2.start, because the mate.start data from the MC tag was used, which was not updated with the 5' clipping from the first operation.

https://github.com/fulcrumgenomics/fgbio/blob/2af51acea8cd55fbc393ce435b5b13d7a32fc9ae/src/main/scala/com/fulcrumgenomics/bam/SamRecordClipper.scala#L358

This reliance on a possibly dirty MC tag is the cause of this #878 inconsistent behavior. I believe this fix, makes this issue irrelevant & it certainly might fix some other oddities that people seem to have pointed out.

I've removed convenience functions which allow this to happen, and enforce either explicit passing of mates or start / end.

I do this rather than updating the MC record, because that extra step between each operation doesn't seem worthwhile, when we have the object loaded in memory already, the tags should be updated after all operations are complete a single time.

The exception to this is a single method which is used in consensus calling, where mate isn't easily available - this pattern still requires getting from mate cigar, which might still result in this bug occurring, as each mate could have clipping applied but again the data isn't synchronized to the MC tag.

I'll also work on that bug also, which I believe should be handled by operating on mates together, so all relevant data is loaded in memory explicitly, rather than using the tag.

As a sidenote: I don't understand the white space formatting rules for the project, do you happen to have an intellj profile i could use to keep these consistent for your formatting?

eboyden commented 5 months ago

Bumping this. As it pertains to a stealthy bug that can produce incorrect output, rather than a new feature, I hope it can be prioritized.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.63%. Comparing base (ab8959d) to head (a2a37da). Report is 25 commits behind head on main.

Files Patch % Lines
...ala/com/fulcrumgenomics/bam/SamRecordClipper.scala 91.66% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #937 +/- ## ========================================== + Coverage 95.60% 95.63% +0.02% ========================================== Files 126 126 Lines 7307 7308 +1 Branches 507 479 -28 ========================================== + Hits 6986 6989 +3 + Misses 321 319 -2 ``` | [Flag](https://app.codecov.io/gh/fulcrumgenomics/fgbio/pull/937/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fulcrumgenomics) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/fulcrumgenomics/fgbio/pull/937/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fulcrumgenomics) | `95.63% <91.66%> (+0.02%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fulcrumgenomics#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.