fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
308 stars 67 forks source link

bug: the logic for the PileupBuilder `includeMapPositionsOutsideFrInsert` filter is wrong #980

Closed jrm5100 closed 4 months ago

jrm5100 commented 4 months ago

positionIsOutsideFrInsert is actually positionIsInsideFrInsert

https://github.com/fulcrumgenomics/fgbio/blob/3a74fd28ff630b417410953541ff107a1b7b0fb7/src/main/scala/com/fulcrumgenomics/bam/pileup/PileupBuilder.scala#L123-L128

The filter is used in a way that is reversed so that the current tests still work as designed.

https://github.com/fulcrumgenomics/fgbio/blob/3a74fd28ff630b417410953541ff107a1b7b0fb7/src/main/scala/com/fulcrumgenomics/bam/pileup/PileupBuilder.scala#L203-L205

Additionally there is a redundant rec.isFrPair call (this is done inside the method).

I came across this because I was tracking down some aberrant alleles in a pileup. Should reads mapped to separate chromosomes be rejected by this filter, or should that be a separate filter (likely in quickAcceptRecord)? Currently these are accepted even if includeMapPositionsOutsideFrInsert is false, which i think may be misleading.

clintval commented 4 months ago

I confirm:

  1. The privatepositionIsOutsideFrInsert was actually positionIsInsideFrInsert
  2. Despite (1), the use of positionIsOutsideFrInsert was still correct in function
  3. ~The additional rec.isFrPair was unnecessary~ EDIT: see below, this was not true.

Check out #981. I can't assign you as a reviewer, but would like your 👀 on it.

clintval commented 4 months ago

The history of the filter started 5 years ago in the context of somatic variant filtration:

https://github.com/fulcrumgenomics/fgbio/blob/ff43d10d5443874504632f6af2b27f6c7e0f23c2/src/main/scala/com/fulcrumgenomics/vcf/filtration/FilterSomaticVcf.scala#L140-L166

It was then refactored into the PileupBuilder and swapped from a negative filter (exclude) to a positive filter (include):

The filter still behaves correctly:

Action Behavior
True Do not apply the filter.
False Remove FR pairs where the position in that pair is outside the insert coordinates. Otherwise keep the record (single-end, RF pair, tandem pair).

Because of this semantic change, the includeMapPositionsOutsideFrInsert filter behaves correctly but its meaning might be unclear. Why would someone want to include map positions outside the FR pair? I now think the filter should behave exactly as it is now but be called:

And its default set to true.

clintval commented 4 months ago

What do you think of renaming includeMapPositionsOutsideFrInsert to includeMapPositionsInsideFrInsert and swapping the default boolean from false to true? It would be a breaking Scala API change, but gratefully the name change would require a developer who is using it to reconsider the filter when updating their use of the API.

clintval commented 4 months ago

@jrm5100 I don't think we want to exclude records in a pair where a mate maps to a different reference sequence. Although these pairs would not be proper, they should still count towards coverage when requesting a pile up of reads overlapping either of the reads in the pair.

But if you do not want to include pairs with reads mapped to different reference sequences, you can add a custom filter like:

PileupBuilder(source).withReadFilter(rec => rec.refName == rec.mateRefName)
jrm5100 commented 4 months ago

I think the main sticking point here is the intention of the filter.

Anything not isFRPair includes:

Reads that are isFrPair may look like this:

        ─────────────────────────►     

   ◄────────────────────────           

        │                  │           
        │   Insert Size    │           
        │                  │                            

and being OutsideFrInsert would mean in the 3' tails that likely aren't real.

So is the filter meant to:

The distinction is where isFrPair is used.

It seems like the second is the intended usage. In that case I think the current filter name and functionality is fine, but the internal function just needs to be renamed and used slightly differently:

if (compare) compare = if (!includeMapPositionsOutsideFrInsert && rec.isFrPair) {
          PileupBuilder.positionIsInsideFrInsert(rec, refIndex = refIndex, pos = pos)
        } else { compare }

In this case the && rec.isFrPair is actually necessary, since positionIsInsideFrInsert is False if the record isn't in a FR pair, and I think that was my main source of confusion for the intention of the filter.


@jrm5100 I don't think we want to exclude records in a pair where a mate maps to a different reference sequence. Although these pairs would not be proper, they should still count towards coverage when requesting a pile up of reads overlapping either of the reads in the pair.

But if you do not want to include pairs with reads mapped to different reference sequences, you can add a custom filter like:

PileupBuilder(source).withReadFilter(rec => rec.refName == rec.mateRefName)

Yeah, after reading through the code more I realized that and settled on

PileupBuilder(source).withReadFilter(rec => rec.isFrPair)

Which is slightly more strict, but works for my purposes.

clintval commented 4 months ago

Yes, this was the original intent of the filter:

Keep those not isFrPair but when they are isFrPair, make sure they aren't outside the insert region

So I need to revert the change I just made and update the documentation to be more clear. The original rec.isFrPair wasn't redundant! I've just gotten myself confused along the way.