fulcrumgenomics / fgsv

Tools to gather evidence for structural variation via breakpoint detection.
MIT License

Reducing EvidenceType to [SplitRead, ReadPair], adding lots of tests, support for unpaired data #4

Closed tfenne closed 2 years ago

tfenne commented 2 years ago

This PR got a bit bigger than I had originally intended. There are a few major changes rolled into one PR:

  1. Reducing EvidenceType down to just SplitRead and ReadPair, and everything that follows from that. As part of this I restructured the determine* methods in SvPileup to be called is* and return booleans, as the construction of the BreakpointEvidence is the same in all cases and can be pulled up into the calling method. I also collapsed isIntraContig and isOddPair because a) they've gotten simpler and b) I kept confusing myself as to which method should identify which kind of breakpoint where the ends are on the same contig (e.g. where does detection of the following belong: seg1=chr1:500:F, seg2=chr1:200:F?)
  2. While doing this I noticed what I think is a large inconsistency in how AlignedSegment extraction was working, and a related bug in how "odd pair" detection was working. The short version is that, depending on the branch you took (which was based on the number of supplementary alignments), R2 segments were sometimes strand-flipped and sometimes not, making accurate odd-pair detection impossible. And the odd-pair detection that was there relied on the segments from R2 not being strand-flipped. I've cleaned this up so that R2 records are always strand-flipped (except as described in 4 below).
  3. This made me really nervous, so I wrote a whole bunch of tests that start at findBreakpoints(Template) so I could be sure that the whole process, from converting SamRecords to AlignedSegments through to detection of breakpoints, was working, and fixed up a few things I found along the way. I'm now much more confident it's doing the right thing.
  4. I would really like the tool to be able to make use of unpaired data and PE data where one end is either unmapped or has poor mapping quality. E.g. if you have a large insertion next to a rearrangement causing, say, R2 to be unmapped, it would be good to be able to call the rearrangement from R1. To this end I've pulled all read filtering up out of AlignedSegment and into SvPileup.filterTemplate(), which checks to see which primary reads are acceptable and then filters those and the corresponding supplementary reads appropriately. I also added a bunch of tests for this.
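
To make the filtering idea in item 4 concrete, here is a minimal sketch of how template-level filtering could look. It is not the code in this PR: `Rec`, `Template`, `FilterSketch`, `primaryOk` and the `minMapq` parameter are hypothetical, simplified stand-ins, and applying the same mapping-quality check to supplementary reads is an assumption made purely for illustration.

```scala
// Hypothetical, simplified stand-ins for the real record and template types.
final case class Rec(name: String, mapped: Boolean, mapq: Int)

final case class Template(
  r1: Option[Rec],
  r2: Option[Rec],
  r1Supps: Seq[Rec],
  r2Supps: Seq[Rec]
)

object FilterSketch {
  /** Assumed acceptance rule: the read is mapped with sufficient mapping quality. */
  def primaryOk(rec: Rec, minMapq: Int): Boolean =
    rec.mapped && rec.mapq >= minMapq

  /** Keeps each acceptable primary read together with its supplementary reads.
    * If a primary is rejected, its supplementaries are dropped with it.
    * Returns None when neither primary read is usable.
    */
  def filterTemplate(t: Template, minMapq: Int): Option[Template] = {
    val r1 = t.r1.filter(primaryOk(_, minMapq))
    val r2 = t.r2.filter(primaryOk(_, minMapq))

    if (r1.isEmpty && r2.isEmpty) None
    else Some(Template(
      r1      = r1,
      r2      = r2,
      // Assumption: supplementaries must pass the same check as primaries.
      r1Supps = if (r1.isDefined) t.r1Supps.filter(primaryOk(_, minMapq)) else Seq.empty,
      r2Supps = if (r2.isDefined) t.r2Supps.filter(primaryOk(_, minMapq)) else Seq.empty
    ))
  }
}
```

With this shape, a pair whose R2 is unmapped still yields a template containing R1 and its supplementary reads, so a rearrangement can in principle be called from R1 alone, which is the scenario described in item 4.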