fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
311 stars 67 forks source link

feature request: switch order of operations for ClipBam #878

Open eboyden opened 1 year ago

eboyden commented 1 year ago

ClipBam --clip-overlapping-reads can occasionally orphan one read in a pair if it is much longer than its mate and extends past the mate's start, because equal trimming causes the mate to be trimmed to length 0. The new function --clip-bases-past-mate solves this problem when ClipBam is run twice sequentially, which is great but requires 2X sorting in-between (because currently MD/NM/UQ tag repair is required). Both functions can be called simultaneously, but testing shows that the order of operations is --clip-overlapping-reads before --clip-bases-past-mate (or possibly --clip-overlapping-reads supercedes --clip-bases-past-mate since the former renders the latter superfluous). It would be better if --clip-bases-past-mate occurred first, which would ensure that paired reads (virtually) never become orphaned regardless of their respective lengths.

nh13 commented 1 year ago

I am not opposed to allowing the user to specify the order of functions and having the current be the default.

because currently MD/NM/UQ tag repair is required).

What if you only repaired on the second pass?

eboyden commented 1 year ago

That gets to my question in https://github.com/fulcrumgenomics/fgbio/issues/877 - I wasn't sure if tag repair was required for subsequent steps to behave properly. If not, then making tag repair optional in all tools would certainly be beneficial, since then you can just do it once at the end without a lot of extraneous upstream sorting. In this case, I don't know if simply running ClipBam twice in succession would be any slower than calling both functions (in the preferred order) in a single ClipBam instance. Being able to call both functions simultaneously would be a bit simpler though.

If at this point you think it's too much effort for too little benefit to make the function order customizable (or just switching the order, since I can't think of any case where upstream --clip-overlapping-reads doesn't make downstream --clip-bases-past-mate irrelevant, so I can't imagine anyone wanting to run both functions in that order), then I would suggest at least making the function order clear in the documentation.

nh13 commented 1 year ago

I think all of those are good suggestions, I’ll again be transparent (and it’s totally fine to ask questions/suggestions) that supporting fgbio is on a volunteer basis and right now personally I’m prioritizing client work.

JoeVieira commented 1 year ago

@nh13 - I'll contribute this feature.