Open eboyden opened 1 year ago
I am not opposed to allowing the user to specify the order of functions and having the current be the default.
because currently MD/NM/UQ tag repair is required).
What if you only repaired on the second pass?
That gets to my question in https://github.com/fulcrumgenomics/fgbio/issues/877 - I wasn't sure if tag repair was required for subsequent steps to behave properly. If not, then making tag repair optional in all tools would certainly be beneficial, since then you can just do it once at the end without a lot of extraneous upstream sorting. In this case, I don't know if simply running ClipBam twice in succession would be any slower than calling both functions (in the preferred order) in a single ClipBam instance. Being able to call both functions simultaneously would be a bit simpler though.
If at this point you think it's too much effort for too little benefit to make the function order customizable (or just switching the order, since I can't think of any case where upstream --clip-overlapping-reads
doesn't make downstream --clip-bases-past-mate
irrelevant, so I can't imagine anyone wanting to run both functions in that order), then I would suggest at least making the function order clear in the documentation.
I think all of those are good suggestions, I’ll again be transparent (and it’s totally fine to ask questions/suggestions) that supporting fgbio is on a volunteer basis and right now personally I’m prioritizing client work.
@nh13 - I'll contribute this feature.
ClipBam
--clip-overlapping-reads
can occasionally orphan one read in a pair if it is much longer than its mate and extends past the mate's start, because equal trimming causes the mate to be trimmed to length 0. The new function--clip-bases-past-mate
solves this problem when ClipBam is run twice sequentially, which is great but requires 2X sorting in-between (because currently MD/NM/UQ tag repair is required). Both functions can be called simultaneously, but testing shows that the order of operations is--clip-overlapping-reads
before--clip-bases-past-mate
(or possibly--clip-overlapping-reads
supercedes--clip-bases-past-mate
since the former renders the latter superfluous). It would be better if--clip-bases-past-mate
occurred first, which would ensure that paired reads (virtually) never become orphaned regardless of their respective lengths.