Initial implementation of shuffle command

sjbaker47 commented 2 years ago

I think we might want to talk about paired reads again -- not sure if our understandings are in alignment. (No pun intended.) Otherwise, let me know of any issues.

haghshenas commented 2 years ago

As you were suspecting, this implementation does not handle paried-end reads properly. In fact, this doesn't take into account supplementary alignments either. I have added codes that helps to iterate through BAM records of each read properly while taking supplementary alignments into account. That codes is currently in 'bugfix_supplementary' together with bunch of other bugfixes that are critical. We should have a chat when I merge that branch so that you can fix your implementation.

haghshenas commented 2 years ago

We had a discussion with Shawn today and we confirmed that this should be updated to support paired-end and supplementary alignments. @sjbaker47 please let us know whenever the fixes are ready and we can continue reviewing. Thanks.

haghshenas commented 2 years ago

@sjbaker47 I think the way you've used hashes for handling paired-end and supplementary alignments is smart and makes things easy, but might be too simplistic, i.e. does not handle complex cases. Here I drew one example for a chimeric alignment (I know we don't really care about chimeric alignments for the task of quantification using Salmon, but it would be nice to have a general purpose mudskipper).

The transcriptomic alignments of this read would look like this:

qid:q1, tid:txp1, pos:50,  mate_tid:txp2, mate_pos:100
qid:q1, tid:txp2, pos:100, mate_tid:txp1, mate_pos:50
qid:q1, tid:txp1, pos:50,  mate_tid:txp3, mate_pos:100
qid:q1, tid:txp3, pos:100, mate_tid:txp1, mate_pos:50

But I think your sorting function will sort them this way:

qid:q1, tid:txp1, pos:50,  mate_tid:txp2, mate_pos:100
qid:q1, tid:txp1, pos:50,  mate_tid:txp3, mate_pos:100
qid:q1, tid:txp2, pos:100, mate_tid:txp1, mate_pos:50
qid:q1, tid:txp3, pos:100, mate_tid:txp1, mate_pos:50

Explanation: they all have the same min(tid, mate_tid) and min(pos, mate_pos) but the position for the first mate is always smaller than the second mate. If what I said doesn't make sense, then I probably didn't understand your sorting function.

I could think of other edge cases as well, but for now this should give you the idea. I personally think you should sort by query id and then do a post-processing to put alignments of a single query in the right order based on paired information and SA tag. I think we might benefit from involving @gmarcais and @rob-p to this discussion.

sjbaker47 commented 2 years ago

I see what you mean. We could deterministically concatenate the two fields (e.g. hash(concat(sort(tid, mtid))), and that would take care of any ambiguity between paired reads spanning chromosomes that coincidentally have the same position with another paired mapping on a different chromosome with the same read name. (Writing that all out, it seems deathly unlikely to actually occur, but it would be an near undetectable bug if it did, lol).

OceanGenomics / mudskipper

Initial implementation of shuffle command #11