Feature request: deduplication of read pairs

ju-mu commented 3 years ago

For PE sequencing it would be essential if umicollapse could deduplicate pairs of reads within a UMI group rather than treat reads of a pair as separate UMI groups.

As an example, the implementation by umi_tools could be used which seems simple and efficient as it is based on the TLEN field to group by fragment length.

For chimeric and unpaired reads, it would be great if these reads could be optionally removed, used or printed out like implemented by umi_tools: https://github.com/CGATOxford/UMI-tools/pull/312

Many thanks!

Daniel-Liu-c0deb0t commented 3 years ago

I am working on paired-end mode right now (group reads by tlen and deduplicate based on the first read of each pair). I don't know why the result does not exactly match the results of UMI-tools. This current paired-end implementation discards chimeric, unpaired reads, and unmapped reads.

ju-mu commented 3 years ago

Fantastic, thanks!

Here are a few random thoughts on why the results might not exactly match umitools:

For ties, it might be important to provide a seed value to UMI-tools via --random-seed=123456789, if you haven't done so already
On https://github.com/CGATOxford/UMI-tools/issues there are a few discussions which make the paired end handling more transparent than looking at the code: 428 -> forward and reverse read pair handling 347 -> handling of multi mapped reads including small test dataset! 180 -> general remarks regarding paired read handling 357 -> Discussion regarding option --ignore-tlen, which groups pairs without matching TLENs

If time, memory and performance considerations allow, umicollapse would be superior to the umitools implementation of PE handling if it would take the average read quality of both reads in a pair into account.

Really looking forward to the new version!

Daniel-Liu-c0deb0t commented 3 years ago

Okay, paired-end handling should be good now. The main problem with the difference between UMICollapse and UMI-tools was actually that UMI-tools had an inconsistency with how unmapped PE reads are handled, and I've opened a PR in their repo: https://github.com/CGATOxford/UMI-tools/pull/443

Of course there are some limitations. Internally, PE handling is done by grouping by tlen and grouping PE reads by read 1. This works in a similar way as UMI-tools, so taking the average read quality of both paired reads into account is going to be difficult to implement (for future reference, this requires significantly restructuring the way BAM records are read, to match up pairs before deduplicating). Additionally, chimeric, unmapped, and unpaired reads are discarded. Adding flags for this should not be too hard, but a bunch of assumptions in the code need to be updated.

The main problem for me is that I don't have time to keep working on adding new features and testing them with UMI-tools in the future. I'd love to, but college is busy... Also, straying from the original goal of making deduplication fast at a single alignment will likely lead to a lot of difficulties due to how the code is currently structured. I'll try my best to resolve any bugs in the current feature set when they get reported, though. I'm not sure if UMICollapse is useful to you at its current stage, but I appreciate your help so far on testing UMICollapse and I am glad that you have used it.

I'll leave this issue open in case I decide to pick up on the rest of the features in the future.

ju-mu commented 3 years ago

Thanks a lot and I completely understand your points.

For my specific use case i.e. increasing efficiency of my UMI enabled variant detection pipeline, it would be a big step forward from a rather slow 2 step process umitools -> picard MarkDuplicates to a fast single tool process and I do not mind removing unmapped reads. Unfortunately though, I am relying on chimeric reads and it would also be great if the unpaired reads could be kept. Ideally the chimeric reads should be subjected to the paired end de-duplication, however the single end deduplication of those read pairs would be also OK.

So maybe until you have more time to spare, you could just print out the removed reads to a separate discarded.bam file if possible? Running this file in SE mode and merging back could yield the desired output.

Daniel-Liu-c0deb0t commented 3 years ago

I've made removing unpaired and chimeric reads optional! I just wasn't satisfied with leaving this unfinished.

Please give it a try and let me know if it works.

Daniel-Liu-c0deb0t / UMICollapse

Feature request: deduplication of read pairs #5