CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Handling and outputting unmapped reads in dedup module #516

Closed alexander-e-f-smith closed 1 year ago

alexander-e-f-smith commented 2 years ago

Hi, is it possible to output unaligned reads from the umi_tools dedup module, and not just from the group module? I'm trying to maximize the use of non-aligned reads (STAR) for fusion calling, through UMI deduplication. My best solution so far is to extract the unaligned reads prior to UMI dedup and then merge them back into the data afterwards. I have tried '--unmapped-reads=use' and '--unpaired-reads=use', but these seem to: A. cause memory use to balloon; B. not actually make use of many relevant partially unaligned/improperly aligned reads, as seen in the downstream data supporting fusion reads, compared to 'discarding' these reads through that dedup module option. It would be nice to output the unaligned reads that the dedup module hasn't successfully used (when running '--unmapped-reads=use', for example), rather than not bothering to try to dedup these reads at all. Best, A

IanSudbery commented 2 years ago

@TomSmithCGAT Do you remember why we don't allow the output of all unmapped reads in dedup?

TomSmithCGAT commented 2 years ago

It's essentially a design choice on our part. dedup is explicitly designed to return a deduplicated BAM, which requires the reads to be mapped, so there's no scope to return unmapped reads. group is designed to group reads by their UMI, but we've allowed a slightly wider scope so that ungrouped reads can also be returned.

From the bundle_iterator, group can handle reads which are not 'grouped':

https://github.com/CGATOxford/UMI-tools/blob/5c2dd0fd208df3a8f93399c99b1d164aef8094be/umi_tools/group.py#L237-L245

Whereas dedup cannot: https://github.com/CGATOxford/UMI-tools/blob/5c2dd0fd208df3a8f93399c99b1d164aef8094be/umi_tools/dedup.py#L309-L311

TomSmithCGAT commented 2 years ago

@alexander-e-f-smith, if you do want to retain a mixture of deduplicated mapped reads and undeduplicated unmapped reads, your approach of extracting these and then adding back post-deduplication seems the best route.
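The extract-then-add-back route described here can be sketched in plain Python on SAM text records. This is only an illustration, not part of UMI-tools: the function names `split_by_mapping` and `add_back` are hypothetical, and it assumes the records are tab-separated SAM lines where bit 0x4 of the FLAG field marks an unmapped read.

```python
# Sketch only: these helpers are hypothetical and not part of UMI-tools.
# Assumes tab-separated SAM text lines; FLAG is the second column.

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def split_by_mapping(sam_lines):
    """Return (headers, mapped, unmapped) from an iterable of SAM lines."""
    headers, mapped, unmapped = [], [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header lines are kept separately so they can be written once.
            headers.append(line)
            continue
        flag = int(line.split("\t")[1])
        (unmapped if flag & UNMAPPED else mapped).append(line)
    return headers, mapped, unmapped


def add_back(headers, deduplicated, unmapped):
    """Concatenate deduplicated mapped records with the set-aside unmapped ones."""
    return headers + deduplicated + unmapped
```

In this sketch only the `mapped` records would be passed through deduplication, and the `unmapped` records are appended unchanged afterwards. The combined output would probably need re-sorting before downstream tools use it.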

If you need a very bespoke deduplication process, you might need to implement this yourself. You can make use of the UMI-tools API to use the same underlying deduplication algorithm (https://umi-tools.readthedocs.io/en/latest/API.html)

IanSudbery commented 2 years ago

Oh yeah, it didn't cross my mind to just add back the unmapped reads at the end.

alexander-e-f-smith commented 2 years ago

Thanks both. Can you confirm that umi_tools dedup will attempt to dedup the one read of a pair that is aligned (and throw the unaligned mate away)? If so, it makes sense to recover the unmapped mate after dedup (via the outputted singleton read names). However, will the dedup module only work on such read pairs if read1 specifically is aligned (and the read2 mate unaligned)?

IanSudbery commented 2 years ago

That is correct, yes.
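Recovering the unmapped mates by read name, as discussed above, could look roughly like the following. It is a minimal sketch under stated assumptions: `recover_unmapped_mates` is a hypothetical helper, not a UMI-tools function, and it assumes both mates of a pair share the same read name in the first SAM column and that the unmapped records were set aside before deduplication.

```python
# Sketch only: a hypothetical helper, not part of UMI-tools.
# Assumes tab-separated SAM text lines where mates share the same QNAME.


def recover_unmapped_mates(deduplicated, unmapped):
    """Keep unmapped records whose read name survived deduplication."""
    kept_names = {line.split("\t", 1)[0] for line in deduplicated}
    return [line for line in unmapped if line.split("\t", 1)[0] in kept_names]
```

Unmapped reads whose mate was removed as a duplicate are dropped here, so each retained singleton gets back exactly its own mate.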