CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMItools dedup: It seems that when using '--unmapped/unpaired-reads=use' the unmapped read of a pair is discarded and the mapped counterpart is retained. This leads to genuine singletons/orphan reads (no paired read at all in the dedup output) when in fact the read had a partner in the input file, which had an unaligned status. Is this the intended behaviour of the tool? #520

Closed alexander-e-f-smith closed 6 months ago

alexander-e-f-smith commented 2 years ago

UMItools dedup: It seems that when using '--unmapped/unpaired-reads=use' the unmapped read of a pair is discarded and the mapped counterpart is retained. This leads to genuine singletons/orphan reads (no paired read at all in the dedup output) when in fact the read had a partner in the input file, which had an unaligned status. Is this the intended behaviour of the tool?

Originally posted by @alexander-e-f-smith in https://github.com/CGATOxford/UMI-tools/issues/519#issuecomment-1060982986

alexander-e-f-smith commented 2 years ago

Thanks for your help. Following on from my previous questions: when using '--unmapped/unpaired-reads=use', can you confirm how singleton reads are deduplicated when they are encountered? Are these singletons assessed for duplication (grouped) against all reads, or only against other singleton reads (using whichever read of the pair is mapped)? If the former, could there be cases where a single UMI/duplicate group contains a mixture of singletons and proper pairs, either of which could be selected based on (default) mapping quality? On a related matter, is there a recommended running procedure when dealing with/requiring unmapped/unpaired reads, e.g. selecting something other than the --directional grouping method? This would in part be to counter performance issues.

IanSudbery commented 2 years ago

Under normal circumstances, UMI-tools uses the read1 pos and template fields to group reads to be considered for UMI clustering. This continues to be true when --unmapped/unpaired-reads=use is set. What this means is that singleton read1s will have their position recorded as (pos, ""), and thus will only be clustered with other reads that have their position information as (pos, "").
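To make the grouping rule above concrete, here is a minimal Python sketch. It is not UMI-tools code: the function names, the tuple layout of the toy reads, and the `mate_mapped` field are all hypothetical, and the real grouping key may include further fields. It only illustrates that a singleton keyed as (pos, "") can never land in the same bundle as a paired read keyed as (pos, template).

```python
from collections import defaultdict


def grouping_key(pos, tlen, mate_mapped):
    """Hypothetical grouping key: (pos, template) for pairs, (pos, "") for singletons."""
    return (pos, tlen) if mate_mapped else (pos, "")


def bundle_reads(reads):
    """Bundle toy reads by key; UMI clustering would then run within each bundle.

    Each read is a tuple (name, pos, tlen, mate_mapped, umi). This layout is an
    assumption made for illustration only.
    """
    bundles = defaultdict(list)
    for name, pos, tlen, mate_mapped, umi in reads:
        bundles[grouping_key(pos, tlen, mate_mapped)].append((name, umi))
    return dict(bundles)


reads = [
    ("pairA", 100, 250, True, "ACGT"),    # read1 of a properly mapped pair
    ("pairB", 100, 250, True, "ACGT"),    # same position and template: same bundle
    ("singleC", 100, 0, False, "ACGT"),   # singleton: mate unmapped
    ("singleD", 100, 0, False, "ACGA"),   # singleton at the same position
]

bundles = bundle_reads(reads)
for key, members in bundles.items():
    print(key, members)
```

In this toy input the two paired reads share one bundle and the two singletons share another, even though all four start at the same position and three carry the same UMI.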

On a related matter, is there a recommended running procedure when dealing with/requiring unmapped/unpaired reads, e.g. selecting something other than the --directional grouping method?

My personal instinct is to filter out unmapped/unpaired reads unless there is a specific reason to keep them. Unfortunately, I don't think there are any parameters that can be tweaked that would improve performance.
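One way to apply such a filter before deduplication is to test the SAM FLAG field of each record. The sketch below is only an illustration under that assumption: the helper name is hypothetical, while the bit values are the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). It keeps a read only if it is paired and both it and its mate are mapped.

```python
# Standard SAM FLAG bits used by this sketch.
PAIRED = 0x1         # template has multiple segments (read is paired)
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate of this read is unmapped


def is_fully_mapped_pair(flag: int) -> bool:
    """Return True if the read is paired and neither it nor its mate is unmapped.

    Hypothetical helper for pre-filtering; it is not part of UMI-tools.
    """
    return bool(flag & PAIRED) and not (flag & (UNMAPPED | MATE_UNMAPPED))


# A few example FLAG values:
print(is_fully_mapped_pair(99))   # paired, proper pair, mate reverse, first in pair
print(is_fully_mapped_pair(73))   # paired, mate unmapped, first in pair
print(is_fully_mapped_pair(133))  # paired, read unmapped, second in pair
print(is_fully_mapped_pair(0))    # unpaired, mapped
```

The same selection can usually be expressed on the command line with `samtools view -f 1 -F 12`, which requires the paired bit and excludes records with either unmapped bit set.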

TomSmithCGAT commented 6 months ago

Closing due to inactivity