optional UMI - GitHub issues

idot commented 3 years ago

Hello,

What is the best way to deal with optional UMIs, as in Smart-seq3 (zUMIs)? Here, in about 1/3 of cases, read 1 contains a 5' fixed sequence (ATTGCGCAATG) followed by an 8 bp UMI.

For now I have used a regex and concatenated the filtered-out files with the UMI-containing files.

    --extract-method=regex --bc-pattern='^(?P<discard_1>ATTGCGCAATG)(?P<umi_1>.{8}).*' \
    --filtered-out noumi.1.gz  \
    --filtered-out2 noumi.2.gz  \

a) Is there a clever regex that would do this without specifying --filtered-out? b) Is it possible to add some random sequence as UMI in the filtered-out files so that deduplication would not complain about the reads without UMI in the concatenated files?

Of course I could process the files afterwards with some scripts, but maybe UMI-tools has or will have this functionality? Ideally I could then simply specify these as optional arguments in the Nextflow core RNA-seq pipeline, without having to fork the whole pipeline or restructure it (deduplicate UMIs, then concatenate).
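As an illustration of the "process the files afterwards with some scripts" option in question b), here is a minimal Python sketch that appends a random placeholder UMI to the read name of every FASTQ record. It assumes standard 4-line FASTQ records and the `_` separator that `umi_tools extract` uses by default when it moves a UMI into the read name. The function name `add_placeholder_umi` and its parameters are hypothetical, not part of UMI-tools, and whether random UMIs are a sensible input for deduplication is a separate question.

```python
import random


def add_placeholder_umi(fastq_lines, umi_length=8, seed=None, sep="_"):
    """Return FASTQ lines with a random UMI appended to each read ID.

    Hypothetical helper, not a UMI-tools function. Assumes 4-line FASTQ
    records, so every 4th line (index 0, 4, 8, ...) is a header line.
    """
    rng = random.Random(seed)
    out = []
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            # Split the read ID from the optional comment after the first space.
            name, _, comment = line.partition(" ")
            umi = "".join(rng.choice("ACGT") for _ in range(umi_length))
            line = name + sep + umi + (" " + comment if comment else "")
        out.append(line)
    return out


record = ["@read1 1:N:0", "ACGTACGT", "+", "IIIIIIII"]
tagged = add_placeholder_umi(record, seed=0)
print(tagged[0])
```

For paired-end data both mates need the same UMI in their names. Calling the function with the same `seed` on read 1 and read 2 files that list the reads in the same order would give matching UMIs. Real files would be read with something like `gzip.open(path, "rt")`.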

thank you very much, ido

IanSudbery commented 3 years ago

If I understand correctly, some of the reads contain a UMI, and they are marked by ATTGCGCAATG, which is followed by an 8 nt UMI. These are what the paper refers to as the UMI reads, and they are formed when the 5' end of the pair was the 5' end of the original, pre-tagmentation molecule.

Other reads do not contain a UMI. These are what the paper refers to as "internal reads". They are formed when both ends of the molecule are the result of tagmentation events.

For the regex, I'm sure there is a clever one that would work, but unfortunately I don't know what it would be. For the dedup, there isn't currently a way to pass reads that don't contain a UMI through unchanged, but it wouldn't be hard to add something. But are you sure that is what you want to do?

I don't see that it would make sense to deduplicate UMI-containing and internal reads together: no deduplication is possible for non-UMI-containing reads (their mapping positions would not be informative). You could deduplicate the UMI-containing reads but not the internal reads, but all that would do is down-weight the contribution of the UMI-containing reads to the quantification. In the paper they say "generate expression profiles for both the 5′ ends containing UMIs as well as combined full-length and UMI data". They don't say whether deduplication was applied to the combined data, nor can I find any documentation for this on their GitHub repositories.

My guess would be that their "find_pattern" in the zUMIs config is doing what our discard group is doing and filtering out reads that don't match the pattern. This way they just do the deduplication analysis on UMI-containing reads. The combined set is then not deduplicated.
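To make the down-weighting point concrete, here is a small arithmetic sketch in Python. All counts are hypothetical numbers chosen for illustration, not data from the paper or from any real sample.

```python
def umi_fraction(umi_reads, internal_reads):
    """Fraction of a gene's combined read count that comes from UMI reads."""
    return umi_reads / (umi_reads + internal_reads)


# Hypothetical counts for a single gene (illustration only).
umi_reads_raw = 10     # UMI-containing reads before deduplication
umi_reads_dedup = 4    # the same reads after collapsing duplicate UMIs
internal_reads = 20    # internal reads, which cannot be deduplicated

before = umi_fraction(umi_reads_raw, internal_reads)
after = umi_fraction(umi_reads_dedup, internal_reads)
print(f"UMI read share before dedup: {before:.2f}, after dedup: {after:.2f}")
```

With these made-up numbers, deduplicating only the UMI reads halves their share of the combined count while the internal reads are counted in full, which is the down-weighting described above.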

Long and short:

I'm sure there is a way to specify an optional group with the regex, probably involving lookbehind groups, but it's beyond my regex-fu.
We can add a filter to dedup/group/count that just passes through non-UMI-containing reads if you want, but...
I'm not sure that's the right thing to do.
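For reference, one way to express an optional UMI in a regex is to wrap the anchor and UMI groups in an optional non-capturing group, rather than using lookaround groups. The sketch below only demonstrates how such a pattern behaves in Python's standard `re` module. Whether `umi_tools extract` accepts a pattern in which `umi_1` may not take part in the match has not been checked here, so treat that as an open assumption. The helper `split_read` and the extra `rest` group are hypothetical names for illustration.

```python
import re

ANCHOR = "ATTGCGCAATG"

# The anchor + 8 nt UMI are wrapped in (?: ... )? so the whole prefix is optional.
OPTIONAL_UMI = re.compile(
    r"^(?:(?P<discard_1>" + ANCHOR + r")(?P<umi_1>.{8}))?(?P<rest>.*)$"
)


def split_read(seq):
    """Return (umi, remaining_sequence); umi is None for reads without the anchor."""
    match = OPTIONAL_UMI.match(seq)
    return match.group("umi_1"), match.group("rest")


umi_read = ANCHOR + "AAAACCCC" + "GGGTTT"
internal_read = "GGGTTTACGT"

print(split_read(umi_read))       # UMI read: UMI and anchor are stripped
print(split_read(internal_read))  # internal read: left unchanged, UMI is None
```

A read that starts with the anchor but has fewer than 8 bases after it falls through to the no-UMI branch and is returned whole.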

idot commented 3 years ago

Thank you very much Ian

CGATOxford / UMI-tools

optional UMI #473