CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks

umi_tools extract: UMI on 5' read 1 and 3' read 2 #623

Closed: lzt5269 closed this issue 1 month ago

lzt5269 commented 10 months ago

Hi,

I'm working on paired-end data. Read 1 has a 10-base UMI at the beginning, and read 2 has a 10-base UMI at the end that is the reverse complement of the UMI on read 1. How should I extract the UMIs and remove them from both reads?

Thanks.

TomSmithCGAT commented 9 months ago

Hi @lzt5269,

Sorry for the slow reply on this one. This is outside the expected functionality of UMI-tools, but I think you can achieve it with the following command. It uses regex pattern matching, which is slower for simple UMI extractions but allows more flexibility. Here, we specify that the UMI of read 1 is the first 10 characters (bases) of the read (--bc-pattern='(?P<umi_1>.{10}).*'). For read 2, we give a pattern that doesn't include any UMI group, just a group to discard, which is the last 10 bases (--bc-pattern2='.*(?P<discard_1>.{10})').

umi_tools extract \
  --extract-method=regex \
  --bc-pattern='(?P<umi_1>.{10}).*' \
  --bc-pattern2='.*(?P<discard_1>.{10})' \
  -L test.log \
  --read2-in=<PATH TO READ2 FILE> \
  --stdin=<PATH TO READ1 FILE> \
  --read2-out=<PATH TO READ2 OUTFILE> |
  gzip > <PATH TO READ1 OUTFILE>

I recommend manually checking that the above gives you the expected output for the first read pair.
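One way to do that check by hand is to apply the same two patterns to a single read pair in Python and compare the result with what umi_tools extract wrote. This is only a minimal sketch: the sequences are made up, the helper name expected_trim is hypothetical, and it uses Python's built-in re module rather than UMI-tools itself, so it illustrates what the patterns capture, not how UMI-tools is implemented.

```python
import re

# Same patterns as passed to --bc-pattern and --bc-pattern2.
READ1_PATTERN = re.compile(r"(?P<umi_1>.{10}).*")
READ2_PATTERN = re.compile(r".*(?P<discard_1>.{10})")


def expected_trim(read1_seq, read2_seq):
    """Return (umi, trimmed_read1, trimmed_read2) for one read pair.

    Hypothetical helper for manual checking only.
    """
    m1 = READ1_PATTERN.fullmatch(read1_seq)
    m2 = READ2_PATTERN.fullmatch(read2_seq)
    if m1 is None or m2 is None:
        raise ValueError("each read must be at least 10 bases long")
    umi = m1.group("umi_1")
    # Read 1 keeps everything after the UMI group.
    trimmed_read1 = read1_seq[m1.end("umi_1"):]
    # Read 2 keeps everything before the discarded group.
    trimmed_read2 = read2_seq[:m2.start("discard_1")]
    return umi, trimmed_read1, trimmed_read2


# Made-up pair: read 1 = 10-base UMI + 8-base insert,
# read 2 = reverse complement of that insert + reverse complement of the UMI.
umi, r1, r2 = expected_trim("AACCGGTTAC" + "GATTACAG", "CTGTAATC" + "GTAACCGGTT")
print(umi, r1, r2)
```

In the real output, the read 1 and read 2 sequences for the first pair should be shortened in the same way, and the extracted UMI should appear in the read names.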

Of course, the ideal solution would be to use the two UMIs to correct any sequencing errors in them and obtain a consensus UMI sequence. However, I expect the benefit would be small relative to the effort required.
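For illustration, the consensus idea could look roughly like the Python sketch below. This is not a UMI-tools feature, and it would need the read 2 UMI to be kept rather than discarded. The function names are hypothetical, and the rule used at mismatches (keep the base with the higher quality character, preferring read 1 on ties) is an assumption chosen for simplicity.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus_umi(umi1, qual1, umi2_raw, qual2_raw):
    """Merge the read 1 UMI with the reverse-complemented read 2 UMI.

    umi2_raw and qual2_raw are given as they appear at the end of read 2.
    Quality strings are FASTQ quality characters; comparing the characters
    directly is enough because both reads share the same encoding.
    """
    umi2 = reverse_complement(umi2_raw)
    qual2 = qual2_raw[::-1]  # qualities must be reversed along with the bases
    if not (len(umi1) == len(qual1) == len(umi2) == len(qual2)):
        raise ValueError("UMIs and quality strings must have equal length")
    merged = []
    for b1, q1, b2, q2 in zip(umi1, qual1, umi2, qual2):
        # Assumed rule: on disagreement keep the higher-quality base,
        # preferring read 1 when the qualities are equal.
        merged.append(b1 if b1 == b2 or q1 >= q2 else b2)
    return "".join(merged)


# Made-up example: read 1 has a low-quality error (T) at the fifth base.
print(consensus_umi("AACCTGTTAC", "IIII#IIIII", "GTAACCGGTT", "IIIIIIIIII"))
```

Here the low-quality T in the read 1 UMI is replaced by the G supported by read 2.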

TomSmithCGAT commented 8 months ago

@lzt5269 - Did the above work?