umi_tools extract: UMI on 5' read 1 and 3' read 2

lzt5269 commented 10 months ago

Hi,

I'm working on paired-end data. Read 1 has a 10-base UMI at the beginning, and read 2 has a 10-base UMI at the end that is the reverse complement of the UMI on read 1. How should I extract the UMI and remove it from both reads?

Thanks.

TomSmithCGAT commented 9 months ago

Hi @lzt5269,

Sorry for the slow reply on this one. This is outside the expected functionality of UMI-tools, but I think you can achieve it with the command below. It uses regex pattern matching, which is slower than the default method for simple UMI extractions but allows more flexibility. Here, we specify that the UMI of read 1 is the first 10 characters (bases) of the read (--bc-pattern='(?P<umi_1>.{10}).*'). For read 2, we give a pattern that doesn't include any UMI group, just a group to discard, which is the last 10 bases (--bc-pattern2='.*(?P<discard_1>.{10})').

umi_tools extract \
  --extract-method=regex \
  --bc-pattern='(?P<umi_1>.{10}).*' \
  --bc-pattern2='.*(?P<discard_1>.{10})' \
  -L test.log \
  --read2-in=<PATH TO READ2 FILE> \
  --stdin=<PATH TO READ1 FILE> \
  --read2-out=<PATH TO READ2 OUTFILE> |
gzip > <PATH TO READ1 OUTFILE>

I recommend manually checking that the above gives the expected output for the first read pair.
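To see what the two patterns should capture before running the full command, here is a minimal Python sketch that applies them to a made-up read pair. It uses the standard `re` module rather than UMI-tools itself, and it assumes that every named group (`umi_*` or `discard_*`) is cut out of the read while the rest is kept. The `split_read` helper and the toy sequences are hypothetical, for illustration only.

```python
import re

# The same patterns passed to --bc-pattern and --bc-pattern2
BC_PATTERN = r"(?P<umi_1>.{10}).*"
BC_PATTERN2 = r".*(?P<discard_1>.{10})"


def split_read(pattern, seq):
    """Return (named groups, remaining sequence), or None if no match.

    Assumption: every named group is removed from the read and all
    other bases are kept, in their original order.
    """
    match = re.match(pattern, seq)
    if match is None:
        return None
    keep = [True] * len(seq)
    for name in match.groupdict():
        start, end = match.span(name)
        for i in range(start, end):
            keep[i] = False
    remaining = "".join(base for base, k in zip(seq, keep) if k)
    return match.groupdict(), remaining


# Toy read pair: the UMI is ACGTACGTAC, and read 2 ends with its
# reverse complement (GTACGTACGT).
read1 = "ACGTACGTAC" + "TTTTGGGGCCCC"
read2 = "GGGGCCCCAAAA" + "GTACGTACGT"

print(split_read(BC_PATTERN, read1))
print(split_read(BC_PATTERN2, read2))
```

Comparing these results with the first record in each output FASTQ (UMI appended to the read name, 10 bases trimmed from the start of read 1 and from the end of read 2) is one way to do the manual check.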

Of course, the ideal solution would be to use the two UMIs to correct any sequencing errors in them and obtain a consensus UMI sequence. However, I expect the benefit would be small for the effort required.
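For anyone who does want to try the consensus idea, a rough sketch of how it could work outside UMI-tools is given here. This is not a UMI-tools feature. It assumes Phred+33 quality strings, a fixed 10-base UMI, and a simple rule of taking the higher-quality base at each position (ties go to read 1). The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus_umi(seq1, qual1, seq2, qual2, umi_len=10):
    """Return (consensus UMI, number of mismatches) for one read pair.

    Assumptions: the UMI is the first `umi_len` bases of read 1 and the
    reverse complement of the last `umi_len` bases of read 2; quality
    strings are Phred+33, so a higher ASCII character means a higher
    quality.
    """
    umi1, q1 = seq1[:umi_len], qual1[:umi_len]
    umi2 = reverse_complement(seq2[-umi_len:])
    q2 = qual2[-umi_len:][::-1]  # reverse qualities to match umi2
    consensus = []
    mismatches = 0
    for b1, p1, b2, p2 in zip(umi1, q1, umi2, q2):
        if b1 != b2:
            mismatches += 1
        # Keep the base with the higher quality; ties go to read 1
        consensus.append(b1 if p1 >= p2 else b2)
    return "".join(consensus), mismatches


# Toy pair: read 1 has a low-quality error in the last UMI base
seq1, qual1 = "ACGTACGTAA" + "TTTT", "IIIIIIIII#" + "IIII"
seq2, qual2 = "GGGG" + "GTACGTACGT", "I" * 14
print(consensus_umi(seq1, qual1, seq2, qual2))
```

Pairs with many mismatches between the two UMIs could be flagged or dropped rather than corrected; that threshold is a design choice this sketch leaves open.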

TomSmithCGAT commented 8 months ago

@lzt5269 - Did the above work?

CGATOxford / UMI-tools

umi_tools extract: UMI on 5' read 1 and 3' read 2 #623