CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Extracting UMI into read header and removing trailing sequence #480

Closed abracarambar closed 3 years ago

abracarambar commented 3 years ago

Hi, I am trying to extract the UMI from a QIAseq miRNA library (single-ended). Here is an example read:

@A00152:427:HFNTNDRXY:1:2101:7220:1000 1:N:0:NTCACTATGT+CTACCGAATT
ANCACCGATGGAATGGCTTGGAGAAACTGTAGGCACCATCAATCGCCAGTGTAAG

The adapter is AACTGTAGGCACCATCAAT and the UMI is the 12 nt that follow it (CGCCAGTGTAAG).

I am using the following command:

umi_tools extract --extract-method=regex --bc-pattern=".+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.+$)" -I sample.fastq.gz -S sample.ed.fastq.gz

But it only manages to detect the regex in a small portion of the reads in the FASTQ file:

2021-07-07 13:00:37,954 INFO regex does not match read1: 22298307
2021-07-07 13:00:37,954 INFO regex matches read1: 156155

The adapter AACTGTAGGCACCATCAAT is definitely detected in most reads if I use cutadapt. Is there something else that is not specified properly in my command? Thanks for the help.

TomSmithCGAT commented 3 years ago

I think you just need to drop the (?P<discard_2>.+$) from the end of your regex pattern.

That last group will only match when there are one or more bases after the UMI, which I suspect only happens if the UMI is 13 nt due to an insertion :man_shrugging:

So your pattern becomes .+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})$ instead.

Getting regexes right can be tricky. I normally use e.g. https://regex101.com/ to test them out. This also allows you to 'debug' your regex by stepping through the string to see where the issue is. The above website uses the base re module, which doesn't support fuzzy matching, so you need to take out the {s<=2} when testing.
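The same check can also be run locally. Below is a minimal sketch using Python's standard re module with {s<=2} removed, as suggested above, so only exact adapter matches are tested. The names READ, ADAPTER, ORIGINAL, FIXED and try_pattern are made up for illustration, and applying the pattern with re.match is an assumption of this sketch rather than a statement about how umi_tools applies it internally.

```python
import re

# Example read sequence from the original post.
READ = "ANCACCGATGGAATGGCTTGGAGAAACTGTAGGCACCATCAATCGCCAGTGTAAG"
ADAPTER = "AACTGTAGGCACCATCAAT"

# Both patterns with the fuzzy {s<=2} removed, because the standard
# `re` module does not support fuzzy matching.
ORIGINAL = re.compile(
    r".+(?P<discard_1>" + ADAPTER + r")(?P<umi_1>.{12})(?P<discard_2>.+$)"
)
FIXED = re.compile(
    r".+(?P<discard_1>" + ADAPTER + r")(?P<umi_1>.{12})$"
)


def try_pattern(pattern, read):
    """Return the named groups if the pattern matches the read, else None."""
    match = pattern.match(read)
    return match.groupdict() if match else None


# The read ends right after the 12 nt UMI, so the trailing `.+` in the
# original pattern has nothing left to consume and the match fails.
original_result = try_pattern(ORIGINAL, READ)

# Without the trailing group, the UMI is anchored to the end of the read.
fixed_result = try_pattern(FIXED, READ)

# A hypothetical read with one extra base after the UMI: this is the only
# kind of read the original pattern can match.
longer_result = try_pattern(ORIGINAL, READ + "A")

print("original pattern:", original_result)
print("fixed pattern:", fixed_result)
print("original pattern, read + 1 nt:", longer_result)
```

On the example read, the original pattern gives no match, while the fixed pattern captures the last 12 nt as umi_1. Reads with up to two mismatches in the adapter are not covered by this exact-match test.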

TomSmithCGAT commented 3 years ago

Did this work?

abracarambar commented 3 years ago

Thank you so much, it did! I wasn't sure if there would be trailing adapters after the UMI but that does not seem to be the case.