Closed abracarambar closed 3 years ago
I think you just need to drop the (?P<discard_2>.+$) from the end of your regex pattern.
That last group will only match when there are one or more bases after the UMI, which I suspect only happens if the UMI is 13 nt due to an insertion :man_shrugging:
So your pattern becomes .+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})$ instead.
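A minimal sketch of why the trailing group fails, using the example read from the question. It assumes exact matching only: the {s<=2} term is left out because Python's standard re module does not support fuzzy matching, and re.match is used here for illustration rather than as a statement of how umi_tools applies the pattern internally.

```python
import re

# Read sequence from the question: adapter followed by a 12 nt UMI at the 3' end.
read = "ANCACCGATGGAATGGCTTGGAGAAACTGTAGGCACCATCAATCGCCAGTGTAAG"

# Original pattern (fuzzy term removed): the last group needs at least one base after the UMI.
original = r".+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>.+$)"

# Corrected pattern (fuzzy term removed): the UMI is anchored to the end of the read.
corrected = r".+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})$"

original_match = re.match(original, read)
corrected_match = re.match(corrected, read)

print(original_match)                  # no match: nothing follows the 12 nt UMI
print(corrected_match.group("umi_1"))  # the 12 bases after the adapter
```

With this read, the original pattern has nothing left for discard_2 to consume, so the whole match fails, while the corrected pattern captures the last 12 bases as the UMI.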
Getting regexes right can be tricky. I normally use a tester such as https://regex101.com/ to try them out. It also lets you 'debug' your regex by stepping through the string to see where the match fails. Note that the website uses the base re module, which doesn't support fuzzy matching, so you need to take out the {s<=2} when testing.
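The same kind of check can be run locally with the base re module. The sketch below is only an illustration: strip_fuzzy is a hypothetical helper that removes simple fuzzy terms such as {s<=2} (it is not a full parser of the fuzzy syntax), and the second read is made up by introducing one substitution into the adapter.

```python
import re


def strip_fuzzy(pattern: str) -> str:
    """Hypothetical helper: drop simple fuzzy terms like {s<=2} or {e<=1}."""
    return re.sub(r"\{[eisd]<=\d+\}", "", pattern)


fuzzy_pattern = ".+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})$"
plain_pattern = strip_fuzzy(fuzzy_pattern)

prefix = "ANCACCGATGGAATGGCTTGGAGA"
umi = "CGCCAGTGTAAG"

reads = {
    # Read from the question, with an exact copy of the adapter.
    "example_read": prefix + "AACTGTAGGCACCATCAAT" + umi,
    # Made-up read with one substitution in the adapter (CAAT -> CTAT).
    "adapter_mismatch": prefix + "AACTGTAGGCACCATCTAT" + umi,
}

# Exact matching only: reads with adapter mismatches are missed here,
# even though the fuzzy pattern would tolerate up to 2 substitutions.
results = {name: re.match(plain_pattern, seq) is not None for name, seq in reads.items()}

for name, matched in results.items():
    print(name, matched)
```

Keep in mind that a read failing this exact-match test may still match the real pattern once {s<=2} is back in place, so the test is useful for checking the structure of the pattern rather than for predicting match counts.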
Did this work?
Thank you so much, it did! I wasn't sure if there would be trailing adapters after the UMI but that does not seem to be the case.
Hi, I am trying to extract the UMI from a QIAseq miRNA library (single-end). Here is an example read:
@A00152:427:HFNTNDRXY:1:2101:7220:1000 1:N:0:NTCACTATGT+CTACCGAATT
ANCACCGATGGAATGGCTTGGAGAAACTGTAGGCACCATCAATCGCCAGTGTAAG
The adapter is AACTGTAGGCACCATCAAT and the UMI is the 12 nt that follow it (CGCCAGTGTAAG).
I am using the following command:
umi_tools extract --extract-method=regex --bc-pattern=".+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.+$)" -I sample.fastq.gz -S sample.ed.fastq.gz
But the regex only matches a small portion of the reads in the fastq file:
2021-07-07 13:00:37,954 INFO regex does not match read1: 22298307
2021-07-07 13:00:37,954 INFO regex matches read1: 156155
The adapter AACTGTAGGCACCATCAAT is definitely detected in most reads if I use cutadapt. Is there something else that is not specified properly in my command? Thanks for the help.