Closed PavithraV0223 closed 9 months ago
Your regex contains the following elements:
^ - Start of the read
(?P
(?P
G{3,5} - between 3 and 5 Gs
(?P
$ - The end of the read.
So you specified that your construct should be Index5--12UMI--5Gs--Gene-of-interest-Index7
But your pattern specifies
discard--12UMI--3-5Gs--6CellBC--end
You have a cell barcode sequence that is not in your description of your construct, and no space for the gene of interest. You normally wouldn't need to specify space for the gene of interest, because the barcodes would be at the start of the read, and then everything after would be retained, but you've specifically specified that there isn't allowed to be any sequence after the barcodes (by using $).
Message ID: @.***>
Hey Ian, Thanks for your quick response. Is there a way to retain UMI , like grouping them based on the UMI for Nanopore data. using fastq file.
I'm not sure I quite follow. Do you want to discard the index, and move the UMI to the read name and group?
If so, then
umi_tools extract --extract-method=regex
--bc-pattern='(?P<discard_1>CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'
-I reads.fastq
Followed by mapping the reads according to your favourite method and then grouping with
umi_tools group -I reads.bam
should do it.
If you want to leave the UMI on the read then
umi_tools extract --extract-method=regex
--bc-pattern='(?P<discard_1>CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'
-I reads.fastq --read2-in=reads.fastq --read2-out=processed_reads.fastq --S
/dev/null
will do that, but it will also leave the index on the reads as well, which you'd have to remove with some other tool.
Hi Ian,
I have two questions regarding this command:
umi_tools extract --extract-method=regex
--bc-pattern='(?P<discard_1>CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'
-I reads.fastq
Thanks!!
Will the sequences upstream of the discard sequence (CTACACTCTTTCCCTACACGACGCTCTTCCGATCT) also be removed? If not, any way to remove them if they are with unknown length?
No. Although you could make it remove any sequence prior to the discard sequence by including .+
in the it:
'(?P<discard_1>.+CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'
However, UMI-tools isn't really designed as a general purpose adaptor trimmer, and you might find you get more flexibility/sensitivity removing these sequences using a tools designed for this purpose.
Since only one strand is sequenced in most nanopore data, will the reverse complement of the provided sequence be also checked?
No, only the provided sequence is checked, not the reverse complement. If you want to also check for the reverse complement, check for the forward sequence, capturing the reads that don't match into a seperate file. Then process this file using the reverse complement sequence. Take the two resulting processed files and concatenate them together.
Thank you, Ian!!
Hello, I'm trying to add in the regex method to extract the UMI and add them to the read name before building the consensus. I'm dealing with Nanopore data. My construct is something like this (R1--Index5--12UMI--5Gs--Gene-of-interest--R2--Index7--adapter). I'm trying to retain (12UMI Random bases + 5Gs + Gene of interest+ Index 7 (treating this as a cell barcode) . I also want to allow mismatch of 0.1 or 0.2 in the UMI and the 5 constant Gs which is followed by the UMI . Aiming to achieve - read name (UMI+Gs+cell barcode ----gene of interest )
The trimmed fastq as a input is free from (trimmed) the 3' end adapter and R1. My -p is --bc-pattern="^(?PCTACACTCTTTCCCTACACGACGCTCTTCCGATCT)(?P.{12}){e<=2}G{3,5}(?P.{6})$".
With this, I'm getting an empty file. Should I modify anything in the code since I'm dealing with the Nanopore data keeping in mind the un oriented of the strands Retaining the UMI and adding them as a read name is quite confusing in this context, it will be great if you could assist me on this bit.