CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
493 stars 190 forks source link

Issue with extraction method, obtaining empty file as an output #610

Closed PavithraV0223 closed 9 months ago

PavithraV0223 commented 1 year ago

Hello, I'm trying to add in the regex method to extract the UMI and add them to the read name before building the consensus. I'm dealing with Nanopore data. My construct is something like this (R1--Index5--12UMI--5Gs--Gene-of-interest--R2--Index7--adapter). I'm trying to retain (12UMI Random bases + 5Gs + Gene of interest+ Index 7 (treating this as a cell barcode) . I also want to allow mismatch of 0.1 or 0.2 in the UMI and the 5 constant Gs which is followed by the UMI . Aiming to achieve - read name (UMI+Gs+cell barcode ----gene of interest )

The trimmed fastq as a input is free from (trimmed) the 3' end adapter and R1. My -p is --bc-pattern="^(?PCTACACTCTTTCCCTACACGACGCTCTTCCGATCT)(?P.{12}){e<=2}G{3,5}(?P.{6})$".

With this, I'm getting an empty file. Should I modify anything in the code since I'm dealing with the Nanopore data keeping in mind the un oriented of the strands Retaining the UMI and adding them as a read name is quite confusing in this context, it will be great if you could assist me on this bit.

IanSudbery commented 1 year ago

Your regex contains the following elements:

^ - Start of the read

(?PCTACACTCTTTCCCTACACGACGCTCTTCCGATCT) - A sequence of bases which you want removing and discarding

(?P.{12}){e<=2} - A twelve nucleotide UMI, which you've labeled umi_n (it should be umi_1 since its the first UMI in the pattern). You've also allowed upto 2 mismatches in the UMI, but since you don't specify any sequence for the UMI, then this isn't going to have any effect.

G{3,5} - between 3 and 5 Gs

(?P.{6}) - A 6 nucleotide cell barcode, which you've labeled cell_2. Note that you should probably be calling this cell_1, since its the first cell barcode in the patter.

$ - The end of the read.

So you specified that your construct should be Index5--12UMI--5Gs--Gene-of-interest-Index7

But your pattern specifies

discard--12UMI--3-5Gs--6CellBC--end

You have a cell barcode sequence that is not in your description of your construct, and no space for the gene of interest. You normally wouldn't need to specify space for the gene of interest, because the barcodes would be at the start of the read, and then everything after would be retained, but you've specifically specified that there isn't allowed to be any sequence after the barcodes (by using $).

Message ID: @.***>

PavithraV0223 commented 1 year ago

Hey Ian, Thanks for your quick response. Is there a way to retain UMI , like grouping them based on the UMI for Nanopore data. using fastq file.

IanSudbery commented 1 year ago

I'm not sure I quite follow. Do you want to discard the index, and move the UMI to the read name and group?

If so, then

umi_tools extract --extract-method=regex
--bc-pattern='(?P<discard_1>CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'
-I reads.fastq

Followed by mapping the reads according to your favourite method and then grouping with

umi_tools group -I reads.bam

should do it.

If you want to leave the UMI on the read then

umi_tools extract --extract-method=regex
--bc-pattern='(?P<discard_1>CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'
-I reads.fastq --read2-in=reads.fastq --read2-out=processed_reads.fastq --S
/dev/null

will do that, but it will also leave the index on the reads as well, which you'd have to remove with some other tool.

kwglam commented 1 year ago

Hi Ian,

I have two questions regarding this command:

umi_tools extract --extract-method=regex
--bc-pattern='(?P<discard_1>CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'
-I reads.fastq
  1. Will the sequences upstream of the discard sequence (CTACACTCTTTCCCTACACGACGCTCTTCCGATCT) also be removed? If not, any way to remove them if they are with unknown length?
  2. Since only one strand is sequenced in most nanopore data, will the reverse complement of the provided sequence be also checked?

Thanks!!

IanSudbery commented 1 year ago

Will the sequences upstream of the discard sequence (CTACACTCTTTCCCTACACGACGCTCTTCCGATCT) also be removed? If not, any way to remove them if they are with unknown length?

No. Although you could make it remove any sequence prior to the discard sequence by including .+ in the it:

'(?P<discard_1>.+CTACACTCTTTCCCTACACGACGCTCTTCCGATCT){e<=2}(?P<umi_1>.{12})(?<discard_2>G{3,4})'

However, UMI-tools isn't really designed as a general purpose adaptor trimmer, and you might find you get more flexibility/sensitivity removing these sequences using a tools designed for this purpose.

Since only one strand is sequenced in most nanopore data, will the reverse complement of the provided sequence be also checked?

No, only the provided sequence is checked, not the reverse complement. If you want to also check for the reverse complement, check for the forward sequence, capturing the reads that don't match into a seperate file. Then process this file using the reverse complement sequence. Take the two resulting processed files and concatenate them together.

kwglam commented 1 year ago

Thank you, Ian!!