CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

How can I use umi-tools for UMIs present on read 1 at both the 5' and 3' ends? #606

Closed unique379r closed 9 months ago

unique379r commented 1 year ago

I am currently seeking assistance in accurately identifying genuine PCR duplicates and unique molecular identifiers (UMI) within my RNA sequencing samples. I am aware that the RNAseq data we have acquired includes 5' and 3' UMIs within Read 1 (see the attachment).

I have reviewed your tutorial, where I noticed that the example provided only pertained to the 5' end. However, I am somewhat perplexed as to how to handle both the 5' and 3' ends when using the --bc-pattern.

Could you kindly provide guidance on the appropriate approach to address this issue? Specifically, what should I specify for the --bc-pattern parameter to account for both the 5' and 3' UMI?

Screenshot 2023-10-19 at 12 14 12 PM

IanSudbery commented 1 year ago

I'm a little bit confused by your graphic - you seem to have P5 AND P7 at the same end of the fragment?

If you have 2 UMIs within a single read - one at the 5' end of read1 and one at the 3' end of read1 - then you can extract these using a regex-based bc-pattern.

Say you had 6bp UMIs at each end of the read, then

--extract-method=regex --bc-pattern='^(?P<umi_1>.{6}).+(?P<umi_2>.{6})$'

The ^ and $ anchor the pattern at the beginning and the end of the read sequence respectively. The two named capture groups then capture the UMI at the start and at the end of the read, and the unspecified number of bases in between are retained on the read.
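
To make the behaviour of that pattern concrete, here is a minimal sketch that applies it to a made-up read with Python's standard-library re module. This is only a stand-in for illustration: the function name split_read and the example read are hypothetical, and UMI-tools' own implementation may differ in its details.

```python
import re

# The pattern from the reply above: a 6 bp UMI at each end of the read.
PATTERN = re.compile(r"^(?P<umi_1>.{6}).+(?P<umi_2>.{6})$")


def split_read(seq):
    """Return (umi_1, umi_2, retained_sequence), or None if the read does not match.

    Hypothetical helper for illustration only; it is not part of UMI-tools.
    """
    m = PATTERN.match(seq)
    if m is None:
        # Reads shorter than 13 bases cannot satisfy 6 + at least 1 + 6.
        return None
    # Everything between the two capture groups stays on the read.
    retained = seq[m.end("umi_1"):m.start("umi_2")]
    return m.group("umi_1"), m.group("umi_2"), retained


# Made-up 20-base read: 6 bp UMI, 8 bp insert, 6 bp UMI.
read = "AAACCC" + "GGGGTTTT" + "ACGTAC"
print(split_read(read))
```

Because the pattern is anchored at both ends, the second group always lands on the last 6 bases, however long the middle of the read is.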

unique379r commented 12 months ago

Hi, thank you for your reply. I am still a bit confused about the pattern.

Here are my paired-end reads. Based on these, can you suggest how to write the pattern for paired-end data? I have been told that the UMIs in these sequences are 3 bp long, labelled A and B in the figure.

zcat ../bcm-umi_input/HL7VJDRX3-1-IDUDIN0013_S1_L001_R1_001.fastq.gz | head -4

@A00976:625:HL7VJDRX3:1:2101:29948:1031 1:N:0:TCGAACACGA+NCGAGGTTCT
CNGTCTGATTCAACAAAAAGTGTTTTTCAGAACTGCTCTATCAAAAGAAAGATCCACCTCTGTTAGCTGAGTTCACACATCACAAACAAGTTTATGAGAATGCTTCTGTCTAGTTTTTATTTGAAGATATTTCCTGTCTCACCATAGAGCT
+
:#,FF:FFFFF:FFF,:,FFFFFF,:FFF:FF:FFF:,F:,:,F::,,F:F,::FFFFFF:F,F,FFFFF:F,FFF:FF:FFFFFF:,,FF:::F:F,FFF:,FF:FFF:FFF,F,F::,,F,:F:F:FFFF,,F,,,F:F,F,F:FFFF:

zcat ../bcm-umi_input/HL7VJDRX3-1-IDUDIN0013_S1_L001_R2_001.fastq.gz | head -4

@A00976:625:HL7VJDRX3:1:2101:29948:1031 2:N:0:TCGAACACGA+NCGAGGTTCT
TTAGTCATTCAAGTCACAGAGTTGAACATTCCCTTTCGTACAGCAGTTTTGAAACACTCTTTCTGTAGTAAATTGAAGTGAACATTAGGACAGCTTTCAGCTCTATGGTGAGAAAGGAAATAACTTCAAATAAAAACTAGACAGAAGCATT
+
FF,,,:FFF,FF:F:FFFFFF:F:FF:FFF:F,FF,FFFFFF:FFFFFFFFFF:FF,FFFFFFFFFFFFFF,F,:FFF,,FFFFF:F,:F,F,FF::F:,FFFF:F,:,FFFF:FF,F:F,F,,F::FFF,FFFF:FFF:FFFFFF:,FFF

IanSudbery commented 12 months ago

The pattern I gave you above would take the first 6 bases of read1 and the last 6 bases of read1 and add them to the header of both read1 and read2. If you wanted 3bp sequences, then change the 6s in the pattern to 3s. This is what I think you mean from reading your title?

However, your figure makes it seem like you have a 3nt UMI at the start of read1 and a 3nt UMI at the start of read2. In that case, you would use --bc-pattern2 as well as --bc-pattern:

--bc-pattern=NNN --bc-pattern2=NNN
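
As a rough illustration of what those two patterns ask for, the sketch below trims the first 3 bases from each mate and appends the combined UMI to the read identifier. It assumes that the two UMIs are concatenated in read1-then-read2 order and joined to the identifier with an underscore; treat both as assumptions to check against the extract output rather than a guaranteed format. The function name extract_paired_umis is hypothetical, and the sequences are the reads posted above, truncated to their first 14 bases.

```python
def extract_paired_umis(read_id, seq1, seq2, len1=3, len2=3):
    """Mimic --bc-pattern=NNN --bc-pattern2=NNN on a single read pair.

    Hypothetical helper for illustration only; it is not part of UMI-tools.
    Assumes the UMIs are concatenated (read1 first) and appended to the
    read identifier after an underscore.
    """
    umi = seq1[:len1] + seq2[:len2]
    new_id = f"{read_id}_{umi}"
    # The UMI bases are removed from the start of each mate.
    return new_id, seq1[len1:], seq2[len2:]


# Read pair from the thread, truncated to the first 14 bases of each mate.
new_id, trimmed1, trimmed2 = extract_paired_umis(
    "A00976:625:HL7VJDRX3:1:2101:29948:1031",
    "CNGTCTGATTCAAC",
    "TTAGTCATTCAAGT",
)
print(new_id)
print(trimmed1)
print(trimmed2)
```

In a real FASTQ record the quality string would need the same number of characters trimmed from its start so that it stays aligned with the sequence.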

TomSmithCGAT commented 9 months ago

@unique379r - Closing now due to inactivity