Closed unique379r closed 9 months ago
I'm a little bit confused by your graphic - you seem to have P5 AND P7 at the same end of the fragment?
If you have 2 UMIs within a single read - one at the 5' end of read1 and one at the 3' end of read1 - then you can extract these using a regex-based --bc-pattern.
Say you had 6bp UMIs at each end of the read, then
--extract-method=regex --bc-pattern='^(?P<umi_1>.{6}).+(?P<umi_2>.{6})$'
The ^ and $ anchor the pattern at the beginning and the end of the read sequence respectively. The two named capture groups then capture the UMI at the start and at the end of the read, with an unspecified number of bases in between that are retained on the read.
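To see what the pattern above matches, here is a minimal sketch using Python's standard re module. It only illustrates the regex semantics on a made-up read; it is not how umi_tools itself is implemented, and the helper name split_read is hypothetical.

```python
import re

# Same pattern as passed to --bc-pattern above (6 bp UMI at each end).
PATTERN = re.compile(r"^(?P<umi_1>.{6}).+(?P<umi_2>.{6})$")


def split_read(seq):
    """Return (umi_1, umi_2, remaining_sequence), or None if the read does not match."""
    match = PATTERN.match(seq)
    if match is None:
        # The read is too short: it needs 6 + at least 1 + 6 bases.
        return None
    umi_1 = match.group("umi_1")
    umi_2 = match.group("umi_2")
    # Everything between the two capture groups stays on the read.
    remaining = seq[match.end("umi_1"):match.start("umi_2")]
    return umi_1, umi_2, remaining


# Made-up 20 bp read, for illustration only.
print(split_read("AAAAAACCCCGGGGTTTTTT"))
# → ('AAAAAA', 'TTTTTT', 'CCCCGGGG')
```

Because the middle part is .+ rather than .*, a read of 12 bases or fewer does not match at all, so it is worth checking how many reads fail the pattern.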
Hi, thank you for your reply. I am still a bit confused about the pattern.
Here are my paired-end reads. Based on these, can you suggest how to write the pattern for paired-end data? I have been told that the UMIs in these sequences (A and B in the figure) are each 3 bp long.
zcat ../bcm-umi_input/HL7VJDRX3-1-IDUDIN0013_S1_L001_R1_001.fastq.gz | head -4
@A00976:625:HL7VJDRX3:1:2101:29948:1031 1:N:0:TCGAACACGA+NCGAGGTTCT
CNGTCTGATTCAACAAAAAGTGTTTTTCAGAACTGCTCTATCAAAAGAAAGATCCACCTCTGTTAGCTGAGTTCACACATCACAAACAAGTTTATGAGAATGCTTCTGTCTAGTTTTTATTTGAAGATATTTCCTGTCTCACCATAGAGCT
+
:#,FF:FFFFF:FFF,:,FFFFFF,:FFF:FF:FFF:,F:,:,F::,,F:F,::FFFFFF:F,F,FFFFF:F,FFF:FF:FFFFFF:,,FF:::F:F,FFF:,FF:FFF:FFF,F,F::,,F,:F:F:FFFF,,F,,,F:F,F,F:FFFF:
zcat ../bcm-umi_input/HL7VJDRX3-1-IDUDIN0013_S1_L001_R2_001.fastq.gz | head -4
@A00976:625:HL7VJDRX3:1:2101:29948:1031 2:N:0:TCGAACACGA+NCGAGGTTCT
TTAGTCATTCAAGTCACAGAGTTGAACATTCCCTTTCGTACAGCAGTTTTGAAACACTCTTTCTGTAGTAAATTGAAGTGAACATTAGGACAGCTTTCAGCTCTATGGTGAGAAAGGAAATAACTTCAAATAAAAACTAGACAGAAGCATT
+
FF,,,:FFF,FF:F:FFFFFF:F:FF:FFF:F,FF,FFFFFF:FFFFFFFFFF:FF,FFFFFFFFFFFFFF,F,:FFF,,FFFFF:F,:F,F,FF::F:,FFFF:F,:,FFFF:FF,F:F,F,,F::FFF,FFFF:FFF:FFFFFF:,FFF
The pattern I gave you above would take the first 6 bases of read1 and the last 6 bases of read1 and add them to the header of read1 and read2. If you wanted 3bp sequences, then change the 6s in the pattern to 3s. This is what I think you mean from reading your title?
However, your figure makes it seem like you have a 3nt UMI at the start of read1 and a 3nt UMI at the start of read2. In that case, you would use --bc-pattern2 as well as --bc-pattern:
--bc-pattern=NNN --bc-pattern2=NNN
@unique379r - Closing now due to inactivity
I am currently seeking assistance in accurately identifying genuine PCR duplicates and unique molecular identifiers (UMI) within my RNA sequencing samples. I am aware that the RNAseq data we have acquired includes 5' and 3' UMIs within Read 1 (see the attachment).
I have reviewed your tutorial, where I noticed that the example provided only pertained to the 5' end. However, I am somewhat perplexed as to how to handle both the 5' and 3' ends when using the --bc-pattern.
Could you kindly provide guidance on the appropriate approach to address this issue? Specifically, what should I specify for the --bc-pattern parameter to account for both the 5' and 3' UMI?