Hi @oneito,
Nothing like a bit of detective work on data with poor metadata!
From what you've said, this sounds like the QIAseq miRNA Library Kit, and from your barcode regex it looks like you've either already picked up on Ian's comment here or arrived at the same solution independently. For future reference, for advice or sanity checking you're likely to get a quicker response on forums like Biostars, since other UMI-tools users there can pick it up.
Everything looks correct to me. Note that fastq field 1 (line 1) for each read now contains '_' followed by the UMI sequence, appended to the read identifier:
@A00124:542:H27NNDSX5:1:1101:19262:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA -> @A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 1:N:0:CCTCTAAGTA+ACTGTAACGA
Both reads in the pair will have the same UMI sequence added. This keeps the read identifier identical for both reads in the pair up to the first space (anything after the space is dropped by the aligner).
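To check this on the extracted files, a minimal Python sketch can split each header at the first space and pull the UMI off the end of the identifier. The helper names `split_header` and `umi_from_header` are hypothetical (they are not part of UMI-tools), and the sketch assumes the `_` separator seen in the example above:

```python
def split_header(header):
    """Split a FASTQ header line into (read_id, comment) at the first space."""
    read_id, _, comment = header.lstrip("@").partition(" ")
    return read_id, comment


def umi_from_header(header):
    """Return the UMI appended after the last '_' in the read identifier."""
    read_id, _ = split_header(header)
    return read_id.rsplit("_", 1)[1]


# Headers copied from the extracted R1 and R2 files in this thread
r1 = "@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 1:N:0:CCTCTAAGTA+ACTGTAACGA"
r2 = "@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 2:N:0:CCTCTAAGTA+ACTGTAACGA"

# The identifiers (everything before the first space) should match across the pair
same_id = split_header(r1)[0] == split_header(r2)[0]
print(same_id, umi_from_header(r1), umi_from_header(r2))
```

Only the part after the space differs between the two mates (`1:N:0:...` versus `2:N:0:...`), so the pair still shares one identifier.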
Looks like this was picked up on Biostars so closing now
I am new to microRNA analysis. I have Illumina paired-end microRNA-seq data (which I did not generate) that I need to analyze. I understand that with microRNA, extracting UMIs is important because the sequences are so short. I have no information about these data other than the conditions under which they were generated and the fastq files. Looking at the fastq files, I am not sure whether UMIs were added.
Here is an example of my data:
more .R1.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA
NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTAATCTCGTATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA
NTACGTCGAGGATTACCAGCTTGTCAAACTGTAGGCACCATCAATTGCTTGTACTGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTA
more .R2.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000 2:N:0:CCTCTAAGTA+ACTGTAACGA
ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000 2:N:0:CCTCTAAGTA+ACTGTAACGA
TCAGTACAAGCAATTGATGGTGCCTACAGTTTGACAAGCTGGTAATCCTCGACGTAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAG
In R1.fastq, each read contains the 3' adapter sequence (AACTGTAGGCACCATCAAT) and the Illumina Universal adapter (AGATCGGAAGAG), separated by 12 random nucleotides. R2.fastq does not appear to contain these sequences. Based on this, it seems to me that the R1.fastq reads are barcoded and the R2.fastq reads are not. Is this right?
I have tried to extract the UMIs from these paired-end reads using UMI-tools with this command:
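One way to test this is to look for the same elements in R2 in reverse-complement orientation. When the insert is shorter than the read length, R2 would be expected to read through the same molecule from the other end. The following is a minimal Python sketch using the leading bases of the first record above (variable names are illustrative, and the 12-nt UMI length is taken from the observation in the paragraph above):

```python
# Translation table for complementing DNA bases (N stays N)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


ADAPTER_3P = "AACTGTAGGCACCATCAAT"  # 3' adapter quoted in the post
UMI_LEN = 12                        # assumed UMI length (12 random nucleotides)

# Leading bases of the first R1/R2 record shown above (truncated for brevity)
r1 = "NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGTAGATCGGAAGAG"
r2 = "ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAG"

# Locate the adapter in R1 and take the 12 bases that follow it as the UMI
start = r1.find(ADAPTER_3P)
umi = r1[start + len(ADAPTER_3P): start + len(ADAPTER_3P) + UMI_LEN]

# If R2 reads the same molecule from the other end, it should begin with
# the reverse complement of (adapter + UMI)
expected_r2_prefix = revcomp(ADAPTER_3P + umi)
r2_has_umi = r2.startswith(expected_r2_prefix)
print(umi, expected_r2_prefix, r2_has_umi)
```

For this record the check passes, which suggests that R2 carries the same adapter and UMI in reverse-complement orientation rather than lacking them.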
umi_tools extract --extract-method=regex --stdin ./Gff.R1.fastq \
--bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.+)' \
--read2-in=./Gff.R2.fastq \
--stdout ./output/Gff_UMIextracted.R1.fastq --read2-out=./output/Gff_UMIextracted.R2.fastq \
--log ./output/Gff_UMIextracted.log
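To see what this pattern does to a single read, here is a rough Python approximation using the standard-library `re` module. It is only a sketch, not how UMI-tools is implemented. The group names follow the UMI-tools convention (`umi_N`, `discard_N`), the `extract` helper is hypothetical, and the fuzzy-matching term `{s<=2}` is dropped because it is syntax of the third-party `regex` module, so only exact adapter matches are handled:

```python
import re

# Exact-match approximation of the --bc-pattern above; the {s<=2} fuzzy term
# (up to two substitutions in the adapter) is intentionally omitted.
PATTERN = re.compile(
    r".+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>.+)"
)


def extract(read_id, seq):
    """Return (new_read_id, trimmed_seq), or None if the pattern does not match."""
    m = PATTERN.match(seq)
    if m is None:
        return None
    umi = m.group("umi_1")
    # Bases outside the named groups are kept: here, everything before the adapter
    kept = seq[:m.start("discard_1")]
    return f"{read_id}_{umi}", kept


# First R1 record from the input file
r1 = ("NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGT"
      "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTAATCTCGTATG")
result = extract("A00124:542:H27NNDSX5:1:1101:19262:1000", r1)
print(result)
```

Applied to the first R1 record, the sketch keeps the 16 bases before the adapter and appends the 12-nt UMI to the read identifier. A real run would also trim the quality string to the same length, which is not modelled here.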
This command ran successfully, and I got the results below.
After processing:
more R1.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 1:N:0:CCTCTAAGTA+ACTGTAACGA
NATTAGGGGAGATTTC
+
FFFFFFFFFFFFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000_TGCTTGTACTGA 1:N:0:CCTCTAAGTA+ACTGTAACGA
NTACGTCGAGGATTACCAGCTTGTCA
more R2.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 2:N:0:CCTCTAAGTA+ACTGTAACGA
ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000_TGCTTGTACTGA 2:N:0:CCTCTAAGTA+ACTGTAACGA
TCAGTACAAGCAATTGATGGTGCCTACAGTTTGACAAGCTGGTAATCCTCGACGTAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAG
While there are some changes in the R1.fastq file, the sequences in the R2.fastq file are unchanged.
I am wondering whether what I did is correct. I would appreciate it if someone could validate this or correct me.
Thank you.