CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMI_extract for paired end reads of micro_RNA #633

Closed oneito closed 3 months ago

oneito commented 3 months ago

I am new to microRNA analysis. I have Illumina paired-end microRNA-seq data (which I did not generate) that I need to analyze. I understand that with microRNA, extracting UMIs is important because of their short length. I have no other information about these data apart from the conditions under which they were generated and the FASTQ files. Looking at the FASTQ files, I am not sure whether UMIs were added.

Here is an example of my data:

more .R1.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA
NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTAATCTCGTATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA
NTACGTCGAGGATTACCAGCTTGTCAAACTGTAGGCACCATCAATTGCTTGTACTGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCTCTAAGTA

more .R2.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000 2:N:0:CCTCTAAGTA+ACTGTAACGA
ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000 2:N:0:CCTCTAAGTA+ACTGTAACGA
TCAGTACAAGCAATTGATGGTGCCTACAGTTTGACAAGCTGGTAATCCTCGACGTAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAG

The R1.fastq reads contain the 3' adapter sequence (AACTGTAGGCACCATCAAT) and the Illumina universal adapter (AGATCGGAAGAG), separated by 12 random nucleotides. I do not see these sequences in R2.fastq. Based on this, it seems to me that the R1.fastq reads are barcoded and the R2.fastq reads are not. Is this right?
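One way to test whether R2 carries the same elements is to look for their reverse complement, since R2 is sequenced from the opposite strand. Below is a minimal Python sketch using the first read pair shown above. The helper name `reverse_complement` and the variable names are made up for illustration, and only the start of the R2 read is included.

```python
# Reverse-complement check on the first read pair shown above.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


adapter = "AACTGTAGGCACCATCAAT"  # 3' adapter seen in R1
umi = "ATTGGATCGTGT"             # the 12 nt that follow the adapter in R1
# First 60 nt of the matching R2 read
r2 = "ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGC"

# On the opposite strand, adapter + UMI reads as revcomp(UMI) + revcomp(adapter).
expected_prefix = reverse_complement(adapter + umi)
r2_has_umi = r2.startswith(expected_prefix)
print(expected_prefix, r2_has_umi)
```

For this pair the check passes, which suggests that R2 does contain the 12 nt and the adapter, only in reverse-complement orientation at its 5' end.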

I have tried to extract the UMIs from these paired-end reads using umi_tools with this command:

umi_tools extract --extract-method=regex --stdin ./Gff.R1.fastq \
  --bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.+)' \
  --read2-in=./Gff.R2.fastq \
  --stdout ./output/Gff_UMIextracted.R1.fastq \
  --read2-out=./output/Gff_UMIextracted.R2.fastq \
  --log ./output/Gff_UMIextracted.log
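To see what this pattern does to a single read, here is a rough Python sketch applied to the two R1 sequences above. It is not how umi_tools is implemented. It assumes the named groups are `discard_1`, `umi_1` and `discard_2`, and it drops the fuzzy `{s<=2}` part because the standard-library `re` module does not support it, so the adapter must match exactly. The function name `split_read` is made up for illustration.

```python
import re

# Same layout as the --bc-pattern above, minus the fuzzy {s<=2} part
# (assumption: exact adapter match, since stdlib re has no fuzzy matching).
PATTERN = re.compile(
    r".+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>.+)"
)


def split_read(seq):
    """Return (retained_sequence, umi), or None if the adapter is not found."""
    m = PATTERN.match(seq)
    if m is None:
        return None
    # Everything before the adapter is outside any named group and is kept.
    return seq[: m.start("discard_1")], m.group("umi_1")


read1 = ("NATTAGGGGAGATTTCAACTGTAGGCACCATCAATATTGGATCGTGTAGATCGGAAGAGC"
         "ACACGTCTGAACTCCAGTCACCCTCTAAGTAATCTCGTATG")
read2 = ("NTACGTCGAGGATTACCAGCTTGTCAAACTGTAGGCACCATCAATTGCTTGTACTGAAGATCGGAAGAGC"
         "ACACGTCTGAACTCCAGTCACCCTCTAAGTA")

for seq in (read1, read2):
    print(split_read(seq))
```

The retained sequence and the 12 nt UMI can then be compared with the read and the read-name suffix in the extracted FASTQ.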

This command ran successfully, and I got the results below after processing.

more R1.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 1:N:0:CCTCTAAGTA+ACTGTAACGA
NATTAGGGGAGATTTC
+
FFFFFFFFFFFFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000_TGCTTGTACTGA 1:N:0:CCTCTAAGTA+ACTGTAACGA
NTACGTCGAGGATTACCAGCTTGTCA

more R2.fastq
@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 2:N:0:CCTCTAAGTA+ACTGTAACGA
ACACGATCCAATATTGATGGTGCCTACAGTTGAAATCTCCCCTAATAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAGTGTAGATCTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF
@A00124:542:H27NNDSX5:1:1101:20039:1000_TGCTTGTACTGA 2:N:0:CCTCTAAGTA+ACTGTAACGA
TCAGTACAAGCAATTGATGGTGCCTACAGTTTGACAAGCTGGTAATCCTCGACGTAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACTGTAACGAG

The reads in the R1.fastq file were trimmed, but in the R2.fastq file only the read names changed; the sequences are the same as before.

I am wondering whether what I did is correct. I would appreciate it if someone could validate or correct this.

Thank you.

TomSmithCGAT commented 3 months ago

Hi @oneito,

Nothing like a bit of detective work on data with poor metadata!

From what you've said, this sounds like the QIAseq miRNA Library Kit. From your barcode regex, it looks like you've either already picked up on Ian's comment here or come to the same solution independently. For future reference, you're likely to get a quicker response to requests for advice or sanity checking on forums like Biostars, since other UMI-tools users can pick them up there.

Everything looks correct to me. Note that FASTQ field 1 (line 1) for each read now contains '_' followed by the UMI before the space, e.g.

@A00124:542:H27NNDSX5:1:1101:19262:1000 1:N:0:CCTCTAAGTA+ACTGTAACGA
->
@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 1:N:0:CCTCTAAGTA+ACTGTAACGA

Both reads in the pair will have the same UMI sequence added. This keeps the identifier identical for both reads of the pair up to the first space (anything after the space is dropped by the aligner).
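As a quick sanity check of the read names, something like the following Python sketch could be used. It assumes the UMI is whatever follows the last '_' in the identifier before the first space. The function name `split_read_name` is made up for illustration, and the two header lines are taken from the extracted output in the question.

```python
def split_read_name(header):
    """Split a FASTQ header line into (read_id, umi).

    Assumes the header starts with '@' and that the UMI follows the last
    underscore in the identifier, before the first space.
    """
    name = header[1:].split(" ", 1)[0]  # drop '@' and everything after the space
    read_id, umi = name.rsplit("_", 1)
    return read_id, umi


r1_header = "@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 1:N:0:CCTCTAAGTA+ACTGTAACGA"
r2_header = "@A00124:542:H27NNDSX5:1:1101:19262:1000_ATTGGATCGTGT 2:N:0:CCTCTAAGTA+ACTGTAACGA"

# Both mates should give the same identifier and the same UMI.
same_name = split_read_name(r1_header) == split_read_name(r2_header)
print(split_read_name(r1_header), same_name)
```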

TomSmithCGAT commented 3 months ago

Looks like this was picked up on Biostars so closing now