CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMI reads are more than Fastq reads #564

Closed zhangpicb closed 1 year ago

zhangpicb commented 1 year ago

Hi @TomSmithCGAT

Thanks for this great tool!

I downloaded UMI-based DNA sequencing data from GEO.

fastq-dump --split-files SRR11714300

Rejected 41315 READS because READLEN < 1
Read 15047517 spots for SRR11714300
Written 15047517 spots for SRR11714300

This produced 3 files. SRR11714300_3.fastq.gz contains the 9 bp UMI reads.

umi_tools extract --bc-pattern=NNNNNNNNN --stdin=SRR11714300_3.fastq.gz --read2-in=SRR11714300_1.fastq.gz --stdout=SRR11714300_1.umi.fastq.gz --read2-stdout

I then got this message:

Read pairs do not match
SRR11714300.290 != SRR11714300.291

I tried to track down this error and found that SRR11714300_1.fastq.gz doesn't contain read SRR11714300.290.

zcat SRR11714300_1.fastq.gz |grep -w SRR11714300.290

How can I solve this error? Thanks in advance.

TomSmithCGAT commented 1 year ago

UMI-tools needs the paired fastqs to contain the same reads in the same order.
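To check that requirement before running the tool, here is a minimal Python sketch that reports the first position at which two FASTQ files disagree on read ID. It is not part of UMI-tools. The helper names (`read_ids`, `first_mismatch`) are made up for illustration, and it assumes plain 4-line FASTQ records where the read ID is the first whitespace-delimited token of the header line.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID from each record of a 4-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Every fourth line (0, 4, 8, ...) is a header such as "@SRR11714300.290 ..."
            if line_number % 4 == 0 and line.strip():
                yield line[1:].split()[0]


def first_mismatch(path_a, path_b):
    """Return (record_index, id_a, id_b) at the first disagreement, else None.

    A missing record at the end of the shorter file shows up as None.
    """
    pairs = zip_longest(read_ids(path_a), read_ids(path_b))
    for index, (id_a, id_b) in enumerate(pairs):
        if id_a != id_b:
            return index, id_a, id_b
    return None
```

For example, `first_mismatch("SRR11714300_3.fastq.gz", "SRR11714300_1.fastq.gz")` would return `None` only if both files list the same read IDs in the same order.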

From a quick Google, the easiest solution may be to pass your fastqs through trimmomatic first, using options that will retain all reads but remove unpaired reads. See the command at the end of https://stackoverflow.com/questions/13203289/need-script-or-software-to-remove-unpaired-reads-from-paired-end-reads

TomSmithCGAT commented 1 year ago

Just to check though, is SRR11714300_3.fastq.gz only UMIs? What's in SRR11714300_2.fastq.gz?

TomSmithCGAT commented 1 year ago

Checking the SRA entry for this data, it looks like you have read1, read2 and UMIs as three separate fastqs. I assume you already know this, but just in case, you'll need to run umi_tools extract twice, once for each of read1 and read2, to get the UMI sequence into the read name for both prior to alignment.
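As a rough sketch of those two runs, the Python snippet below builds one `umi_tools extract` command per mate, reusing only the options from the command earlier in this thread. The function name `extract_commands` and the `_2.umi.fastq.gz` output name are assumptions for illustration. It also assumes the three input files already contain the same reads in the same order.

```python
def extract_commands(accession, pattern="NNNNNNNNN"):
    """Build one `umi_tools extract` command per mate (read1, then read2).

    Assumes <accession>_3.fastq.gz holds the UMI reads, as in this thread,
    and that all three files have matching reads in matching order.
    """
    umi_fastq = f"{accession}_3.fastq.gz"
    commands = []
    for mate in ("1", "2"):
        commands.append([
            "umi_tools", "extract",
            f"--bc-pattern={pattern}",
            f"--stdin={umi_fastq}",                      # UMI-only reads
            f"--read2-in={accession}_{mate}.fastq.gz",   # mate to be renamed
            f"--stdout={accession}_{mate}.umi.fastq.gz",
            "--read2-stdout",                            # write only the mate
        ])
    return commands


if __name__ == "__main__":
    for command in extract_commands("SRR11714300"):
        # Printed for review; pass each list to subprocess.run to execute it.
        print(" ".join(command))
```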

zhangpicb commented 1 year ago

Hi @TomSmithCGAT

Thanks for your quick reply!

SRR11714300_1.fastq.gz doesn't contain read SRR11714300.290, while SRR11714300_2.fastq.gz and SRR11714300_3.fastq.gz both do.

I used the solution you suggested, https://stackoverflow.com/questions/13203289/need-script-or-software-to-remove-unpaired-reads-from-paired-end-reads, with Trimmomatic to make the 3 files contain the same reads.

But SRR11714300.290 was still present in SRR11714300_3.PE.fastq.gz (UMI.PE.fastq.gz).

To be precise, I used Trimmomatic to remove unpaired reads from SRR11714300_1.fastq.gz and SRR11714300_3.fastq.gz (UMI.fastq.gz).

The seqkit pair function solved this problem.
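For readers who want to see what this kind of pairing step does, here is a minimal Python sketch that keeps only the reads whose ID appears in every input file. It is an illustration of the idea, not seqkit's implementation. The function names are made up, and it assumes 4-line FASTQ records, unique read IDs within each file, the same relative read order across files, and enough memory to hold all read IDs.

```python
import gzip


def _open(path, mode):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def records(path):
    """Yield (read_id, four_lines) for each 4-line FASTQ record."""
    with _open(path, "rt") as handle:
        while True:
            block = [handle.readline() for _ in range(4)]
            if not block[0].strip():  # end of file
                return
            yield block[0][1:].split()[0], block


def keep_shared_reads(inputs, outputs):
    """Copy each input to its output, keeping only IDs found in all inputs.

    Returns the number of reads written to each output file.
    """
    shared = None
    for path in inputs:
        ids = {read_id for read_id, _ in records(path)}
        shared = ids if shared is None else shared & ids
    shared = shared or set()

    kept = 0
    for in_path, out_path in zip(inputs, outputs):
        kept = 0
        with _open(out_path, "wt") as out:
            for read_id, block in records(in_path):
                if read_id in shared:
                    out.writelines(block)
                    kept += 1
    return kept
```

Passing all three files at once (read1, read2 and UMIs) avoids the situation described above, where pairing only two of them left a read in the UMI file that was missing elsewhere.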

Thanks