Closed by kate-simonova 8 months ago
UMIs are normally used in situations where a large amount of duplication is expected, so for the applications in which we typically see UMI-tools used, losing 3/4 of the reads is perfectly normal. However, one wouldn't normally expect a 75% duplication rate in bulk RNA-seq, which is why people don't often include UMIs in bulk RNA-seq. That said, is there some reason to expect a high duplication rate, and is that why UMIs were included in the experiment in the first place? The most common causes are a low amount of input RNA or poor-quality RNA, but a high rate can also arise in targeted RNA-seq, or when the number of reads sequenced is very high relative to the size of the transcriptome being sequenced.
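To put numbers on that (the figures here are illustrative, not from your data): the duplication rate is 1 - (unique reads / total reads), so if 40 million aligned reads deduplicate down to 10 million unique reads, the duplication rate is 1 - 10/40 = 75%, i.e. each unique fragment was sequenced four times on average.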
Just to add to the above: you can run with --ignore-umi to treat your data as if it didn't include UMIs and deduplicate by position only (you can also drop --output-stats). By comparing the number of reads output with and without UMIs, you can then determine how many extra reads are retained by using the UMIs. I find this helpful for getting a handle on how much benefit there was in including UMIs.
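A minimal sketch of that comparison (file names are placeholders, and samtools is assumed to be available):

umi_tools dedup -I my.bam -S dedup_umi.bam --paired -L umi.log
umi_tools dedup -I my.bam -S dedup_pos.bam --paired --ignore-umi -L pos.log
samtools view -c -f 64 dedup_umi.bam   # counts first-in-pair reads kept using UMIs
samtools view -c -f 64 dedup_pos.bam   # counts first-in-pair reads kept by position alone

The difference between the two counts is the number of fragments that would have been collapsed as duplicates if the UMIs were ignored.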
Closing due to inactivity
I am relatively new to the UMI field and I have UMI-containing data.
If I extract the UMI sequence from the read sequence, add it to the read names, align the reads, and perform deduplication, only 1/4 of the input file remains. I checked the discussions here, and I do see both reads of each pair in my deduplicated BAM file. The input BAM for deduplication is sorted and indexed. My main question: is it normal that umi_tools reduces the file size by 3/4 of the input file?
The command I run:
umi_tools dedup -I my.bam -S dedup.bam --paired --output-stats stats -L logfile.log
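For context, here is a minimal sketch of the steps I run before the dedup command above (the UMI pattern and file names are illustrative; this assumes an 8 nt UMI at the 5' end of read 1, which is specific to the library design):

umi_tools extract --bc-pattern=NNNNNNNN -I reads_1.fastq.gz --read2-in=reads_2.fastq.gz -S extracted_1.fastq.gz --read2-out=extracted_2.fastq.gz
# align the extracted reads with an aligner of choice to produce aligned.bam, then:
samtools sort -o my.bam aligned.bam
samtools index my.bam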