CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Settings for the analysis of bulk-RNAseq with 11 bp UMI #505

Closed kate-simonova closed 8 months ago

kate-simonova commented 2 years ago

I am relatively new to the UMI field, and I have bulk RNA-seq data with 11 bp UMIs.

If I extract the UMI sequence from the read sequence and add it to the read names, align the reads, and perform deduplication, I am left with only 1/4 of the input file. I checked the discussions here, and I see both reads in my deduplicated BAM file. The input BAM for deduplication is sorted and indexed. My main question: is it normal for umi_tools to reduce the file size by 3/4?

The command I run: umi_tools dedup -I my.bam -S dedup.bam --paired --output-stats stats -L logfile.log

IanSudbery commented 2 years ago

UMIs are normally used in situations where a large amount of duplication is expected. So in the applications where we normally see UMI-tools used, losing 3/4 of the reads is perfectly normal. However, one wouldn't normally expect a duplication rate of 75% in bulk RNAseq (which is why people don't often include UMIs in bulk RNAseq). That said, is there some reason to expect a high duplication rate, which is why UMIs were included in the experiment in the first place? The most common possibilities would be low amounts of input RNA or poor-quality RNA. But it might also be the case if this is some sort of targeted RNAseq, or if the number of reads sequenced is very high compared to the size of the transcriptome being sequenced.

TomSmithCGAT commented 2 years ago

Just to add to the above, you can run with --ignore-umi to treat your data as if it didn't include UMIs and deduplicate by position only (you can also drop --output-stats). By comparing the number of reads output with and without UMIs, you can then determine how many extra reads are retained thanks to the UMIs. I find this helpful for getting a handle on how much benefit there was in including UMIs.
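
As a rough sketch of that comparison (this is not part of UMI-tools; the function name `summarize_dedup` and the read counts below are hypothetical), the three counts can be turned into duplication rates like this. The counts are assumed to have been obtained separately, for example with `samtools view -c` on the input BAM and on each deduplicated BAM.

```python
def summarize_dedup(n_input, n_dedup_umi, n_dedup_position_only):
    """Compare read counts from dedup runs with and without UMIs.

    n_input: reads in the BAM given to dedup
    n_dedup_umi: reads kept by the normal (UMI-aware) run
    n_dedup_position_only: reads kept by the run with --ignore-umi
    """
    if n_input <= 0:
        raise ValueError("n_input must be positive")
    return {
        # Fraction of input reads removed as duplicates in each run.
        "duplication_rate_umi": 1 - n_dedup_umi / n_input,
        "duplication_rate_position_only": 1 - n_dedup_position_only / n_input,
        # Reads that position-only deduplication would have discarded
        # but that the UMIs show to be distinct molecules.
        "extra_reads_retained_by_umis": n_dedup_umi - n_dedup_position_only,
    }


# Hypothetical counts, for illustration only.
summary = summarize_dedup(
    n_input=1_000_000,
    n_dedup_umi=250_000,
    n_dedup_position_only=200_000,
)
for name, value in summary.items():
    print(f"{name}: {value}")
```

If the two duplication rates are close, the UMIs added little over position-only deduplication; a large gap suggests they rescued many reads that would otherwise have been discarded.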

TomSmithCGAT commented 8 months ago

Closing due to inactivity