CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

[QUESTION] Recommended RNAseq total experiment UMI read coverage #641

Closed MrKevinDC closed 4 months ago

MrKevinDC commented 5 months ago

The dedup logfile indicates:

INFO Number of reads out: 7235130

However, Samtools flagstat reports that:

13931115 + 0 in total (QC-passed reads + QC-failed reads)

So there seems to be a discrepancy of roughly 540k reads between the two (2 × 7,235,130 = 14,470,260 versus 13,931,115), assuming that UMI-tools is reporting paired-end read fragments. What is the explanation for this discrepancy? We couldn't figure it out.

In addition, we have observed that the mean number of unique UMIs per position normally ranges between 1 and 2 for our samples. Is that the usual range?

Thank you in advance

IanSudbery commented 5 months ago

That's odd. In paired mode, this is actually the number of read1s that are output. Is it possible you have some unpaired reads (and have set unpaired to use), or reads where the pairs can't be found? (The log would let you know if this were the case.)
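As a quick sanity check on the numbers quoted above, here is a minimal sketch of the arithmetic. It assumes the flagstat total contains only primary read1 and read2 records (no secondary or supplementary alignments), which the thread does not confirm; the variable names are purely illustrative.

```python
# Figures quoted in the question above.
reads_out_read1 = 7_235_130   # "Number of reads out" from the dedup log (read1s in paired mode)
flagstat_total = 13_931_115   # total records reported by samtools flagstat

# If every output read1 had a mate, the BAM would hold twice as many records.
expected_total_if_all_paired = 2 * reads_out_read1

# Shortfall relative to a fully paired output.
missing_mates = expected_total_if_all_paired - flagstat_total

# Assumption: every record that is not a read1 is a read2.
implied_read2 = flagstat_total - reads_out_read1

print(expected_total_if_all_paired)  # 14470260
print(missing_mates)                 # 539145
print(implied_read2)                 # 6695985
```

Under that assumption, the gap corresponds to read1s that were output without a matching read2.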

In terms of UMIs per position, it is entirely experiment dependent. There isn't really a "usual" range.

MrKevinDC commented 5 months ago

That was indeed the case: there were more read1s than read2s. Thank you.

For the second question, the experiment is total RNAseq. A mean number of around 1 UMI/position would suggest the read coverage isn't high enough to provide any benefits compared to not using UMIs, correct? From what I have been reading, >10 UMI/position is desirable?

IanSudbery commented 5 months ago

I wouldn't say that it's really about read coverage.

It depends on the number of reads per position: if you have a large number of reads at a position, but a small number of UMIs, deduplication with UMIs is similar to deduplication without them. But I'm pretty sure that no one would recommend RNA-seq without UMIs.

If deduplication is not reducing the number of reads by very much, but you also have few UMIs per position, then you have low levels of PCR duplication, and deduplication was probably not necessary.

One important thing to bear in mind is that it's not really the "average" gene you want to be worried about in RNAseq, but rather the most highly expressed ones. As expression levels are generally log-normally distributed, the most highly expressed genes will have orders of magnitude more reads than the "average". These are the genes where you will see the most benefit from UMIs, as many reads will look like PCR duplicates just by chance.
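To illustrate the "duplicates by chance" point, here is a rough occupancy calculation. It is not how UMI-tools works internally. It assumes, purely for illustration, that independent molecules pick a start position uniformly among 1000 possible positions in a gene and a UMI uniformly among 4^8 possible 8-mers, with no PCR duplicates and no sequencing errors. All names and parameter values are hypothetical.

```python
import math


def expected_distinct(n_molecules: int, n_bins: int) -> float:
    """Expected number of occupied bins when n_molecules fall uniformly into n_bins.

    Equals n_bins * (1 - (1 - 1/n_bins) ** n_molecules), written with
    log1p/expm1 to stay accurate when n_bins is very large.
    """
    return -n_bins * math.expm1(n_molecules * math.log1p(-1.0 / n_bins))


N_POSITIONS = 1000   # hypothetical number of possible start positions in a gene
N_UMIS = 4 ** 8      # hypothetical 8 nt UMI, 65536 possible sequences

# An "average" gene versus a very highly expressed gene (hypothetical counts).
for n_molecules in (100, 100_000):
    by_position = expected_distinct(n_molecules, N_POSITIONS)
    by_position_and_umi = expected_distinct(n_molecules, N_POSITIONS * N_UMIS)
    print(
        f"{n_molecules:>7} molecules: "
        f"{by_position:9.1f} kept by position only, "
        f"{by_position_and_umi:9.1f} kept by position + UMI"
    )
```

Under these assumptions, position-only deduplication keeps almost every molecule for the low-count gene, but saturates at about 1000 for the highly expressed gene, discarding nearly all of its genuinely independent molecules. Adding the UMI to the key keeps almost all of them.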