CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Best way to handle non-UMI labelled reads #616

kvn95ss closed 7 months ago

kvn95ss commented 10 months ago

Hello!

When using the string or regex extraction method, both assume that every read is tagged with a UMI. However, depending on the technology (Smart-seq3 in my case), there are internal reads without any UMIs.

What would be the best way to include these reads in the analysis? Internal reads can make up anywhere from 20% to 60% of the reads, so ignoring them seems... wasteful.

Would the approach below work to incorporate the internal reads?

  1. Gather the filtered reads from regex using the --filtered-out option
  2. Align and deduplicate them using something like Picard MarkDuplicates
  3. Merge the bam files and generate counts

One problem is that the internal reads might not be deduplicated as reliably as the UMI reads, so some genes might have inflated counts in downstream analysis. Apart from this issue, I can't think of any other downside, but any input is greatly appreciated.
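
As an illustration of step 1, here is a minimal Python sketch of how a regex can separate UMI-bearing reads from internal reads. It is not UMI-tools' implementation. The read layout (an `ATTGCGCAATG` tag, an 8-nt UMI, then `GGG`) is an assumption about Smart-seq3 that should be checked against the actual protocol, and `split_read` and `partition` are hypothetical helper names.

```python
import re

# ASSUMPTION: UMI-bearing Smart-seq3 reads start with an 11-nt tag,
# followed by an 8-nt UMI and a GGG spacer. Verify against your protocol.
UMI_READ = re.compile(r"^(?P<tag>ATTGCGCAATG)(?P<umi>[ACGTN]{8})(?P<spacer>GGG)")


def split_read(seq):
    """Return (umi, trimmed_seq) for a UMI read, or (None, seq) for an internal read."""
    match = UMI_READ.match(seq)
    if match is None:
        return None, seq  # internal read: leave the sequence untouched
    return match.group("umi"), seq[match.end():]


def partition(reads):
    """Split (name, seq) pairs into UMI reads and internal reads.

    UMI reads get the UMI appended to the read name and the tag/UMI/spacer
    trimmed off; internal reads are passed through unchanged.
    """
    umi_reads, internal_reads = [], []
    for name, seq in reads:
        umi, trimmed = split_read(seq)
        if umi is None:
            internal_reads.append((name, seq))
        else:
            umi_reads.append((f"{name}_{umi}", trimmed))
    return umi_reads, internal_reads


reads = [
    ("r1", "ATTGCGCAATG" + "ACGTACGT" + "GGG" + "TTTTCCCC"),  # UMI read
    ("r2", "TTTTCCCCAAAAGGGG"),                               # internal read
]
umi_reads, internal_reads = partition(reads)
```

The point of the sketch is that internal reads are kept intact rather than having their first bases trimmed as if they were a UMI.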

IanSudbery commented 10 months ago

Personally, I feel that including reads that are not linked to a UMI defeats the purpose of using a UMI. I don't remember the details for SMART-Seq3, but are UMIs attached before or after fragmentation? If the latter, then mark duplicates isn't going to be very useful.

kvn95ss commented 10 months ago

I believe they are added after fragmentation, followed by amplification.

I would agree with you, but we had a tiny amount of RNA to begin with, so we would not like to lose any information from those reads.

I also had another question: the string method treats the internal reads as UMI reads as well, i.e. it trims the beginning of each read according to the --bc-pattern. While this is 'wrong', we observed higher gene counts with this method because more reads were retained, but can I assume this will not cause a sensible deduplication of these reads?

IanSudbery commented 10 months ago

> but can I assume this will not cause a sensible deduplication of these reads?

No, deduplication here will be entirely random.

If the UMIs are added after fragmentation, then deduplicating on position (such as with picard) will not be entirely random. But I can't speak to what will happen to quantification accuracy if you add two sets of reads, deduplicated in different ways, together.
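
To make the two kinds of deduplication concrete, here is a toy Python sketch contrasting position-only grouping (roughly what a tool like Picard MarkDuplicates keys on) with grouping on position plus UMI. The read records and the `dedup` helper are hypothetical, and UMIs are compared by exact match only; UMI-tools additionally accounts for sequencing errors in UMIs, which is omitted here.

```python
from collections import defaultdict


def dedup(reads, use_umi):
    """Keep the first read from each duplicate group.

    reads: list of dicts with 'chrom', 'pos', 'strand' and 'umi' keys.
    use_umi=False groups on mapping position only;
    use_umi=True also requires an identical UMI (exact match, a simplification).
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if use_umi:
            key += (read["umi"],)
        groups[key].append(read)
    return [group[0] for group in groups.values()]


# Three reads at the same position, derived from two different molecules.
reads = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "umi": "AAAAAAAA"},
    {"chrom": "chr1", "pos": 100, "strand": "+", "umi": "AAAAAAAA"},
    {"chrom": "chr1", "pos": 100, "strand": "+", "umi": "CCCCCCCC"},
]
by_position = dedup(reads, use_umi=False)  # collapses everything at the position
by_umi = dedup(reads, use_umi=True)        # keeps one read per UMI
```

In this toy case the position-only rule keeps one read and the UMI-aware rule keeps two, which is one way the two sets of counts can end up on different scales when merged.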

kvn95ss commented 10 months ago

> No, deduplication here will be entirely random.

Would that necessarily be a bad thing?

Also, I am trying to process the data both ways, and plan to use QualiMap to check for transcript coverage (Smart-seq3 is supposed to have somewhat even coverage of transcripts). If there are any deviations I'll post them here.

kvn95ss commented 10 months ago

One correction, the UMIs were added before fragmentation.

I removed the UMIs from reads containing them, used --filtered-out to obtain the internal reads and finally combined the reads together, effectively removing the UMIs from the reads. The coverage across transcripts is reasonably even.

When looking only at reads with a UMI, there is a strong 5' bias (I was told this is due to the UMIs being at the 5' end of the fragments).