Open mckellardw opened 2 years ago
Hi David,
UMI deduplication is different from position-based deduplication. In the latter, we simply remove identical sequences; in the former, only the UMIs are identical, while the read sequences and coordinates may differ. It would require choosing which read to consider the representative for each UMI, and I am not sure it would be very helpful for understanding coverage.
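To make the "choice of representative" concrete, here is a minimal sketch (not STARsolo's actual algorithm) of UMI-aware deduplication: reads sharing a UMI are grouped, and one representative is kept per group. The tie-breaking rule (highest mapping quality, then leftmost position) is an assumption for illustration; real tools use more sophisticated strategies, such as network-based UMI collapsing.

```python
# Hypothetical sketch of UMI-aware deduplication: group reads by UMI,
# then keep one representative per group.
from collections import defaultdict
from typing import NamedTuple, Iterable, List

class Read(NamedTuple):
    umi: str   # the UMI sequence
    pos: int   # alignment start coordinate
    mapq: int  # mapping quality
    seq: str   # read sequence (may differ between PCR duplicates)

def dedup_by_umi(reads: Iterable[Read]) -> List[Read]:
    """Collapse reads to one representative per UMI.

    Representative choice here is an illustrative assumption:
    highest MAPQ wins, ties broken by leftmost coordinate.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[r.umi].append(r)
    return [max(g, key=lambda r: (r.mapq, -r.pos)) for g in groups.values()]

reads = [
    Read("AACG", 100, 60, "ACGTACGT"),
    Read("AACG", 105, 30, "ACGAACGT"),  # PCR duplicate: same UMI, different seq/pos
    Read("TTGC", 200, 60, "GGTTGGTT"),
]
print(len(dedup_by_umi(reads)))  # → 2 (one representative per distinct UMI)
```

Note that unlike position-based deduplication, the discarded reads are not byte-identical to the kept one, which is exactly why the representative choice is non-trivial.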
My biggest issue is that when working with libraries containing RNAs with a wide range of lengths (total RNA-seq libraries), the relative abundance of short transcripts like miRNAs is distorted by PCR biases. Deduplication is also very important for low-diversity libraries (targeted sequencing being a good example). I currently use `umi_tools dedup` to perform UMI-aware deduplication, but it is a very slow, memory-heavy process.
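For reference, a typical `umi_tools dedup` invocation looks roughly like the following; the file names are placeholders, and the flags are from the umi_tools documentation:

```shell
# UMI-aware BAM deduplication with umi_tools (file names are placeholders).
# Assumes UMIs were moved into the read names upstream (e.g. by
# `umi_tools extract`) and the BAM is coordinate-sorted and indexed.
umi_tools dedup \
    --stdin=aligned.sorted.bam \
    --stdout=deduplicated.bam \
    --extract-umi-method=read_id
```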
I hadn't thought of the need to pick a representative sequence from the PCR duplicates, which does sound non-trivial. I suppose that's why generating count matrices is so much faster than deduplicating .bam files! If there is no significant performance gain in performing bam deduplication alongside generating the count matrix, then I suppose I will just have to stick with `umi_tools`. Thanks for your time!!
Would it be possible to integrate UMI deduplication into .bam deduplication? From my understanding, `--bamRemoveDuplicatesType` currently does not support this, and I have found alternative tools to be very slow and memory-heavy. This would be a huge help for visualizing genome coverage in IGV and for directly comparing with count matrices. It would also save a lot of disk space for the often-bulky .bam files. I would propose adding the option `--bamRemoveDuplicatesType UMI` and requiring `--soloUMIdedup` to be set.

Thanks, David