Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License

Very slow paired reads mode for transcriptome #31

Open siddharthab opened 2 months ago

siddharthab commented 2 months ago

Hi!

I am trying to make UMICollapse the default tool in nf-core/rnaseq, one of the popular RNA-seq analysis pipelines: https://github.com/nf-core/rnaseq/issues/1087.

I am not sure whether this is already covered by #5, but with paired reads aligned to the human transcriptome, UMICollapse appears to be roughly 20x slower than umi-tools. UMICollapse takes 9 to 10 hours for the BAM files we are considering, whereas umi-tools takes ~30 minutes. The slowdown is present in both two-pass and single-pass modes.

I have not gone through how UMICollapse works, so I do not have an opinion on whether this is expected or not. If it is expected, some commentary on this in the README would be appreciated.

I have made some test data available on Google Drive. The BAM file has 44,319,354 read pairs with 8 bp UMIs.
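For a rough sense of scale, the numbers above can be turned into a slowdown factor and a per-second throughput. This is only a back-of-the-envelope sketch: the `throughput` helper and the constant names are illustrative, the umi-tools runtime is taken as exactly 30 minutes even though it was only measured approximately, and the whole 9 to 10 hour range is applied to the single test BAM.

```python
def throughput(read_pairs: int, hours: float) -> float:
    """Read pairs processed per second for a run of the given duration."""
    return read_pairs / (hours * 3600)


READ_PAIRS = 44_319_354          # read pairs in the test BAM
UMI_TOOLS_HOURS = 0.5            # ~30 minutes (approximate, assumed exact here)
UMICOLLAPSE_HOURS = (9.0, 10.0)  # reported range for UMICollapse

# Slowdown of UMICollapse relative to umi-tools at both ends of the range.
slowdown = tuple(h / UMI_TOOLS_HOURS for h in UMICOLLAPSE_HOURS)
print(f"slowdown: {slowdown[0]:.0f}x to {slowdown[1]:.0f}x")

print(f"umi-tools:   {throughput(READ_PAIRS, UMI_TOOLS_HOURS):,.0f} pairs/s")
for h in UMICOLLAPSE_HOURS:
    print(f"UMICollapse: {throughput(READ_PAIRS, h):,.0f} pairs/s at {h:g} h")
```

Under these assumptions the slowdown comes out at 18x to 20x, which is where the "20x" figure comes from.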

Thank you for continuing to follow up on your work from a long time ago.

siddharthab commented 2 months ago

Profiling suggests that 98% of CPU time is spent in write.

[Image: Screenshot 2024-09-03 at 9.44.06 PM]