Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Not sure if this is covered by #5 already, but when using paired reads aligned to the human transcriptome, it seems like UMICollapse is 20x slower when compared to umi-tools. UMICollapse takes between 9-10 hours for the BAM files we are considering, whereas umi-tools takes ~30 minutes. The slowness is present in both two-pass and single pass modes.
I have not gone through how UMICollapse works, so I do not have an opinion on whether this is expected or not. If it is expected, some commentary on this in the README would be appreciated.
I have made some test data available in Google Drive. You will notice that the BAM file has 44319354 read pairs with 8 bp UMIs.
Thank you for continuing to follow up on your work from a long time ago.
Hi!
I am trying to make UMICollapse the default tool in one of the popular RNAseq analysis pipelines -- https://github.com/nf-core/rnaseq/issues/1087.
Not sure if this is covered by #5 already, but when using paired reads aligned to the human transcriptome, it seems like UMICollapse is 20x slower when compared to umi-tools. UMICollapse takes between 9-10 hours for the BAM files we are considering, whereas umi-tools takes ~30 minutes. The slowness is present in both two-pass and single pass modes.
I have not gone through how UMICollapse works, so I do not have an opinion on whether this is expected or not. If it is expected, some commentary on this in the README would be appreciated.
I have made some test data available in Google Drive. You will notice that the BAM file has 44319354 read pairs with 8 bp UMIs.
Thank you for continuing to follow up on your work from a long time ago.