Isoquant Run Time, RAM Usage and Logging on Single-cell Long Read Data

GaryWang7 commented 6 months ago

Hi,

Thank you for developing this tool. I have been trying to use IsoQuant with my data, but the run uses a lot of RAM and is taking a long time (it has not finished yet).

Specifically,

My data: single-cell long-read ONT data generated from 10X cDNA, demultiplexed with BLAZE (133 GB FASTQ), then mapped to the mouse genome with Minimap2 and converted to a BAM file of 61.9 GB.
The run has been going for 48 hours with 36 threads. The average RAM usage is 154.4 GB and the average CPU efficiency is 60.85% (based on statistics from the high-performance cluster where I am running the job). It took almost 24 hours to process chr17.

My questions:

Would it be possible to improve the RAM usage and CPU efficiency for single-cell datasets? I know this might be too much to ask, but it would be great if there could be improvements so IsoQuant can be run on my local workstation with less RAM. I read from a previous issue that this may be caused by gene regions having excessive reads, and I think this might be the cause in my case.
Is it possible to get live output in the command line like "xxxx/yyyyy reads processed" for each chromosome? The log has been stuck at these two lines for quite a while, and I have no idea what is happening (for example, whether there is an error in parallelization or just too many reads).
Would you recommend any demultiplexing/barcode extraction tools for sc long-read ONT data from 10X cDNA to use with isoquant? I guess after running IsoQuant I have to write some code assigning the transcripts to each individual cell for now.

My command line for IsoQuant:

isoquant.py --reference {reference fa} --genedb {GENCODE reference gtf} --bam {bam file} --threads 36 --sqanti_output --data_type nanopore -o {output folder}

Thanks again for bringing this tool!

Gary

andrewprzh commented 6 months ago

Dear @GaryWang7

Thanks for the feedback!

> Would it be possible to improve the RAM usage and CPU efficiency for single-cell datasets? I know this might be too much to ask, but it would be great if there could be improvements so IsoQuant can be run on my local workstation with less RAM. I read from a previous issue that this may be caused by gene regions having excessive reads, and I think this might be the cause in my case.

One way to reduce RAM usage is to lower the number of threads, since each thread works independently. CPU efficiency may also depend on disk I/O: reading a BAM file in 36 threads simultaneously might not be very efficient.
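As a minimal sketch of that suggestion, the command from the original post can be assembled with a configurable thread count. The flags are exactly the ones Gary used; the file paths and the value of 8 threads are placeholders chosen for illustration, not a recommended setting.

```python
import shlex


def build_isoquant_command(reference, genedb, bam, output_dir, threads=8):
    """Assemble the IsoQuant command from the original post with a chosen thread count."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "isoquant.py",
        "--reference", reference,
        "--genedb", genedb,
        "--bam", bam,
        "--threads", str(threads),  # fewer threads should mean lower peak RAM
        "--sqanti_output",
        "--data_type", "nanopore",
        "-o", output_dir,
    ]


# Placeholder paths (hypothetical); replace with real files before running.
cmd = build_isoquant_command(
    "genome.fa", "gencode.gtf", "reads.bam", "isoquant_out", threads=8
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`; which thread count fits a given workstation has to be found by experiment.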

Overall you are right about extremely high-coverage genomic regions. Since I'm working on large single-cell datasets, I aim to work on performance improvements at some point.
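One rough way to see where the reads pile up before starting a long run is to rank reference sequences by mapped read count. The sketch below assumes the four-column, tab-separated output of `samtools idxstats` (name, length, mapped, unmapped); that tool is not mentioned in this thread, and the counts in the example are invented for illustration. It only shows which chromosomes are heavy, not which gene regions.

```python
def rank_contigs_by_reads(idxstats_text, top=3):
    """Return (name, mapped_reads, fraction_of_mapped) for the busiest contigs."""
    rows = []
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # summary row for reads with no reference
            continue
        rows.append((name, int(mapped)))
    total = sum(mapped for _, mapped in rows)
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)[:top]
    return [(name, mapped, mapped / total if total else 0.0) for name, mapped in ranked]


# Invented example input in idxstats layout: name, length, mapped, unmapped.
example = "chr1\t1000\t1200\t0\nchr17\t500\t8800\t0\n*\t0\t0\t50\n"

for name, mapped, fraction in rank_contigs_by_reads(example):
    print(f"{name}\t{mapped}\t{fraction:.1%}")
```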

> Is it possible to get a live output in the command line like "xxxx/yyyyy reads processed" for each chromosome? The log has been stuck at these two lines for quite a while. I have no idea what is happening (like if there is an error in parallelization or just too many reads)

That sounds like a cool idea for long-processing chromosomes. Either I will do that for the next release, or I will allow different threads to process different parts of the same chromosome (which is a bit tricky architecture-wise).
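The kind of progress message requested here could look roughly like the following. This is a generic sketch built on Python's standard `logging` module, not IsoQuant's actual implementation; the function name, its parameters, and the reporting interval are all hypothetical.

```python
import logging

logger = logging.getLogger("progress_sketch")


def process_with_progress(reads, total, chrom, handle_read, report_every=100000):
    """Process reads one by one and log 'chrom: done/total reads processed' periodically."""
    processed = 0
    for read in reads:
        handle_read(read)
        processed += 1
        # Report at a fixed interval and once more when the last read is done.
        if processed % report_every == 0 or processed == total:
            logger.info("%s: %d/%d reads processed", chrom, processed, total)
    return processed


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
# Stand-in workload: 250 dummy reads and a handler that does nothing.
process_with_progress(range(250), 250, "chr17", lambda read: None, report_every=100)
```

Reporting per interval instead of per read keeps the logging overhead negligible even for chromosomes with millions of reads.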

> Would you recommend any demultiplexing/barcode extraction tools for sc long-read ONT data from 10X cDNA to use with isoquant? I guess after running IsoQuant I have to write some code assigning the transcripts to each individual cell for now

I'm using something of my own, which is quite naive, but based on simulated data yields high (99%+) precision and okay recall. It's available for beta-testing if you'd like to try it out, but it's not published or released yet.
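For the "write some code assigning the transcripts to each individual cell" step, a minimal sketch might look like this. It rests on two assumptions that are not confirmed in this thread: that each read ID starts with `<barcode>_<UMI>#` (as a demultiplexer may write it), and that a tab-separated table with `read_id` and `isoform_id` columns is available. The barcodes, UMIs, and isoform names in the example are invented.

```python
import csv
import io
from collections import Counter


def parse_barcode_umi(read_id):
    """Split '<barcode>_<UMI>#<rest>' into (barcode, umi); return None if the ID does not fit."""
    prefix, sep, _rest = read_id.partition("#")
    if not sep or "_" not in prefix:
        return None
    barcode, umi = prefix.split("_", 1)
    return barcode, umi


def count_isoforms_per_cell(table_text):
    """Count unique UMIs per (barcode, isoform) from a tab-separated table with a header."""
    molecules = set()
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        parsed = parse_barcode_umi(row["read_id"])
        if parsed is None:
            continue  # read ID carries no barcode/UMI prefix
        barcode, umi = parsed
        molecules.add((barcode, umi, row["isoform_id"]))  # collapses duplicate reads per UMI
    return Counter((barcode, isoform) for barcode, _umi, isoform in molecules)


# Invented example table (hypothetical layout and identifiers).
example_table = (
    "read_id\tisoform_id\n"
    "AAACCCAAGAAACACT_GCTTAGCGTACG#read1_+\ttx1\n"
    "AAACCCAAGAAACACT_GCTTAGCGTACG#read2_+\ttx1\n"
    "AAACCCAAGAAACACT_TTTTAGCGTACG#read3_-\ttx1\n"
    "AAACCCAAGAAACCAT_GGGTAGCGTACG#read4_+\ttx2\n"
)

for (barcode, isoform), n in sorted(count_isoforms_per_cell(example_table).items()):
    print(barcode, isoform, n)
```

Counting unique UMIs instead of raw reads avoids inflating counts with amplification duplicates; whether exact-match UMI collapsing is adequate for noisy ONT reads is a separate question that this sketch does not address.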

Best,
Andrey

andrewprzh commented 2 months ago

IsoQuant 3.4 has been released. It has better performance, especially for long-processing chromosomes like the one in the figure above. On my test dataset, the running time decreased from 3 days to 3-4 hours. Closing this issue for now.

ablab / IsoQuant

Isoquant Run Time, RAM Usage and Logging on Single-cell Long Read Data #131