ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)

Isoquant Run Time, RAM Usage and Logging on Single-cell Long Read Data #131

Closed GaryWang7 closed 2 months ago

GaryWang7 commented 6 months ago

Hi,

Thank you for developing this tool. I have been trying to run IsoQuant on my data, but the run uses a lot of RAM and takes a long time (it has not finished yet).

Specifically,

My questions:

  1. Would it be possible to improve the RAM usage and CPU efficiency for single-cell datasets? I know this might be too much to ask, but it would be great if there could be improvements so IsoQuant can be run on my local workstation with less RAM. I read from a previous issue that this may be caused by gene regions having excessive reads, and I think this might be the cause in my case.
  2. Is it possible to get a live output in the command line like "xxxx/yyyyy reads processed" for each chromosome? The log has been stuck at these two lines for quite a while. I have no idea what is happening (like if there is an error in parallelization or just too many reads). [image]
  3. Would you recommend any demultiplexing/barcode extraction tools for sc long-read ONT data from 10X cDNA to use with isoquant? I guess after running IsoQuant I have to write some code assigning the transcripts to each individual cell for now.

My command line for IsoQuant:

isoquant.py --reference {reference fa} --genedb {GENCODE reference gtf} --bam {bam file} --threads 36 --sqanti_output --data_type nanopore -o {output folder}

Thanks again for bringing this tool!

Gary

andrewprzh commented 6 months ago

Dear @GaryWang7

Thanks for the feedback!

> Would it be possible to improve the RAM usage and CPU efficiency for single-cell datasets? I know this might be too much to ask, but it would be great if there could be improvements so IsoQuant can be run on my local workstation with less RAM. I read from a previous issue that this may be caused by gene regions having excessive reads, and I think this might be the cause in my case.

One way to reduce RAM usage is to lower the number of threads, since each thread works independently. CPU efficiency may also depend on disk IO: reading a BAM file in 36 threads simultaneously may not be very efficient.
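As an illustration of the suggestion above, here is a minimal sketch that rebuilds the command from the original post with a lower thread count. All flags are taken from that command; the file names and the value of 8 threads are placeholders, not recommendations from this thread.

```python
import shlex


def build_isoquant_command(reference, genedb, bam, output_dir, threads=8):
    """Assemble the command from the original post with a chosen thread count."""
    return [
        "isoquant.py",
        "--reference", reference,
        "--genedb", genedb,
        "--bam", bam,
        "--threads", str(threads),  # fewer threads should mean less RAM in use at once
        "--sqanti_output",
        "--data_type", "nanopore",
        "-o", output_dir,
    ]


# Placeholder paths; substitute your own files.
cmd = build_isoquant_command(
    "reference.fa", "gencode.gtf", "reads.bam", "isoquant_out", threads=8
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run.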

Overall you are right about extremely high-coverage genomic regions. Since I'm working on large single-cell datasets, I aim to work on performance improvements at some point.
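To get a rough idea of where the reads are concentrated, one option is to look at per-chromosome read counts from `samtools idxstats` on the indexed BAM file. The sketch below only parses that tab-separated output (reference name, length, mapped reads, unmapped reads); the helper name and the example numbers are made up for illustration, and per-chromosome counts are only a coarse proxy for the high-coverage gene regions mentioned above.

```python
def rank_chromosomes_by_reads(idxstats_text):
    """Parse `samtools idxstats` output; return (name, mapped_reads) sorted by count."""
    counts = []
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # final row holds reads with no reference assigned
            continue
        counts.append((name, int(mapped)))
    return sorted(counts, key=lambda item: item[1], reverse=True)


# Made-up numbers, for illustration only.
example = (
    "chr1\t248956422\t1200000\t0\n"
    "chr2\t242193529\t300000\t0\n"
    "chrM\t16569\t5000000\t0\n"
    "*\t0\t0\t12345\n"
)

for name, mapped in rank_chromosomes_by_reads(example):
    print(f"{name}\t{mapped}")
```

In practice the input text would come from running `samtools idxstats` on your own BAM file.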

> Is it possible to get a live output in the command line like "xxxx/yyyyy reads processed" for each chromosome? The log has been stuck at these two lines for quite a while. I have no idea what is happening (like if there is an error in parallelization or just too many reads)

That sounds like a cool idea for long-processing chromosomes. Either I will do that for the next release, or I will allow different threads to process different parts of the same chromosome (which is a bit tricky architecture-wise).
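The kind of progress output requested could look roughly like the sketch below. This is not IsoQuant code: `with_progress`, the logger name, and the reporting interval are hypothetical, and the total read count per chromosome is assumed to be known in advance.

```python
import logging

logger = logging.getLogger("progress_sketch")  # hypothetical logger name


def with_progress(reads, total, chromosome, every=100000):
    """Yield reads unchanged, logging 'processed/total' every `every` reads."""
    processed = 0
    for read in reads:
        yield read
        processed += 1
        if processed % every == 0 or processed == total:
            logger.info("%s: %d/%d reads processed", chromosome, processed, total)


logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

# Stand-in for iterating over the reads of one chromosome.
for _read in with_progress(range(250), total=250, chromosome="chr1", every=100):
    pass  # per-read work would go here
```

With these settings a message is logged after 100, 200 and 250 reads, so a long-running chromosome would still show signs of life in the log.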

> Would you recommend any demultiplexing/barcode extraction tools for sc long-read ONT data from 10X cDNA to use with isoquant? I guess after running IsoQuant I have to write some code assigning the transcripts to each individual cell for now

I'm using something of my own, which is quite naive, but based on simulated data yields high (99%+) precision and okay recall. It's available for beta-testing if you'd like to try it out, but it's not published or released yet.
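For the "write some code assigning the transcripts to each individual cell" part of the question, a minimal sketch is given below. It assumes two mappings are already available: read ID to assigned transcript (to be parsed from IsoQuant's per-read output) and read ID to cell barcode (from whichever barcode tool is used). The function name, the IDs, and the barcodes are all hypothetical.

```python
from collections import Counter, defaultdict


def count_transcripts_per_cell(read_to_transcript, read_to_barcode):
    """Build {barcode: Counter({transcript_id: read_count})} from two read-level maps."""
    counts = defaultdict(Counter)
    for read_id, transcript_id in read_to_transcript.items():
        barcode = read_to_barcode.get(read_id)
        if barcode is None:  # read without a detected barcode is skipped
            continue
        counts[barcode][transcript_id] += 1
    return counts


# Toy data, for illustration only.
read_to_transcript = {"read1": "tx_A", "read2": "tx_A", "read3": "tx_B", "read4": "tx_B"}
read_to_barcode = {"read1": "AAAC", "read2": "AAAC", "read3": "GGTT"}

per_cell = count_transcripts_per_cell(read_to_transcript, read_to_barcode)
for barcode, transcripts in sorted(per_cell.items()):
    print(barcode, dict(transcripts))
```

Reads without a barcode are dropped here; whether to keep them in a separate bucket is a design choice that depends on the downstream analysis.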

Best,
Andrey

andrewprzh commented 2 months ago

IsoQuant 3.4 has been released. It has better performance, especially for long-processing chromosomes like the one in the figure above. On my test dataset, running time decreased from 3 days to 3-4 hours. Closing this issue for now.