ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

Memory exceeded when running single cell data alignment #189

Closed: Qirongmao97 closed this issue 1 week ago

Qirongmao97 commented 3 months ago

[attached image: plot_zoom_png]

Hi,

I tried running a task with a memory limit of 250 GB, but it kept exceeding the limit.

I suspect the problem might be related to how I'm using the --read_group input. Right now, I'm using the putative_bc.csv file from BLAZE to assign barcodes. Do you know a better way to use BLAZE results with IsoQuant?
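For context, a minimal sketch of how one might reduce putative_bc.csv to a two-column read-to-barcode table (column positions are assumed here; check the output of your BLAZE version):

```python
import pandas as pd

# Assumed layout: read ID in the first column, barcode in the second.
bc = pd.read_csv("putative_bc.csv")
bc = bc.dropna(subset=[bc.columns[1]])  # drop reads without a barcode call
bc.iloc[:, :2].to_csv("read_groups.tsv", sep="\t", index=False, header=False)
```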

Originally posted by @Qirongmao97 in https://github.com/ablab/IsoQuant/issues/165#issuecomment-2115461354

lianov commented 3 months ago

Agreed, we are also seeing this with the latest version (v3.4.1) and the same approach for single-cell data (using --read_group). Previously (version 3.3.1), the same sample used ~194 GB; now it uses 2.08 TB and does not respect the memory limit, which we also set to 250 GB.

andrewprzh commented 3 months ago

@Qirongmao97 @lianov

How do you set the memory limit? I think exceeding it might be related to Python multiprocessing.

It seems like the RAM peak occurs right at the end, when the results are being merged. I will run IsoQuant on some single-cell data I have and check its memory consumption.
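One simple way to see where the peak lands is to log per-process peak RSS from each worker, since measuring only the parent misses memory allocated in forked children. A minimal sketch (illustrative only, not IsoQuant code):

```python
import os
import resource
from multiprocessing import Pool

def peak_rss_mb():
    # ru_maxrss is reported in KB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def work(chunk):
    data = [x * 2 for x in chunk]  # stand-in for real per-chunk processing
    return os.getpid(), peak_rss_mb(), len(data)

if __name__ == "__main__":
    chunks = [list(range(1_000_000)) for _ in range(4)]
    with Pool(processes=4) as pool:
        for pid, rss, n in pool.map(work, chunks):
            print(f"worker {pid}: peak RSS ~{rss:.0f} MB after {n} items")
    print(f"parent: peak RSS ~{peak_rss_mb():.0f} MB")
```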

Best Andrey

lianov commented 3 months ago

@andrewprzh : in my case it was set as part of the NextFlow job from our nf-core/scnanoseq pipeline. Upon checking the nextflow logs, it reported much higher usage without causing the job to fail in SLURM (which is another issue as well, as it should have failed). If you need more details, please let me know. Thanks for looking into this!

andrewprzh commented 3 months ago

I tested IsoQuant on a very large dataset with 70K barcodes, and it does take a lot of RAM. I'll start investigating the issue; I think it might be partially caused by Python's multiprocessing mechanisms.

I'll keep you updated.

Best Andrey

Qirongmao97 commented 3 months ago

Hi Andrew (@andrewprzh),

I'm running IsoQuant on a Visium dataset with only 5K barcodes. Technically, it should not require much RAM, right? I was wondering if there might be an issue with the input from the BLAZE demultiplexing step. I am currently also trying the nf-core/scnanoseq pipeline, but it would be great if you could share your pipeline for processing single-cell data with IsoQuant.

Thanks!

ljwharbers commented 3 months ago

Hi @andrewprzh ,

Just commenting to let you know that I'm running into the same issue. I have data from a custom spatial transcriptomics protocol with closer to a few million 'barcodes'. While I have some nodes with multiple TB of memory available, after reading these comments I'm afraid that won't be enough for my dataset. I have a run scheduled with 2 TB of RAM this evening, so I will update when I know more.

Do you have a potential fix in mind, or could you point me to the chunk of code where this most likely occurs so I can have a look as well?

Edit: As expected, it sadly also runs out of memory with 2 TB allocated.

Thanks, Luuk

andrewprzh commented 3 months ago

@Qirongmao97

Currently I use a barcode calling tool of my own, which will become part of IsoQuant at some point. In fact, I don't think it matters how the barcodes are called; it's the number of distinct barcodes that matters. How many barcodes do you have in total?

Best Andrey

andrewprzh commented 3 months ago

@ljwharbers

A few million barcodes is really a lot, and it's somewhat expected to consume a lot of RAM. Do all of these barcodes represent real cells, or is there a chance to apply some filtering?

Best Andrey

Qirongmao97 commented 3 months ago

@andrewprzh

Hi, in this Visium dataset we have 3,700 cells (spots).

ljwharbers commented 3 months ago

@andrewprzh

I realize it's a bit of an extreme scenario :') These barcodes represent real spatial coordinates (so not really cells, but for analysis purposes it doesn't matter). Each barcode has only very few unique reads (and thus genes/transcripts) associated with it.

I think the main problem, if I understand correctly, is that a 'cell' x gene/transcript matrix is always generated, and this consumes a huge amount of memory.

Could a solution be to have an option to not generate the output in this 'wide' format, but in a 'long' format instead? I can imagine that this would save a lot of memory.
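Back-of-the-envelope, the difference is large; with purely illustrative numbers:

```python
# Purely illustrative numbers, not measurements from IsoQuant.
n_barcodes, n_features = 1_000_000, 50_000
nonzero_per_barcode = 20  # only a few expressed features per barcode

dense_cells = n_barcodes * n_features          # every cell stored, mostly zeros
long_rows = n_barcodes * nonzero_per_barcode   # one row per nonzero count

print(f"wide: ~{dense_cells * 8 / 1e12:.1f} TB at 8 bytes per value")
print(f"long: ~{long_rows * 3 * 8 / 1e9:.1f} GB at 3 fields per row")
```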

andrewprzh commented 2 months ago

@ljwharbers

Yes, the matrix is always stored in some way. Previously, IsoQuant output the "long" format, but then we decided to use the "wide" format for everything. I'll see what I can do to make a workaround.

andrewprzh commented 2 months ago

@Qirongmao97

3700 is not really a lot... I also see that the RAM peak occurs at the end, probably when the counts are merged into a single table.

ljwharbers commented 2 months ago

> @ljwharbers
>
> Yes, the matrix is always stored in some way. Previously, IsoQuant output the "long" format, but then we decided to use the "wide" format for everything. I'll see what I can do to make a workaround.

Thanks, that would be amazing!

lianov commented 2 months ago

@andrewprzh : thank you for your work on this once again. Do you foresee a fix for this issue in the near future? We are getting close to the final nf-core review of scnanoseq, and for now we have chosen to downgrade IsoQuant to 3.3.1 as a temporary fix. If you think a fix might land in the near future, could you let us know so we can attempt to update the pipeline to a new version of IsoQuant before the first release?

If not, no problem - we will aim to release a patch as soon as it is available. Thank you again.

andrewprzh commented 2 months ago

@lianov

Unfortunately, I'm quite busy with other projects and am trying to work on IsoQuant in between. I think using 3.3.1 for now is a good solution, since I cannot predict the timeline at the moment... I will keep you updated anyway.

Best Andrey

lianov commented 2 months ago

@andrewprzh : No problem, totally get it and thank you for the quick reply. We will move forward with this plan in the meantime.

andrewprzh commented 2 months ago

Makes sense, good luck and stay tuned :)

andrewprzh commented 1 month ago

@lianov

The new 3.4.2 release consumes significantly less memory than 3.4.1.

However, there still might be issues with single-cell data, which I'm still working on.

Best Andrey

lianov commented 1 month ago

@andrewprzh : thank you for the update. @atrull314 and I will definitely be looking into this new release's performance on single-cell data. Thank you for your continued updates!

andrewprzh commented 1 month ago

@Qirongmao97 @lianov @ljwharbers

The new IsoQuant 3.5 should consume far less RAM when using read groups, for gene, transcript, and exon counts as well.

It also outputs grouped counts in both matrix and linear formats.
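Roughly, the two layouts look like this (illustrative only; the actual column names and file layouts may differ):

```python
wide = {                      # matrix: one column per barcode, mostly zeros
    "transcript": ["tx1", "tx2"],
    "BC_AAAC":    [5, 0],
    "BC_TTTG":    [0, 3],
}
long_rows = [                 # linear: one row per nonzero count
    ("tx1", "BC_AAAC", 5),
    ("tx2", "BC_TTTG", 3),
]
```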

Best Andrey

lianov commented 1 month ago

@andrewprzh : Great, we will be trying this out ASAP. Thank you again for your updates.

ljwharbers commented 1 month ago

@andrewprzh this is amazing, thanks! I'm testing it now and it runs smoothly so far, no memory issues (and this is with ~50 million barcodes!). Amazing work!

lianov commented 1 month ago

@andrewprzh : just to follow up on our end, we are also seeing memory improvements with this latest version after some preliminary tests (~80 GB with a PromethION dataset). We will continue to test on other datasets and upgrade the pipeline ASAP so it can be released with IsoQuant 3.5.

lianov commented 2 weeks ago

Following up here to close the loop on our end: we fully tested this version across our datasets and can confirm the better performance. Quantification sensitivity in this latest version is also much better than before! Thanks for all the improvements! This latest version is implemented in the scnanoseq pipeline, and we are very close to releasing it on our end.

andrewprzh commented 2 weeks ago

Thanks a lot for getting back, and happy to hear about the positive results! And thank you for embedding IsoQuant into your pipeline!

ljwharbers commented 1 week ago

Also a follow-up from my side: I've run the latest version with >50 million barcodes and there are no memory issues anymore. The run time is (very) long due to outputting in dense matrix format, typically days for my dataset. After simply commenting out the lines that write the matrix format, everything processed in a couple of hours.

Super impressed with the speed and sensitivity. I will also be including IsoQuant in my nf-core pipeline (which is still a bit away from being released).

Thanks for your continued work and your quick responses!

lianov commented 1 week ago

@ljwharbers : that's good info on tracking down the source of the run time. On most of our datasets it takes about ~8 hr with default threads, but this is helpful to us and maybe an area where we can also contribute in the future.

ljwharbers commented 1 week ago

I think the best option would ultimately be to keep the intermediate files in the 'linear' format during processing and only transform them into a (sparse) matrix or linear output in the final merging step, depending on what the user requests. This would save a lot of time even if the user wants the output in matrix format.

@lianov I simply have a small script that converts the linear format into a sparse mtx, which is compatible with (almost) all downstream single-cell processing tools.
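Something along these lines (the input column layout is assumed here; adjust to the actual IsoQuant output):

```python
import pandas as pd
import scipy.io as sio
import scipy.sparse as sp

# Assumed columns: feature, barcode, count (check the real header first).
df = pd.read_csv("grouped_counts.linear.tsv", sep="\t",
                 names=["feature", "barcode", "count"])

features = pd.Index(df["feature"].unique())
barcodes = pd.Index(df["barcode"].unique())

# Build a sparse feature x barcode matrix from the (row, col, value) triplets.
matrix = sp.coo_matrix(
    (df["count"],
     (features.get_indexer(df["feature"]),
      barcodes.get_indexer(df["barcode"]))),
    shape=(len(features), len(barcodes)),
)

# CellRanger-style trio that most downstream single-cell tools can read.
sio.mmwrite("matrix.mtx", matrix)
features.to_series().to_csv("features.tsv", index=False, header=False)
barcodes.to_series().to_csv("barcodes.tsv", index=False, header=False)
```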

While writing this, I see that @andrewprzh just released v3.5.1 already with the ability for the user to specify the output format. Amazing work once again!

andrewprzh commented 1 week ago

@ljwharbers

Thanks for the feedback! For now I have implemented a simple --counts_format option, but I'll rework the counts output in a more optimal way to avoid merging large files. Interestingly, the linear format was previously the default for grouped counts with a large number of groups, but at some point we decided to switch to the matrix format.

I'll close this issue for now, feel free to reopen or start a discussion if needed.