BayraktarLab / cell2location

Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)
https://cell2location.readthedocs.io/en/latest/
Apache License 2.0

Performance issues on a large (120724) location dataset #375

Open Dillon214 opened 1 month ago

Dillon214 commented 1 month ago

Hello cell2location devs,

I am working with a large Visium dataset (12,240 features, 120,724 locations) and am running into an issue when training the cell2location.models.Cell2location model. The issue is very simple: it is just too slow. I was getting a speed of about 10 seconds per iteration, and across 30,000 iterations that gives an estimated completion time of about 80 hours. I am running in GPU mode, and throwing additional A40 GPUs at the problem didn't seem to improve speed; I was using 8 with my latest run. I also tried batching the data so each batch was about 30,000 cells, which also didn't improve speed.

Do you have any advice for me? I would greatly appreciate it, as I have used this package before on a smaller dataset to great success, and it was done training in only a few hours.

sincerely, Dillon Brownell

vitkl commented 1 month ago

Hi @Dillon214

Exciting data you have. Please have a look at this issue for practical suggestions about working with large data: https://github.com/BayraktarLab/cell2location/issues/356

We find the best performance when training cell2location with a batch size equal to the full data. This means the dataset size is limited by GPU memory: roughly 18k genes * 60k locations for an 80GB A100.

Are you using batch_size=None?
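
For reference, a minimal sketch of full-data training based on the tutorial (`mod` is the Cell2location model object; exact argument names may differ by version, e.g. newer scvi-tools versions use accelerator="gpu" instead of use_gpu):

```python
# Full-data training as in the tutorial: the whole dataset is treated as one batch.
mod.train(
    max_epochs=30000,
    batch_size=None,  # None = load the full dataset onto the GPU as a single batch
    train_size=1,     # use all locations for training
    use_gpu=True,     # older scvi-tools API; newer versions use accelerator="gpu"
)
```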

getting a speed of about 10 seconds per iteration

This is very slow. Could you confirm that the GPU is used?
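
A quick way to check from Python (plain PyTorch, nothing cell2location-specific):

```python
import torch

# If this prints False, cell2location silently trains on the CPU.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the device that will actually be used, e.g. "NVIDIA A40".
    print("Device:", torch.cuda.get_device_name(0))
```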

throwing additional A40 GPUs at the problem didn't seem to improve speed; I was using 8 with my latest run

Cell2location doesn't support using multiple GPUs - only one GPU was used.

it was done training in only a few hours

This sounds expected.

Dillon214 commented 1 month ago

Hi Vitkl,

Thanks for the speedy reply. No, I was not using batch_size=None; I was adjusting the batch_size argument to avoid out-of-memory errors. I'm no expert, but I'm guessing the size of the dataset exceeds the memory capacity of a single GPU. And yes, I can confirm that a GPU is being used; see the attached image.

Based on the suggestions in the issue you posted, it seems like splitting the dataset into chunks, perhaps stratified by batch and other relevant variables, is a good way to avoid the slowdown. Do you still recommend this? I tested it briefly, and it seemed to speed up processing dramatically. I suppose the results can be re-merged afterwards.

-Dillon

[attached screenshot confirming GPU usage]

vitkl commented 1 month ago

Using a batch_size other than batch_size=None is indeed expected to take anywhere from many days to more than a week (10 seconds per epoch is actually pretty fast). More importantly, training with a batch size equal to the full data gives higher accuracy, so we don't recommend minibatch training.

it seems like splitting the dataset into chunks, perhaps stratified by batch and other relevant variables, is a good way to avoid the slowdown. Do you still recommend this?

Yes, but you also need to set batch_size=None.

With batch_size=None, the entire dataset is loaded into GPU memory once, while with minibatch training (batch_size=X), X observations are copied to GPU memory at every training step.
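
A rough sketch of what I mean (the "sample" column, the `inf_aver` reference signatures and the hyperparameters are placeholders from the tutorial / your own workflow, not something I can see from here):

```python
import numpy as np
import cell2location

# Split the Visium AnnData into chunks of whole samples, then train one
# Cell2location model per chunk with batch_size=None so that each chunk
# fits into GPU memory as a single batch.
sample_ids = adata_vis.obs["sample"].astype(str).unique()
chunks = np.array_split(sample_ids, 4)  # e.g. 4 chunks of ~30k locations each

models, adatas = {}, {}
for i, chunk_samples in enumerate(chunks):
    adata_chunk = adata_vis[adata_vis.obs["sample"].astype(str).isin(chunk_samples)].copy()

    cell2location.models.Cell2location.setup_anndata(adata_chunk, batch_key="sample")
    mod = cell2location.models.Cell2location(
        adata_chunk,
        cell_state_df=inf_aver,   # reference cell-type signatures from the regression model
        N_cells_per_location=30,
        detection_alpha=20,
    )
    mod.train(max_epochs=30000, batch_size=None, train_size=1)

    # Export posterior cell abundance estimates into this chunk's AnnData.
    adatas[i] = mod.export_posterior(adata_chunk)
    models[i] = mod
```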

Dillon214 commented 1 month ago

Hi Vitkl,

I followed the code you posted in that other thread for subsetting and individually processing multiple objects, and experienced much better training times. One final question: say I now want to compute expected expression per cell type, as was done in the tutorial pictured below and at the following link: https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html#Estimate-cell-type-specific-expression-of-every-gene-in-the-spatial-data-(needed-for-NCEM).

Would it be fine to perform this process for each individually processed object, then merge the results, or do you think a different approach should be taken?
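
Roughly, what I have in mind for the merge is something like this (hypothetical chunk objects; assuming each per-chunk AnnData already carries one layer per cell type from that tutorial step):

```python
import anndata as ad

# Re-assemble the individually processed chunks along the location (obs) axis.
# Layers that exist in every chunk (one per cell type) are concatenated as well.
adata_merged = ad.concat(
    [adata_chunk_1, adata_chunk_2, adata_chunk_3],  # hypothetical chunk objects
    axis=0,         # stack locations
    join="inner",   # keep only genes present in all chunks
    merge="same",   # keep var annotations that are identical across chunks
)
```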

-Dillon

[attached screenshot of the tutorial section "Estimate cell-type specific expression of every gene in the spatial data (needed for NCEM)"]