BayraktarLab / cell2location

Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics (cell2location model)
https://cell2location.readthedocs.io/en/latest/
Apache License 2.0

Multi-GPU training #106

Open mehdiborji opened 2 years ago

mehdiborji commented 2 years ago

I am trying cell2location on GCP with multiple GPUs, but it seems to be using only one. How can I train on multiple GPUs? Multiple cheaper GPUs like the K80 could potentially fit bigger spatial datasets. If you have a recommended GPU setup for many thousands of spots, I'd be happy to hear about it.

vitkl commented 2 years ago

Hi Mehdi

I assume you mean data parallelism over multiple GPUs, i.e. putting the data for distinct locations and the corresponding local parameters (e.g. cell abundance) on different GPU devices. This is not possible at the moment. Using multiple GPUs with pyro + scvi-tools requires some non-trivial coding (some discussion here: https://github.com/YosefLab/scvi-tools/issues/1226). If you know how to implement this feature, please consider contributing to cell2location & scvi-tools.

Happy holidays!

Vitalii


kuang-da commented 1 year ago

Hi there,

Are there any plans for this issue, now that https://github.com/scverse/scvi-tools/issues/1226 has been closed?

Also, I wonder how to estimate how much memory training the model takes. I am a bit surprised that 49 GB of GPU RAM is not enough for my Visium matrix of shape 97333 x 16928 in float32.

vitkl commented 1 year ago

The distributed sampler in https://github.com/scverse/scvi-tools/issues/1226 addresses the case where, in every training step, a subset of the data is loaded into GPU memory (e.g. 100 cells). In that case, if you have e.g. 4 GPUs, the sampler splits the 100 cells into 4 chunks of 25 cells and loads one chunk onto each GPU. This form of parallelization is mainly useful for very large models (e.g. transformers) and when individual data batches must be very large. For simpler models on tabular gene expression or ATAC data (e.g. factorisation, VAE), the minibatches and the model are not large enough to require this mode of parallelization.

The main mode of operation of cell2location is training the model on all of the data rather than on minibatches, which improves accuracy. So far I have not seen any approach that delivers the same accuracy using minibatch training, with or without amortisation (using NN encoders). The cell2location package provides minibatch training, both with and without amortisation, but at lower accuracy.

A multi-GPU setup could allow loading larger datasets. The problem is that the data needs to be loaded once and kept on multiple GPUs, while the parameters are updated across GPUs. This is not supported by standard multi-GPU workflows in PyTorch Lightning and is a fairly niche application. It should be possible to modify DeviceBackedDataSplitter https://github.com/BayraktarLab/cell2location/blob/master/cell2location/models/_cell2location_model.py#L263-L268 and add other code to make PyTorch Lightning DDP work with this setup. However, I don't know how to do that, and it's a bigger project than I can currently commit to. Contributions are welcome.
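For reference, the difference between the two training modes discussed above comes down to the batch_size argument of the model's train() method. A minimal sketch, assuming mod is an already set up cell2location.models.Cell2location instance; argument names follow the tutorials current at the time of this thread and may differ in newer releases:

```python
# Full-data training (recommended for accuracy): DeviceBackedDataSplitter moves
# the whole dataset onto the GPU once, so GPU memory must hold all locations.
mod.train(max_epochs=30000, batch_size=None, train_size=1)

# Minibatch training (lower memory, lower accuracy): only `batch_size`
# locations are on the GPU at any one time.
mod.train(max_epochs=30000, batch_size=2500, train_size=1)
```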

kuang-da commented 1 year ago

Thank you for the detailed explanation!

A multi-GPU setup could allow loading larger datasets. The problem is that the data needs to be loaded once and kept on multiple GPUs, while the parameters are updated across GPUs. This is not supported by standard multi-GPU workflows in PyTorch Lightning and is a fairly niche application.

Your insights make a lot of sense. As a workaround, I'm considering using an EC2 instance with an A100 GPU for my project.

That being said, given the growing scale of Visium slides in large cell atlas projects like HuBMAP and HCA, I believe the following enhancements could benefit the user community:

(1) Introduce a feature in mod.view_anndata_setup() to estimate the model size.
(2) Allow the option to cast data from float32 to float16 to save on memory usage.

I'd be more than happy to contribute to the development of these features if you find them to be worthwhile additions.

Once again, thank you for developing and maintaining such a valuable computational tool for spatial data!

vitkl commented 1 year ago

Thanks for the suggestions!

Point 1 is quite practical. I have been working off back-of-the-envelope calculations: you need an A100 80GB for a 50k locations x 16k genes dataset and a V100 32GB for a 16k x 16k dataset. To predict this quantitatively we would need to run systematic experiments, but GPU usage is harder to track (I am experimenting with wandb at the moment). Feel free to contribute.
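To make point 1 concrete, here is a minimal sketch of such a heuristic: start from the size of the dense count matrix on the GPU and multiply by an overhead factor for intermediate location-by-gene tensors, parameters, gradients and optimiser state. The overhead factor is an assumption read off the figures quoted in this thread, not a measured or documented cell2location constant.

```python
def estimate_gpu_gb(n_locations: int, n_genes: int,
                    bytes_per_value: int = 4, overhead: float = 20.0) -> float:
    """Very rough GPU memory estimate (GB) for full-data training.

    `overhead` is a guess: the numbers in this thread suggest total usage is
    roughly 10-30x the raw dense matrix size.
    """
    matrix_gb = n_locations * n_genes * bytes_per_value / 1024**3
    return matrix_gb * overhead

# Examples from this thread (matrix size alone is ~3, ~1 and ~6 GB respectively):
print(estimate_gpu_gb(50_000, 16_000))   # ~60 GB  -> consistent with "needs A100 80GB"
print(estimate_gpu_gb(16_000, 16_000))   # ~19 GB  -> consistent with "fits V100 32GB"
print(estimate_gpu_gb(97_333, 16_928))   # ~123 GB -> explains why 49 GB was not enough
```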

We tested point 2 a few years ago. It didn't work, due to numerical accuracy issues (training fails with NaN) or even a lack of a 16-bit implementation of the NB likelihood special functions (I don't remember which issue it was in PyTorch). My guess is that float16 and bfloat16 work well when all parameters are applied to scaled data, which is not the case in models of this type. I would be surprised if scVI worked with 16-bit precision. This is very easy to try: simply add an argument, model.train(..., precision="bf16") or model.train(..., precision=16).
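For completeness, the experiment suggested above would look roughly like this; precision is a PyTorch Lightning Trainer option forwarded through scvi-tools' train(), and whether it trains stably for cell2location is exactly what would need checking:

```python
# Sketch of the 16-bit experiment; `mod` is assumed to be an already set up
# cell2location model. Expect possible NaN losses, as described above.
mod.train(max_epochs=30000, batch_size=None, train_size=1, precision="bf16")
# or float16 mixed precision:
# mod.train(max_epochs=30000, batch_size=None, train_size=1, precision=16)
```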

We also have very large projects in the lab (100+ sections). So far the approach has been to apply cell2location to chunks of several sections that fit onto an A100 or V100 GPU.

A potentially better idea, which we plan to test soon, is to randomly split the data into N chunks that each fit onto one GPU (80GB A100), stratifying by experimental batch/section so that every analysis chunk contains locations from all sections, aiming for about 50k locations x 15k genes per chunk. Then simply merge the resulting cell abundance estimates into one object (it would help to have methods that correctly merge adata.uns['mod']).
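A minimal sketch of that stratified splitting step, assuming the Visium data live in one AnnData object adata_vis with a per-section label in adata_vis.obs (the names adata_vis, "sample" and n_chunks are illustrative, not part of the cell2location API):

```python
import numpy as np

def stratified_chunks(adata_vis, batch_key="sample", n_chunks=4, seed=0):
    """Assign every location to one of n_chunks so that each chunk
    contains locations from all sections (stratified by batch_key)."""
    rng = np.random.default_rng(seed)
    chunk_id = np.empty(adata_vis.n_obs, dtype=int)
    for section in adata_vis.obs[batch_key].unique():
        # shuffle the locations of this section, then deal them out round-robin
        idx = np.where((adata_vis.obs[batch_key] == section).values)[0]
        rng.shuffle(idx)
        chunk_id[idx] = np.arange(idx.size) % n_chunks
    return [adata_vis[chunk_id == k].copy() for k in range(n_chunks)]
```

Each chunk would then go through the usual cell2location workflow independently (one job / one GPU per chunk), and the per-chunk cell abundance estimates concatenated afterwards; correctly merging adata.uns['mod'] is the missing piece mentioned above.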

kuang-da commented 1 year ago

Thank you for the suggestions. The heuristic estimates of roughly 80 GB and 32 GB of GPU memory are very useful.

A potentially better idea, which we plan to test soon, is to randomly split the data into N chunks that each fit onto one GPU (80GB A100), stratifying by experimental batch/section so that every analysis chunk contains locations from all sections, aiming for about 50k locations x 15k genes per chunk. Then simply merge the resulting cell abundance estimates into one object (it would help to have methods that correctly merge adata.uns['mod']).

This is a great idea, and it would be even more efficient if the stratified chunks could be dispatched to multiple GPUs.

vitkl commented 1 year ago

Multi-GPU joint training is nice but a bit more complicated to implement.

Here I suggest training N independent models on stratified data chunks. You can run N independent jobs in parallel and then merge the results at the end of training. This is easier to realise as a separate training script rather than as part of the package (perhaps a command-line script, but not a change to the standard training interface).

kuang-da commented 1 year ago

This is easier to realise as a separate training script rather than as part of the package

That makes sense. Thank you for all the guidance!