alan-turing-institute / affinity-vae

Self-supervised method for disentanglement, clustering and classification of objects in multidimensional image data
BSD 3-Clause "New" or "Revised" License

Pytorch lightning fabric #279

Closed crangelsmith closed 8 months ago

crangelsmith commented 9 months ago

#278

This has been tested thoroughly on 1 GPU on Baskerville, and locally (where the default is CPU). There are some issues when trying to use multiple GPUs in one job, where this issue is sometimes encountered. It is not clear yet why, and we are investigating (suspected to be related to the Baskerville Slurm environment), but since 1 GPU is currently enough for the calculations, it should not affect our progress.

For review:

marjanfamili commented 8 months ago

Quick question: there isn't an option for the user to choose whether or not to use Fabric. Is this important? From the tests I have done, performance on a single GPU doesn't change significantly. Is that why there isn't an option?

crangelsmith commented 8 months ago

> Quick question: there isn't an option for the user to choose whether or not to use Fabric. Is this important? From the tests I have done, performance on a single GPU doesn't change significantly. Is that why there isn't an option?

If you are running on a single GPU, it is basically the same as not using Fabric, since its code implements PyTorch's default device handling under the hood. This would change if you try to use multiple GPUs or different strategies (DDP, FSDP).

Have you tried running on 2 GPUs with DDP? In my benchmarking this halves the running time (if Baskerville allows it...).