alan-turing-institute / affinity-vae

Self-supervised method for disentanglement, clustering and classification of objects in multidimensional image data
BSD 3-Clause "New" or "Revised" License

Pytorch lightning fabric #279

Closed crangelsmith closed 8 months ago

crangelsmith commented 9 months ago

#278

This has been tested thoroughly on 1 GPU on Baskerville, and locally (where the default is CPU). There are some issues when trying to use multiple GPUs in one job, where this issue is sometimes encountered. It is not clear yet why, and we are investigating (suspected to be related to the Baskerville Slurm environment), but since 1 GPU is currently enough for the calculations, it should not affect our progress.

For review:

marjanfamili commented 8 months ago

Quick question: there isn't an option for the user to choose whether or not to use Fabric. Is this important? From the tests I have done, performance on a single GPU doesn't change significantly. Is that why there isn't an option?

crangelsmith commented 8 months ago

> Quick question: there isn't an option for the user to choose whether or not to use Fabric. Is this important? From the tests I have done, performance on a single GPU doesn't change significantly. Is that why there isn't an option?

If you are running on a single GPU, it is basically the same as not using Fabric, since its code implements PyTorch's default device handling under the hood. This would change if you try to use multiple GPUs or different strategies (DDP, FSDP).

Have you tried running on 2 GPUs with DDP? In my benchmarking this halves the running time (if Baskerville allows it...).