Multiple GPUs provision

cbagwe commented 1 month ago

Currently, in config.yaml, only one GPU is accepted as an integer. It would be great to add multiple GPUs here to speed up the process.

Thanks, Chaitali

SimWdm commented 1 month ago

Hi Chaitali,

thanks for reaching out! I will implement multi GPU fitting in a separate branch early next week. It should be fairly easy to do.

Best, Simon

SimWdm commented 1 month ago

Hi Chaitali,

I have been looking into this today, but it seems more complicated than expected: the data loader uses a custom sampler that is not compatible with PyTorch Lightning's DDP. I will post here again once I have updates.
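For context, one common way to make a custom sampling order work under distributed data parallel training is to shard the index stream by rank, padding it so that every process sees the same number of samples. The following is only a minimal, framework-free sketch of that idea. `shard_indices` is a hypothetical helper for illustration, not code from DeepDeWedge or PyTorch Lightning, and it says nothing about how the custom sampler here is actually implemented.

```python
from typing import List, Sequence


def shard_indices(indices: Sequence[int], rank: int, world_size: int) -> List[int]:
    """Return the part of `indices` that process `rank` should load.

    Hypothetical helper: illustrates rank-based sharding only.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    indices = list(indices)
    if not indices:
        return []
    # Pad by repeating indices from the start so that the total length is
    # divisible by world_size and every rank gets the same number of samples.
    remainder = len(indices) % world_size
    if remainder:
        pad = world_size - remainder
        indices += (indices * (pad // len(indices) + 1))[:pad]
    # Rank r takes every world_size-th index, starting at position r.
    return indices[rank::world_size]


# Example: an order produced by some custom sampler, split across 2 processes.
custom_order = [4, 0, 3, 1, 2]
shards = [shard_indices(custom_order, rank, 2) for rank in range(2)]
print(shards)
```

With an odd number of samples, one index is repeated so that both processes run the same number of steps, which DDP needs in order to keep gradient synchronization aligned.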

If you have experience with such things, please feel free to share your thoughts.

Best, Simon

cbagwe commented 3 weeks ago

Hi Simon,

Thank you for looking into it. Eagerly waiting for an update.

Regards, Chaitali

SimWdm commented 2 weeks ago

Hi @cbagwe and everyone who's interested,

sorry that this is taking a bit longer than expected, but I think I have found a way to make multi GPU training work with the custom data loader. I am currently re-running the tutorial on 2 GPUs to see if everything works as expected.

I will post here again once I have any updates!

Best, Simon

SimWdm commented 2 weeks ago

The test run has finished successfully! 🎉

You can test the multi GPU version on the new branch 15-multiple-gpus-provision. To switch to the new branch, you can use git fetch, followed by git checkout 15-multiple-gpus-provision.

Now the gpu argument in the config.yaml file can also be a list of GPUs, e.g. gpu: [0,1]. You can also specify which distributed backend to use by setting distributed_backend (the default is nccl, which is also PyTorch Lightning's default and should be a good choice in most cases).
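To illustrate what accepting both forms of the gpu setting can look like, here is a small sketch that turns either `gpu: 0` or `gpu: [0, 1]` into a list of device indices. `normalize_gpu_setting` is a hypothetical helper written for this example; it is an assumption about one reasonable approach, not the parsing code used on the branch.

```python
from typing import List, Union


def normalize_gpu_setting(gpu: Union[int, List[int]]) -> List[int]:
    """Turn `gpu: 0` or `gpu: [0, 1]` into a list of device indices.

    Hypothetical helper: not taken from the DeepDeWedge code base.
    """

    def is_index(value: object) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    if is_index(gpu):
        devices = [gpu]
    elif isinstance(gpu, (list, tuple)) and all(is_index(g) for g in gpu):
        devices = list(gpu)
    else:
        raise TypeError("gpu must be an integer or a list of integers")

    if not devices or any(d < 0 for d in devices):
        raise ValueError("gpu must name at least one non-negative device index")
    if len(set(devices)) != len(devices):
        raise ValueError("gpu must not list the same device twice")
    return devices


print(normalize_gpu_setting(0))       # single-GPU form
print(normalize_gpu_setting([0, 1]))  # multi-GPU form
```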

@cbagwe

Edit: Please let me know if the multi GPU branch works as expected or if there are any issues or things that can be improved. Once we're satisfied with the multi GPU version, I'll merge it into main.

cbagwe commented 2 weeks ago

Hi @SimWdm,

Sorry for the late reply; I was on vacation. I ran into the following error while fitting data:

/Users/bagwe/miniforge3/envs/ddw_env/bin/ddw:8 in <module>                                       │
│                                                                                                  │
│   5 from ddw.app import main                                                                     │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/app.py:15 in main          │
│                                                                                                  │
│   12                                                                                             │
│   13                                                                                             │
│   14 def main():                                                                                 │
│ ❱ 15 │   app()                                                                                   │
│   16                                                                                             │
│   17                                                                                             │
│   18 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/main.py:326 in __call__  │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/main.py:309 in __call__  │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:1157 in __call__ │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/core.py:723 in main      │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/core.py:193 in _main     │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:1686 in invoke   │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:943 in           │
│ make_context                                                                                     │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:1408 in          │
│ parse_args                                                                                       │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2400 in          │
│ handle_parse_result                                                                              │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2356 in          │
│ process_value                                                                                    │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2344 in          │
│ type_cast_value                                                                                  │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2316 in convert  │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/types.py:83 in __call__  │
│                                                                                                  │
│ /Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/types.py:411 in convert  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

Our research institute has a strict policy against the use of conda, so I am using mamba/miniforge. We can fit the model when the gpu setting in the config.yaml file is changed to an integer.

I am trying to figure out whether this is a dependency error caused by the use of mamba or something else.
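As a quick sanity check, the TypeError itself can be reproduced in plain Python without ddw, click, or any particular package manager: it is the error that `int()` raises when it is handed a list. The sketch below only mimics an integer conversion of an option value; `cast_like_int_option` is a hypothetical name, and this suggests (but does not prove) that the value is being converted as a single integer somewhere rather than that mamba is at fault.

```python
def cast_like_int_option(value):
    """Mimic converting an option value with int().

    Hypothetical helper: returns the converted integer, or the error text
    if the conversion fails with a TypeError.
    """
    try:
        return int(value)
    except TypeError as err:
        return f"TypeError: {err}"


print(cast_like_int_option(0))       # an integer converts without problems
print(cast_like_int_option([0, 1]))  # a list triggers the TypeError
```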

I hope you can help me with this.

Regards, Chaitali

SimWdm commented 2 weeks ago

Hi Chaitali,

that's strange, I don't get this error in my local copy. Did you re-install ddw after switching branches?
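One way to check this: the traceback shows ddw being loaded from site-packages, which suggests that the environment holds an installed copy of the package rather than the repository checkout, so switching branches alone may not change the code that actually runs. The following sketch prints where Python finds a package. `installed_location` is a hypothetical helper name, and it uses only the standard library.

```python
import importlib.util
from typing import Optional


def installed_location(package: str) -> Optional[str]:
    """Return the file a top-level package is loaded from, or None if it
    cannot be found (or has no single origin file)."""
    spec = importlib.util.find_spec(package)
    if spec is None:
        return None
    return spec.origin


# Run this inside the environment that is used for fitting.
print(installed_location("ddw"))
```

If the printed path points into site-packages, re-installing ddw from the checked-out branch should make the new gpu handling take effect.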

Best, Simon

MLI-lab / DeepDeWedge

Multiple GPUs provision #15