Open cbagwe opened 1 month ago
Hi Chaitali,
thanks for reaching out! I will implement multi-GPU fitting in a separate branch early next week. It should be fairly easy to do.
Best, Simon
Hi Chaitali,
I looked into this today, but it seems more complicated than expected: the data loader uses a custom sampler that is not compatible with PyTorch Lightning's DDP. I will post here again once I have updates.
If you have experience with such things, please feel free to share your thoughts.
Best, Simon
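For readers wondering why a custom sampler and DDP can clash: under distributed data parallel training, each process is expected to see its own share of the sample indices. The sketch below shows the general rank-sharding idea in plain Python (pad the index list so it divides evenly, then take every `world_size`-th index starting at `rank`), which is the approach PyTorch's `DistributedSampler` takes by default. The helper name `shard_indices` is hypothetical, and this is not necessarily how the ddw data loader is or will be adapted.

```python
from typing import List, Sequence


def shard_indices(indices: Sequence[int], rank: int, world_size: int) -> List[int]:
    """Return the subset of `indices` that process `rank` would handle.

    Hypothetical helper for illustration only; not part of ddw.
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("need world_size >= 1 and 0 <= rank < world_size")

    padded = list(indices)
    remainder = len(padded) % world_size
    if padded and remainder:
        # Repeat indices from the start so every rank gets the same count.
        pad = world_size - remainder
        padded += (padded * (pad // len(padded) + 1))[:pad]

    # Strided slice: rank 0 takes positions 0, world_size, 2 * world_size, ...
    return padded[rank::world_size]
```

With five samples and two processes, both ranks receive three indices each, and one index is repeated to even out the split.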
Hi Simon,
Thank you for looking into it. Eagerly waiting for an update.
Regards, Chaitali
Hi @cbagwe and everyone who's interested,
sorry that this is taking a bit longer than expected, but I think I have found a way to make multi-GPU training work with the custom data loader. I am currently re-running the tutorial on 2 GPUs to see if everything works as expected.
I will post here again once I have any updates!
Best, Simon
The test run has finished successfully!
You can test the multi-GPU version on the new branch `15-multiple-gpus-provision`. To switch to the new branch, you can use `git fetch`, followed by `git checkout 15-multiple-gpus-provision`.
Now the `gpu` argument in the `config.yaml` file can also be a list of GPUs, e.g. `gpu: [0,1]`. You can also specify which distributed backend to use by setting `distributed_backend` (default is `nccl`, which is also PyTorch Lightning's default and should be a good choice for most cases).
@cbagwe
Edit: Please let me know if the multi-GPU branch works as expected or if there are any issues or things that can be improved. Once we're satisfied with the multi-GPU version, I'll merge it into `main`.
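To illustrate what accepting both forms of the `gpu` setting could look like, here is a minimal sketch that turns either a single integer or a list of integers into a list of device indices. The function name `normalize_gpus` and its validation rules are assumptions made for this example; the code on the branch may handle this differently.

```python
from typing import List, Sequence, Union


def normalize_gpus(gpu: Union[int, Sequence[int]]) -> List[int]:
    """Turn `gpu: 0` or `gpu: [0, 1]` into a list of GPU indices.

    Hypothetical helper for illustration only; not taken from ddw.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(gpu, bool):
        raise TypeError("gpu must be an int or a list of ints, not bool")
    if isinstance(gpu, int):
        return [gpu]
    if isinstance(gpu, (list, tuple)):
        if not gpu:
            raise ValueError("gpu list must not be empty")
        if not all(isinstance(g, int) and not isinstance(g, bool) for g in gpu):
            raise TypeError("every entry in the gpu list must be an int")
        # Drop duplicates while keeping the order given in the config.
        return list(dict.fromkeys(gpu))
    raise TypeError(f"unsupported gpu value: {gpu!r}")
```

Normalizing early means the rest of the code only ever deals with a list, whether the config says `gpu: 0` or `gpu: [0,1]`.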
Hi @SimWdm,
Sorry for the late reply; I was on vacation. I ran into the following error while fitting the data:
/Users/bagwe/miniforge3/envs/ddw_env/bin/ddw:8 in <module>

    5 from ddw.app import main
    6 if __name__ == '__main__':
    7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
  ❱ 8     sys.exit(main())
    9

/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/app.py:15 in main

    12
    13
    14 def main():
  ❱ 15     app()
    16
    17
    18 if __name__ == "__main__":

/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/main.py:326 in __call__
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/main.py:309 in __call__
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:1157 in __call__
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/core.py:723 in main
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/typer/core.py:193 in _main
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:1686 in invoke
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:943 in make_context
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:1408 in parse_args
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2400 in handle_parse_result
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2356 in process_value
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2344 in type_cast_value
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/core.py:2316 in convert
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/types.py:83 in __call__
/Users/bagwe/miniforge3/envs/ddw_env/lib/python3.10/site-packages/click/types.py:411 in convert
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'
Our research institute has a strict policy against the use of conda, so I am using mamba/miniforge. We can fit the model when the gpu setting in the config.yaml file is changed to an integer.
I am trying to figure out whether this is a dependency error caused by using mamba or something else.
I hope you can help me with this.
Regards, Chaitali
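A note on reading the traceback above: the last frames are inside click's type conversion, and the message is the one Python raises when `int()` is handed a list. That would be consistent with the installed command still treating `gpu` as a plain integer, although that is an assumption and not something the traceback proves. The sketch below reproduces the same class of error in isolation; `to_int` is a hypothetical stand-in for an integer-typed option converter, not click's actual code.

```python
def to_int(value):
    """Hypothetical stand-in for an integer-typed option converter."""
    return int(value)


def describe(value) -> str:
    """Report whether `value` survives integer conversion."""
    try:
        return f"ok: {to_int(value)}"
    except TypeError as exc:
        # int() raises TypeError for a list; ValueError is a different case
        # (e.g. a non-numeric string) and is deliberately not caught here.
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(describe(0))       # an integer gpu setting converts fine
    print(describe([0, 1]))  # a list triggers the TypeError path
```

An integer value converts without trouble, which matches the observation that fitting works once the setting is changed back to a single integer.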
Hi Chaitali,
that's strange, I don't get this error in my local copy. Did you re-install `ddw` after switching branches?
Best, Simon
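One way to check which copy of the package actually runs: the traceback shows `ddw` being imported from `site-packages` in the environment, which suggests a regular (non-editable) install, in which case checking out another branch would not change the code being executed until the package is installed again. The sketch below uses only the standard library to print where a package is imported from; the helper name `package_location` is hypothetical.

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file a top-level package would be imported from, or None.

    Hypothetical helper for illustration only.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        # The package is not importable in this environment.
        return None
    # For a regular package this is the path to its __init__.py.
    return spec.origin


if __name__ == "__main__":
    # If this prints a path under site-packages and not the git checkout,
    # the environment is likely still using the previously installed copy.
    print(package_location("ddw"))
```

Run it inside the same environment that provides the `ddw` command so the result reflects the copy the CLI uses.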
Currently, in config.yaml, only one GPU is accepted as an integer. It would be great to add multiple GPUs here to speed up the process.
Thanks, Chaitali