getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

pykeops breaks multi-GPU training when using PyTorch Lightning #272

Open mvinyard opened 1 year ago

mvinyard commented 1 year ago

I am working with models in PyTorch Lightning and using the geomloss.SamplesLoss function from @jeanfeydy. Training on a single GPU works without issue. Unfortunately, importing any portion of the pykeops package seems to cause the error below when training on more than one GPU.

Within models built using pytorch_lightning (a LightningModule, for the uninitiated, is simply a subclass of torch.nn.Module), device assignments are handled automatically, so external or conflicting device assignments throw the error shown below. While this error must boil down to the way torch.Tensor objects are placed onto the GPU within pykeops, it's not immediately obvious to me how I might override or edit things to fix it. For example, if I have the following snippet in my own code:

my_tensor.to("cuda:0")

I can simply replace this with the corresponding lightning device assignment:

my_tensor.to(model.device)
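
For concreteness, here is the pattern I mean, sketched on a toy LightningModule (the module, shapes, and names are made up, not my actual model):

```python
import pytorch_lightning as pl
import torch


class ToyModule(pl.LightningModule):
    """Hypothetical module illustrating Lightning's automatic device handling."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Lightning has already moved the batch and the parameters to the
        # right GPU for this process; any extra tensor I create goes to
        # self.device rather than a hard-coded "cuda:0".
        noise = torch.randn_like(x, device=self.device)
        return torch.nn.functional.mse_loss(self.linear(x + noise), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)
```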

These details may be irrelevant to you, but I am hoping to dig into pykeops deeply enough to write some manual overrides that make this work in a multi-GPU context with pytorch_lightning. To that end, do you have any suggestions for things to try first? I don't yet know which part of pykeops is the likely culprit.

Error message:

```
ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 133, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1102, in _run
    self.strategy.setup_environment()
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 130, in setup_environment
    self.accelerator.setup_environment(self.root_device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/accelerators/cuda.py", line 45, in setup_environment
    torch.cuda.set_device(root_device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
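
For what it's worth, the last line of the traceback points at the multiprocessing start method. The obvious thing to try is forcing 'spawn' before Lightning launches its workers, roughly like this (the strategy name comes from the Lightning docs; I have not confirmed that this resolves the conflict with pykeops):

```python
import torch.multiprocessing as mp
from pytorch_lightning import Trainer

# Ask for the 'spawn' start method before any CUDA work happens in the
# parent process, as the RuntimeError above suggests.
if mp.get_start_method(allow_none=True) != "spawn":
    mp.set_start_method("spawn", force=True)

# Lightning also exposes a spawn-based launcher directly.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
```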
joanglaunes commented 1 year ago

Hello @mvinyard,
Actually there is no reference to keops or pykeops in your error stack, so it is really hard to tell right now. Maybe you could post a minimal example? Otherwise, my first thought is that this problem might not be easy to solve, and may have to do with the way we handle devices in the nvrtc code. This is done in the file keopscore/binders/nvrtc/keops_nvrtc.cpp, around line 290 and below. There we use CUDA driver functions to handle devices, modules, contexts, etc., which might conflict with what PyTorch Lightning does in the background.
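
For example, a script roughly along these lines (the toy model, sizes, and loss settings here are only placeholders) would already let us try to reproduce the crash:

```python
import pytorch_lightning as pl
import torch
from geomloss import SamplesLoss


class SinkhornToy(pl.LightningModule):
    # Placeholder module: the only non-standard piece is the SamplesLoss call,
    # which may dispatch to pykeops on the GPU depending on the backend.
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(2, 2)
        self.loss = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return self.loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    x, y = torch.randn(1024, 2), torch.randn(1024, 2)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x, y), batch_size=128
    )
    trainer = pl.Trainer(accelerator="gpu", devices=2, max_epochs=1)
    trainer.fit(SinkhornToy(), loader)
```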