Open mvinyard opened 1 year ago
Hello @mvinyard,

Actually, there is no reference to keops or pykeops in your error stack, so it is really hard to tell right now. Maybe you could post a minimal example?

Otherwise, my first thought is that this problem may not be easy to solve and has to do with the way we handle devices in the nvrtc code. This is done in the file `keopscore/binders/nvrtc/keops_nvrtc.cpp`, around line 290 and below. There we use CUDA driver functions to handle devices, modules, contexts, etc. This might conflict with what PyTorch Lightning does in the background.
I am working with models in PyTorch Lightning. I'm using the `geomloss.SamplesLoss` function from @jeanfeydy. A single GPU works without issue. Unfortunately, importing any portion of the `pykeops` package seems to cause the following error (see below) when using more than a single GPU.

Within models built using `pytorch_lightning` - which, for the uninitiated, is simply a subclass of `torch.nn.Module` - device assignments are handled automatically. Thus, external / conflicting device assignments throw the error I show below. While this error must boil down to the way `torch.Tensor` objects are placed onto the GPU within `pykeops`, it's not immediately obvious to me how I might override/edit things to fix this. For example, wherever my own code assigns a tensor to a device explicitly, I can simply replace that with the corresponding `lightning` device assignment.

These parts may be irrelevant to you, but I am hoping to dig deeper into `pykeops` so that I might be able to write some manual overrides to make this work within a multi-GPU context using `pytorch_lightning`. To that end, I am wondering if you have any suggestions for things to try first - meaning, I don't really know what within `pykeops` is perhaps the culprit.

Error message:
```
ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 133, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1102, in _run
    self.strategy.setup_environment()
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 130, in setup_environment
    self.accelerator.setup_environment(self.root_device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/accelerators/cuda.py", line 45, in setup_environment
    torch.cuda.set_device(root_device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
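The `RuntimeError` at the bottom of the traceback is informative on its own: CUDA was initialized once in the parent process, and a forked worker then inherited that state, which CUDA forbids. The stdlib-only sketch below (no torch, pykeops, or GPU required; all names are mine, not from any library) mimics the situation: a module-level dict stands in for the CUDA runtime, a forked child inherits it from the parent, while a spawned child re-imports the module and starts fresh.

```python
import multiprocessing as mp
import os

# Module-level state: a stand-in for the CUDA runtime that torch /
# pykeops initialize once in the parent process (e.g. at import time,
# or the first time a tensor touches the GPU).
STATE = {"owner_pid": os.getpid()}

def child_sees_parent_state(q):
    # True if this process inherited STATE from its parent rather
    # than re-initializing it on a fresh import of the module.
    q.put(STATE["owner_pid"] == os.getppid())

def run_child(start_method):
    # Launch one worker under the given start method ("fork" or
    # "spawn") and report whether it inherited the parent's state.
    ctx = mp.get_context(start_method)
    q = ctx.Queue()
    p = ctx.Process(target=child_sees_parent_state, args=(q,))
    p.start()
    p.join()
    return q.get()
```

Under `"fork"`, `run_child` returns `True`: the child sees the parent's already-initialized state, which is the situation the traceback forbids for CUDA. Under `"spawn"`, it should return `False`, because the child re-executes the module and builds its own state; this is why the error message (and Lightning's spawn-based launchers, e.g. `strategy="ddp_spawn"`) insist on the spawn start method before any CUDA work happens.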