getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

pykeops breaks multi-GPU training when using PyTorch Lightning #272

Open mvinyard opened 1 year ago

mvinyard commented 1 year ago

I am working with models in PyTorch Lightning and using the geomloss.SamplesLoss function from @jeanfeydy. Training on a single GPU works without issue. Unfortunately, importing any portion of the pykeops package seems to cause the error below when training on more than one GPU.

Within models built using pytorch_lightning (a LightningModule, for the uninitiated, is simply a subclass of torch.nn.Module), device assignments are handled automatically, so external or conflicting device assignments throw the error shown below. While this error must boil down to the way torch.Tensor objects are placed onto the GPU within pykeops, it's not immediately obvious to me how I might override or edit things to fix it. For example, if I have the following snippet in my own code:

my_tensor.to("cuda:0")

I can simply replace this with the corresponding lightning device assignment:

my_tensor.to(model.device)
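
For concreteness, here is the pattern I mean, sketched on a toy LightningModule (the module, shapes, and names are made up, not my actual model):

```python
import pytorch_lightning as pl
import torch


class ToyModule(pl.LightningModule):
    """Hypothetical module illustrating Lightning's automatic device handling."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Lightning has already moved the batch and the parameters to the
        # right GPU for this process; any extra tensor I create goes to
        # self.device rather than a hard-coded "cuda:0".
        noise = torch.randn_like(x, device=self.device)
        return torch.nn.functional.mse_loss(self.linear(x + noise), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)
```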

These details may be irrelevant to you, but I am hoping to dig into pykeops deeply enough to write some manual overrides that make this work in a multi-GPU context with pytorch_lightning. To that end, do you have any suggestions for things to try first? I don't yet know which part of pykeops is the likely culprit.

Error message:

```
ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 133, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1102, in _run
    self.strategy.setup_environment()
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 130, in setup_environment
    self.accelerator.setup_environment(self.root_device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/pytorch_lightning/accelerators/cuda.py", line 45, in setup_environment
    torch.cuda.set_device(root_device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/mvinyard/.anaconda3/envs/sdq/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
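
For what it's worth, the last line of the traceback points at the multiprocessing start method. The obvious thing to try is forcing 'spawn' before Lightning launches its workers, roughly like this (the strategy name comes from the Lightning docs; I have not confirmed that this resolves the conflict with pykeops):

```python
import torch.multiprocessing as mp
from pytorch_lightning import Trainer

# Ask for the 'spawn' start method before any CUDA work happens in the
# parent process, as the RuntimeError above suggests.
if mp.get_start_method(allow_none=True) != "spawn":
    mp.set_start_method("spawn", force=True)

# Lightning also exposes a spawn-based launcher directly.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
```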
joanglaunes commented 1 year ago

Hello @mvinyard,
Actually there is no reference to keops or pykeops in your error stack, so it is really hard to tell right now. Maybe you could post a minimal example? Otherwise, my first thought is that this problem might not be easy to solve, and may have to do with the way we handle devices in the nvrtc code. This is done in the file keopscore/binders/nvrtc/keops_nvrtc.cpp, around line 290 and below. There we use CUDA driver functions to handle devices, modules, contexts, etc., which might conflict with what PyTorch Lightning does in the background.
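
For example, a script roughly along these lines (the toy model, sizes, and loss settings here are only placeholders) would already let us try to reproduce the crash:

```python
import pytorch_lightning as pl
import torch
from geomloss import SamplesLoss


class SinkhornToy(pl.LightningModule):
    # Placeholder module: the only non-standard piece is the SamplesLoss call,
    # which may dispatch to pykeops on the GPU depending on the backend.
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(2, 2)
        self.loss = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return self.loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    x, y = torch.randn(1024, 2), torch.randn(1024, 2)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x, y), batch_size=128
    )
    trainer = pl.Trainer(accelerator="gpu", devices=2, max_epochs=1)
    trainer.fit(SinkhornToy(), loader)
```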