getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Unable to compile formula with error: CUDA_ERROR_INVALID_DEVICE #239

Closed: mbanani closed this issue 2 years ago

mbanani commented 2 years ago

Hello,

Thank you very much for the excellent library. It has really helped me with my research. I recently moved to a new compute cluster, though, and I can't get it to work there. I would appreciate any advice.

Some specs: Python 3.9, GCC 11.2, CUDA 11.3, CMake 3.22.1, NVIDIA A40 GPUs (driver version 495.44).

I installed pykeops 2.0 with python -m pip install pykeops, and got the following errors when running the two installation tests:

$ python -c "import pykeops; pykeops.clean_pykeops(); pykeops.test_numpy_bindings()"
[KeOps] /home/mbanani/.cache/keops2.0/build_CUDA_VISIBLE_DEVICES_0 has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
[KeOps] Generating code for formula Sum_Reduction((Var(0,3,0)-Var(1,3,1))|(Var(0,3,0)-Var(1,3,1)),1) ... OK

[KeOps] error: cuDevicePrimaryCtxRetain(&ctx, cuDevice) failed with error CUDA_ERROR_INVALID_DEVICE

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/numpy/test_install.py", line 20, in test_numpy_bindings
    if np.allclose(my_conv(x, y).flatten(), expected_res):
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/numpy/generic/generic_red.py", line 303, in __call__
    self.myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/keopscore/utils/Cache.py", line 68, in __call__
    obj = self.cls(*args)
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 15, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 31, in __init__
    self.init_phase2()
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 23, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.
$ python -c "import pykeops; pykeops.clean_pykeops(); pykeops.test_torch_bindings()"
[KeOps] /home/mbanani/.cache/keops2.0/build_CUDA_VISIBLE_DEVICES_0 has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
[KeOps] Generating code for formula Sum_Reduction((Var(0,3,0)-Var(1,3,1))|(Var(0,3,0)-Var(1,3,1)),1) ... OK

[KeOps] error: cuDevicePrimaryCtxRetain(&ctx, cuDevice) failed with error CUDA_ERROR_INVALID_DEVICE

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/torch/test_install.py", line 21, in test_torch_bindings
    my_conv(x, y).view(-1), torch.tensor(expected_res).type(torch.float32)
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/torch/generic/generic_red.py", line 624, in __call__
    out = GenredAutograd.apply(
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/torch/generic/generic_red.py", line 78, in forward
    myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/keopscore/utils/Cache.py", line 68, in __call__
    obj = self.cls(*args)
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 15, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/home/mbanani/miniconda3/envs/pykeops/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 31, in __init__
    self.init_phase2()
  File "/home/mbanani/miniconda3/envs/gencon/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 23, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.

I would really appreciate any advice on what might be causing this. Thank you.

jeanfeydy commented 2 years ago

Hi @mbanani,

Thanks for your kind words, and apologies for the long response time. I hope that you were able to find a workaround over the last few weeks :-/

To help us diagnose the problem, could you share the output of nvidia-smi in your environment, along with information about your GPU devices as seen by torch.cuda?
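
For reference, the information requested above could be gathered with something like the following. This is only a minimal sketch: the helper names `describe_cuda_devices` and `nvidia_smi_output` are hypothetical, and it assumes that PyTorch may or may not be installed and that nvidia-smi may or may not be on the PATH.

```python
import os
import shutil
import subprocess


def describe_cuda_devices():
    """Return a list of text lines describing the GPUs that torch.cuda can see."""
    # The KeOps cache folder in the log above is named after this variable,
    # so its value is worth reporting too.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
    lines = [f"CUDA_VISIBLE_DEVICES={visible}"]
    try:
        import torch
    except ImportError:
        lines.append("torch: not installed")
        return lines
    lines.append(f"torch {torch.__version__}, built for CUDA {torch.version.cuda}")
    lines.append(f"torch.cuda.is_available() = {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            lines.append(
                f"device {i}: {props.name}, "
                f"compute capability {props.major}.{props.minor}"
            )
    return lines


def nvidia_smi_output():
    """Return the text printed by nvidia-smi, or a note if it is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "nvidia-smi: not found on PATH"
    result = subprocess.run([exe], capture_output=True, text=True)
    return result.stdout or result.stderr


if __name__ == "__main__":
    print("\n".join(describe_cuda_devices()))
    print(nvidia_smi_output())
```

Running this inside the same conda environment and job allocation as the failing test matters, since the set of visible devices can differ between a login node and a compute node.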

@bcharlier: I remember that you told me that you encountered a problem with a specific GPU (A10?) + CUDA version. Could this be related to the same issue?

Best regards, Jean

mbanani commented 2 years ago

Hi @jeanfeydy, thank you for the response, and sorry for my late reply. I am not sure what happened, but I just did a fresh install of pykeops in a new conda environment and everything now works fine. My apologies for the confusion.
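
After a fresh install like the one described above, the two checks from the original report can be repeated from Python. The sketch below is an assumption about how one might wrap them: the helper name `rerun_pykeops_checks` is hypothetical, while `clean_pykeops`, `test_numpy_bindings` and `test_torch_bindings` are the pykeops functions already used in this thread.

```python
import importlib.util


def rerun_pykeops_checks():
    """Clear the KeOps cache, then rerun the installation tests from this issue."""
    if importlib.util.find_spec("pykeops") is None:
        return "pykeops is not installed in this environment"

    import pykeops

    # Remove previously compiled binaries so that nothing stale is reused.
    pykeops.clean_pykeops()
    pykeops.test_numpy_bindings()

    # The torch test only makes sense if PyTorch is installed.
    if importlib.util.find_spec("torch") is None:
        return "NumPy check ran; torch is not installed, so the torch check was skipped"

    pykeops.test_torch_bindings()
    return "NumPy and torch checks ran; see the [KeOps] messages printed above"


if __name__ == "__main__":
    print(rerun_pykeops_checks())
```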