Cuda error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_IMAGE #361

Open ZkzMMDC opened 7 months ago

ZkzMMDC commented 7 months ago

Python 3.8, CUDA 11.2, GPU: RTX 4090

When I run the test "pykeops.test_torch_bindings()" to make sure KeOps works, I get the following error:

[KeOps] Generating code for Sum_Reduction reduction (with parameters 1) of formula Sum((a-b)**2) with a=Var(0,3,0), b=Var(1,3,1) ... OK

[KeOps] error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_IMAGE

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/pykeops/torch/", line 21, in test_torch_bindings
    my_conv(x, y).view(-1), torch.tensor(expected_res).type(torch.float32)
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/pykeops/torch/generic/", line 687, in __call__
    out = GenredAutograd_fun(params, *args)
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/pykeops/torch/generic/", line 271, in forward
    outputs = GenredAutograd_base._forward(*inputs)
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/pykeops/torch/generic/", line 91, in _forward
    myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/keopscore/utils/", line 91, in __call__
    obj = self.cls(*args)
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/pykeops/common/keops_io/", line 16, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/pykeops/common/keops_io/", line 31, in __init__
  File "/home/zkz/anaconda3/envs/robot/lib/python3.8/site-packages/pykeops/common/keops_io/", line 29, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.

wang-jh18-SVM commented 6 months ago

I'm encountering the same CUDA_ERROR_INVALID_IMAGE error when running KeOps with the PyTorch bindings. Below are the steps to reproduce the error, the full error message, and my system information.

Steps to Reproduce:

import pykeops

Error Message:

[KeOps] /root/.cache/keops2.2.2/Linux_autodl-container-758f438c9a-33381152_5.15.0-91-generic_p3.8.18 has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
[KeOps] Generating code for Sum_Reduction reduction (with parameters 1) of formula Sum((a-b)**2) with a=Var(0,3,0), b=Var(1,3,1) ... OK

[KeOps] error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_IMAGE

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/", line 21, in test_torch_bindings
    my_conv(x, y).view(-1), torch.tensor(expected_res).type(torch.float32)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/generic/", line 687, in __call__
    out = GenredAutograd_fun(params, *args)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/generic/", line 271, in forward
    outputs = GenredAutograd_base._forward(*inputs)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/generic/", line 91, in _forward
    myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/keopscore/utils/", line 91, in __call__
    obj = self.cls(*args)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/common/keops_io/", line 16, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/common/keops_io/", line 31, in __init__
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/common/keops_io/", line 29, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.

The above error suggests there might be an issue with the CUDA image. I've made sure to clean the KeOps cache before testing the torch bindings.

System Information:

Full Conda List:

Has anyone else experienced a similar issue or can provide insight into what might be causing the CUDA_ERROR_INVALID_IMAGE error with KeOps on an RTX 4090 GPU?

jeanfeydy commented 6 months ago

Hi @ZkzMMDC , @wang-jh18-SVM ,

Thanks for your interest in our library, and for the detailed reports. I don't have an RTX 4090 at hand to try this myself, but I'll be very surprised if this turns out to be a hardware issue. KeOps has run fine on all the Nvidia GPUs that we've had access to since 2017, and it does not rely on niche instruction sets.

As far as I can tell, the most likely issue here is that in @wang-jh18-SVM 's report, cudatoolkit==11.6.2 while nvcc==1.8: the vast majority of KeOps installation issues happen on systems where several versions of CUDA are available, and we somehow mix up the paths.
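To check for the kind of CUDA mix-up described above, a minimal diagnostic sketch could look like the following. It is not part of KeOps: the helper names `parse_nvcc_release` and `cuda_report` are hypothetical, and the script only assumes that `nvcc --version` prints a "release X.Y" string and that PyTorch, if installed, exposes `torch.version.cuda`.

```python
import os
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return the 'X.Y' from a 'release X.Y' string in nvcc output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def cuda_report():
    """Collect the CUDA-related settings visible from this Python process."""
    report = {
        "CUDA_PATH": os.environ.get("CUDA_PATH"),
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "nvcc_path": shutil.which("nvcc"),  # first nvcc found on PATH, if any
        "nvcc_release": None,
        "torch_cuda": None,
    }
    if report["nvcc_path"]:
        try:
            result = subprocess.run(
                [report["nvcc_path"], "--version"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            report["nvcc_release"] = parse_nvcc_release(result.stdout)
        except (OSError, subprocess.SubprocessError):
            pass  # nvcc exists but could not be executed
    try:
        import torch  # optional: only used to read the CUDA version torch was built with

        report["torch_cuda"] = torch.version.cuda
    except Exception:
        pass  # torch is missing or failed to load
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `nvcc_release` and `torch_cuda` disagree, or `nvcc_path` points outside the active conda environment, several CUDA installations are probably being mixed.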

Could you maybe:

Best regards, Jean

bcharlier commented 6 months ago

I would recommend using CUDA >= 12 with recent hardware (though I am not sure it fixes this particular issue). The CUDA version used by KeOps can differ from the one used by torch. You can install CUDA locally and force KeOps to use it by setting the environment variable 'CUDA_PATH'.
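As a rough sketch of the 'CUDA_PATH' suggestion, the variable can be set from Python before KeOps is loaded. The helper `select_cuda` and the path `/usr/local/cuda-12.1` are hypothetical. Setting the variable before importing pykeops and checking for `include/cuda.h` are assumptions made for this example, not documented requirements.

```python
import os


def select_cuda(cuda_root):
    """Set CUDA_PATH to cuda_root if it looks like a CUDA toolkit.

    Returns True when include/cuda.h exists under cuda_root and the
    environment variable was set, False otherwise.
    """
    header = os.path.join(cuda_root, "include", "cuda.h")
    if not os.path.isfile(header):
        return False
    os.environ["CUDA_PATH"] = cuda_root
    return True


# Intended usage (hypothetical path, adjust to the local CUDA >= 12 install):
#
#     if select_cuda("/usr/local/cuda-12.1"):
#         import pykeops                  # imported only after CUDA_PATH is set
#         pykeops.clean_pykeops()         # discard kernels built with the old toolkit
#         pykeops.test_torch_bindings()
```

Cleaning the cache after switching toolkits avoids reusing binaries compiled against the previous CUDA version.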

Yangr116 commented 5 months ago

Same issue here. I am using the pykeops Docker image.

soulofxin commented 5 months ago

When I run the test "pykeops.test_torch_bindings()" to make sure KeOps works:

Have you solved that problem?

wang-jh18-SVM commented 5 months ago

Hi, I still don't know why, but when I install PyTorch with pip rather than conda, it works.