Open ZkzMMDC opened 7 months ago
I'm encountering the same CUDA_ERROR_INVALID_IMAGE error when running KeOps with the PyTorch bindings. Below are the steps to reproduce the error, the full error message, and my system information.
Steps to Reproduce:
import pykeops
pykeops.clean_pykeops()
pykeops.test_torch_bindings()
Error Message:
[KeOps] /root/.cache/keops2.2.2/Linux_autodl-container-758f438c9a-33381152_5.15.0-91-generic_p3.8.18 has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
[KeOps] Generating code for Sum_Reduction reduction (with parameters 1) of formula Sum((a-b)**2) with a=Var(0,3,0), b=Var(1,3,1) ... OK
[KeOps] error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_IMAGE
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/test_install.py", line 21, in test_torch_bindings
    my_conv(x, y).view(-1), torch.tensor(expected_res).type(torch.float32)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 687, in __call__
    out = GenredAutograd_fun(params, *args)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 271, in forward
    outputs = GenredAutograd_base._forward(*inputs)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 91, in _forward
    myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/keopscore/utils/Cache.py", line 91, in __call__
    obj = self.cls(*args)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 16, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 31, in __init__
    self.init_phase2()
  File "/root/miniconda3/envs/test/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 29, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.
The above error suggests there might be an issue with the CUDA image. I've made sure to clean the KeOps cache before testing the torch bindings.
System Information:
nvcc -V
Full Conda List:
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
blas 1.0 mkl conda-forge
brotli-python 1.1.0 py38h17151c0_1 conda-forge
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2024.2.2 hbcca054_0 conda-forge
certifi 2024.2.2 pyhd8ed1ab_0 conda-forge
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
cudatoolkit 11.6.2 hfc3e2af_13 conda-forge
ffmpeg 4.3 hf484d3e_0 pytorch
freetype 2.10.4 h0708190_1 conda-forge
gmp 6.3.0 h59595ed_1 conda-forge
gnutls 3.6.13 h85f3911_1 conda-forge
icu 73.2 h59595ed_0 conda-forge
idna 3.6 pyhd8ed1ab_0 conda-forge
intel-openmp 2023.1.0 hdb19cb5_46306 defaults
jbig 2.1 h7f98852_2003 conda-forge
jpeg 9e h0b41bf4_3 conda-forge
keopscore 2.2.2 pypi_0 pypi
lame 3.100 h166bdaf_1003 conda-forge
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.38 h1181459_1 defaults
lerc 2.2.1 h9c3ff4c_0 conda-forge
libblas 3.9.0 1_h86c2bf4_netlib conda-forge
libcblas 3.9.0 5_h92ddd45_netlib conda-forge
libdeflate 1.7 h7f98852_5 conda-forge
libffi 3.4.4 h6a678d5_0 defaults
libgcc-ng 13.2.0 h807b86a_5 conda-forge
libgfortran-ng 13.2.0 h69a702a_5 conda-forge
libgfortran5 13.2.0 ha4646dd_5 conda-forge
libgomp 13.2.0 h807b86a_5 conda-forge
libhwloc 2.9.1 hd6dc26d_0 conda-forge
libiconv 1.17 hd590300_2 conda-forge
liblapack 3.9.0 5_h92ddd45_netlib conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libstdcxx-ng 13.2.0 h7e041cc_5 conda-forge
libtiff 4.3.0 hf544144_1 conda-forge
libwebp-base 1.3.2 hd590300_0 conda-forge
libxml2 2.10.4 hf1b16e4_1 defaults
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
mkl 2023.1.0 h213fc3f_46344 defaults
ncurses 6.4 h6a678d5_0 defaults
nettle 3.6 he412f7d_0 conda-forge
numpy 1.24.4 py38h59b608b_0 conda-forge
olefile 0.47 pyhd8ed1ab_0 conda-forge
openh264 2.1.1 h780b84a_0 conda-forge
openjpeg 2.4.0 hb52868f_1 conda-forge
openssl 3.2.1 hd590300_0 conda-forge
pillow 8.2.0 py38ha0e1e83_1 conda-forge
pip 23.3.1 py38h06a4308_0 defaults
pybind11 2.11.1 pypi_0 pypi
pykeops 2.2.2 pypi_0 pypi
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.8.18 h955ad1f_0 defaults
python_abi 3.8 2_cp38 conda-forge
pytorch 1.12.0 py3.8_cuda11.6_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
readline 8.2 h5eee18b_0 defaults
requests 2.31.0 pyhd8ed1ab_0 conda-forge
setuptools 68.2.2 py38h06a4308_0 defaults
sqlite 3.41.2 h5eee18b_0 defaults
tbb 2021.9.0 hf52228f_0 conda-forge
tk 8.6.12 h1ccaba5_0 defaults
torchvision 0.13.0 py38_cu116 pytorch
typing_extensions 4.10.0 pyha770c72_0 conda-forge
urllib3 2.2.1 pyhd8ed1ab_0 conda-forge
wheel 0.41.2 py38h06a4308_0 defaults
xz 5.4.6 h5eee18b_0 defaults
zlib 1.2.13 h5eee18b_0 defaults
zstd 1.5.0 ha95c52a_0 conda-forge
Has anyone else experienced a similar issue or can provide insight into what might be causing the CUDA_ERROR_INVALID_IMAGE error with KeOps on an RTX 4090 GPU?
Hi @ZkzMMDC, @wang-jh18-SVM,
Thanks for your interest in our library, and for the detailed reports. I don't have an RTX 4090 at hand to try this myself, but I'll be very surprised if this turns out to be a hardware issue. KeOps runs fine on all the Nvidia GPUs that we've had access to since 2017, and it does not rely on niche instruction sets.
As far as I can tell, the most likely issue here is that in @wang-jh18-SVM's report, cudatoolkit==11.6.2 while nvcc==1.8: the vast majority of KeOps installation issues happen on systems where several versions of CUDA are available, and we somehow mix up the paths.
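For anyone who wants to check for this kind of mismatch quickly, here is a minimal sketch that compares the release reported by the first nvcc on the PATH with the CUDA version PyTorch was built against. It is not part of KeOps: the helper names (parse_nvcc_release, nvcc_release, torch_cuda_version, same_major_minor) are made up for illustration, and it only looks at nvcc and torch, not at which toolkit KeOps itself ends up loading.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract the 'X.Y' release number from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def nvcc_release():
    """Release of the first nvcc found on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """CUDA version PyTorch was built against, or None (no torch / CPU build)."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


def same_major_minor(a, b):
    """True if both version strings are known and agree on major.minor."""
    if a is None or b is None:
        return False
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    nvcc, torch_cuda = nvcc_release(), torch_cuda_version()
    print("nvcc on PATH:", nvcc)
    print("torch built for CUDA:", torch_cuda)
    print("match:", same_major_minor(nvcc, torch_cuda))
```

A "match: False" line does not prove that this is the cause of CUDA_ERROR_INVALID_IMAGE, but it is a cheap hint that several CUDA installations are visible at once.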
Could you maybe:
- Install nvcc == 11.6.2 in your conda environment, maybe via e.g. conda install -y -c nvidia/label/cuda-11.6.2 cuda? Our Dockerfile provides a good reference for a fully functional setup.
- Try our Docker image (docker pull getkeops/keops-full:latest), just to make sure that this is not a hardware problem?
Best regards, Jean
I would recommend using CUDA >= 12 with recent hardware (though I am not sure it fixes this particular issue). The CUDA version used by KeOps can be different from the one used by torch. You may install CUDA locally and force KeOps to use it by setting the environment variable 'CUDA_PATH'.
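As a rough sketch of that last suggestion, the helper below builds an environment in which CUDA_PATH points at a single CUDA installation. The function name env_with_cuda and the path /usr/local/cuda-12.1 are hypothetical, and prepending the toolkit's bin directory to PATH is my own assumption (so that the matching nvcc is found first), not something KeOps is known to require.

```python
import os


def env_with_cuda(cuda_root, base_env=None):
    """Return a copy of the environment that points at one CUDA install."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable named in the comment above; KeOps is said to honour it.
    env["CUDA_PATH"] = cuda_root
    # Assumption: also put <cuda_root>/bin first on PATH.
    bin_dir = os.path.join(cuda_root, "bin")
    env["PATH"] = os.pathsep.join(p for p in (bin_dir, env.get("PATH", "")) if p)
    return env


# Hypothetical usage, to run before pykeops is imported for the first time:
#
#   os.environ.update(env_with_cuda("/usr/local/cuda-12.1"))
#   import pykeops
#   pykeops.clean_pykeops()          # drop binaries built with the old toolkit
#   pykeops.test_torch_bindings()
```

Setting the variable before the first import and cleaning the cache afterwards seems the safest order, since previously compiled binaries would otherwise be reused.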
Same issue here; I am using the pykeops Docker image.
When I run the test pykeops.test_torch_bindings() to make sure KeOps works:
Have you solved that problem?
Hi, I still don't know why, but when I install pytorch with pip rather than conda, it works.
Python 3.8, CUDA 11.2, GPU RTX 4090
When I run the test pykeops.test_torch_bindings() to make sure KeOps works: