Open albertfgu opened 1 year ago
In https://github.com/getkeops/keops/issues/238, I also mentioned some earlier issues with pykeops==2.x. Note that these two issues are with two different environments which are printing different error messages, but I wonder if there's an underlying issue that causes pykeops==1.5 to work fine but pykeops==2.x to be difficult to install.
I closed that other issue to consolidate these into one thread. In response to some of the points raised in that issue:
I see several hypotheses for your issue, one of them being that your CUDA 10.2 folder does not contain the development headers. Notably, KeOps expects to find nvrtc.h and cuda.h in $CUDA_PATH/include. If this fails, we also try:
- /opt/cuda/include/, /opt/cuda/targets/x86_64-linux/include/,
- /usr/local/cuda/include/, /usr/local/cuda/targets/x86_64-linux/include/,
- /usr/local/cuda-10.2/include/, /usr/local/cuda-10.2/targets/x86_64-linux/include/.
(The code for this is available here.) Does your CUDA 10.2 installation contain these files?
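For readers who want to run this check themselves, here is a minimal sketch of the lookup described above. It is not the KeOps implementation: the function name `find_cuda_include` and the search order are assumptions made for illustration, and only the directory list and the two header names come from the reply.

```python
import os
from pathlib import Path

# Fallback directories quoted in the reply above. The order in which KeOps
# itself tries them is an assumption of this sketch.
FALLBACK_DIRS = [
    "/opt/cuda/include",
    "/opt/cuda/targets/x86_64-linux/include",
    "/usr/local/cuda/include",
    "/usr/local/cuda/targets/x86_64-linux/include",
    "/usr/local/cuda-10.2/include",
    "/usr/local/cuda-10.2/targets/x86_64-linux/include",
]
REQUIRED_HEADERS = ("nvrtc.h", "cuda.h")


def find_cuda_include(cuda_path=None, fallbacks=FALLBACK_DIRS):
    """Return the first directory containing every required header, or None."""
    candidates = []
    if cuda_path:
        # $CUDA_PATH/include is tried before any fallback directory.
        candidates.append(os.path.join(cuda_path, "include"))
    candidates.extend(fallbacks)
    for directory in candidates:
        if all((Path(directory) / header).is_file() for header in REQUIRED_HEADERS):
            return directory
    return None


if __name__ == "__main__":
    # Prints the directory that satisfies the check, or None if nothing does.
    print(find_cuda_include(os.environ.get("CUDA_PATH")))
```

A result of `None` would suggest that the development headers are missing from every location listed above.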
On both environments, I tried setting export CUDA_PATH=/usr/local/cuda and checked that nvrtc.h and cuda.h are found under $CUDA_PATH/include. Both still had the same errors.
Alternatively, there may be a mis-match between the concurrent versions of CUDA that are present on your system: the PTX in CUDA_ERROR_INVALID_PTX refers to the intermediate representation that is used by the CUDA compiler. What may be happening here is that KeOps somehow used your CUDA v11.1 compiler to produce the PTX, and then used CUDA v10.2 to compile or access it, resulting in this error.
I'm not quite sure what this means, to be honest. Is there a way to dig deeper into whether this is causing the issue?
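One possible way to start digging is to compare the CUDA versions that different tools report. The sketch below is only a rough diagnostic under stated assumptions: the helper names are hypothetical, the `nvcc` found on `PATH` is not necessarily the toolkit that KeOps ends up using, and `torch.version.cuda` only reports the CUDA version that PyTorch was built with.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from the output of `nvcc --version`, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def nvcc_release(nvcc="nvcc"):
    """Run nvcc if it is on PATH and return its release tuple, else None."""
    exe = shutil.which(nvcc)
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def report():
    """Print the CUDA versions visible from this Python process."""
    print("nvcc on PATH:", nvcc_release())
    try:
        import torch  # optional: only used to read PyTorch's build-time CUDA version
        print("torch.version.cuda:", torch.version.cuda)
    except ImportError:
        print("torch is not installed")
```

Calling `report()` in each environment and comparing the two lines may show whether more than one CUDA version is in play, for instance 11.1 on one line and 10.2 on the other.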
Hi @albertfgu, Thanks for your kind words!
In the first environment, are you running the import pykeops in a terminal or in a Jupyter notebook cell? In the second case, the KeOps C++ backend may output additional error messages in the terminal that is running the Jupyter server (we should redirect the stderr for this, but haven't done it yet...). This may be helpful.
In any case, I notice that you get warnings that are related to libstdc++.so (= the C++ standard library), which remind me of the compatibility bug that we fixed here by hand in our official Dockerfile. In a nutshell: last time I checked in early July 2022, conda was shipping a version of libstdc++ which is older than that of Ubuntu 22.04, and this small mis-match causes a lot of compatibility bugs. To check this, could you run something like:
ls /usr/lib/x86_64-linux-gnu/libstdc++*
ls /path/to/conda/lib/libstdc++*
And let us know about the result? For your information, in our official Docker image, these commands allow us to see that Ubuntu 22.04 currently ships libstdc++.so.6.0.30 while conda ships libstdc++.so.6.0.28. The manual fix:
rm /opt/conda/lib/libstdc++.so.6
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /opt/conda/lib/libstdc++.so.6
lets us remove the link from /opt/conda/lib/libstdc++.so.6 to /opt/conda/lib/libstdc++.so.6.0.28 and replace it with a link to /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (= /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30).
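Before replacing the link, it may be worth checking which of the two libraries is actually newer. The following is a minimal sketch, assuming both directories contain a `libstdc++.so.6` symlink that resolves to a file named `libstdc++.so.X.Y.Z`; the function names are hypothetical.

```python
import os
import re


def libstdcxx_version(lib_dir, name="libstdc++.so.6"):
    """Resolve lib_dir/name through symlinks and return (X, Y, Z), or None."""
    target = os.path.realpath(os.path.join(lib_dir, name))
    match = re.search(r"libstdc\+\+\.so\.(\d+)\.(\d+)\.(\d+)$", target)
    return tuple(int(group) for group in match.groups()) if match else None


def conda_is_older(system_dir, conda_dir):
    """Return True if conda's libstdc++ is older than the system one.

    Returns None when either version cannot be determined.
    """
    system_version = libstdcxx_version(system_dir)
    conda_version = libstdcxx_version(conda_dir)
    if system_version is None or conda_version is None:
        return None
    return conda_version < system_version
```

For example, `conda_is_older("/usr/lib/x86_64-linux-gnu", "/opt/conda/lib")` would return `True` for the Docker image described above (6.0.28 against 6.0.30). If it returns `False`, the symlink fix would point conda at an older library, which is probably not what you want.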
What do you think? This may also be helpful for the second environment.
Best regards, Jean
I'm not running anything in a notebook; everything is in terminal.
On one of my environments, these are the outputs of the commands:
❯ ls /usr/lib/x86_64-linux-gnu/libstdc++*
/usr/lib/x86_64-linux-gnu/libstdc++.so.6 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
❯ ls /dfs/scratch1/albertgu/anaconda3/lib/libstdc++*
/dfs/scratch1/albertgu/anaconda3/lib/libstdc++.so
/dfs/scratch1/albertgu/anaconda3/lib/libstdc++.so.6
/dfs/scratch1/albertgu/anaconda3/lib/libstdc++.so.6.0.26
I followed the symlink suggestion and upgraded from keops==1.5 to 2.1, and get the same error:
❯ python
Python 3.8.12 (default, Oct 12 2021, 13:49:34)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pykeops
>>> pykeops.clean_pykeops()
[KeOps] /dfs/scratch1/albertgu/.cache/keops2.1/build has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
>>> pykeops.test_torch_bindings()
[KeOps] Generating code for formula Sum_Reduction((Var(0,3,0)-Var(1,3,1))|(Var(0,3,0)-Var(1,3,1)),1) ... OK
[KeOps] error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_PTX
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/torch/test_install.py", line 21, in test_torch_bindings
my_conv(x, y).view(-1), torch.tensor(expected_res).type(torch.float32)
File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 624, in __call__
out = GenredAutograd.apply(
File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 78, in forward
myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/keopscore/utils/Cache.py", line 68, in __call__
obj = self.cls(*args)
File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 15, in __init__
super().__init__(*args, fast_init=fast_init)
File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 31, in __init__
self.init_phase2()
File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 23, in init_phase2
self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.
>>>
I didn't test the other environment that has issues but suspect it would still have the same problem.
Again I only get these issues on keops 2.0 or later. As a sanity check, are these suggested fixes specific to keops==2.x? I certainly don't know much about the internals here, but these suggestions seem quite general (about linux/conda instead of keops) and I'm not sure if they would pertain to something that changed specifically in version 2.0.
For anyone struggling with this on conda, the trick was to set CUDA_PATH to the root of your conda env, assuming you have pytorch-gpu, cudatoolkit (and possibly the other cuda packages from the nvidia channel).
Example:
$ export CUDA_PATH=/opt/mambaforge/envs/base
$ python -c 'import pykeops; pykeops.test_torch_bindings()'
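The same thing can be sketched from inside Python. This assumes that pykeops reads CUDA_PATH when it is imported, so the variable has to be set before the import, and that the active env root is available as CONDA_PREFIX (with `sys.prefix` as a fallback); the helper names are hypothetical.

```python
import os
import sys


def conda_env_root(environ=os.environ):
    """Return the active conda env root: CONDA_PREFIX if set, else sys.prefix."""
    return environ.get("CONDA_PREFIX") or sys.prefix


def set_cuda_path(environ=os.environ):
    """Point CUDA_PATH at the conda env root, overriding any previous value."""
    environ["CUDA_PATH"] = conda_env_root(environ)
    return environ["CUDA_PATH"]


# Intended usage (assumption: CUDA_PATH must be set before pykeops is imported):
#   set_cuda_path()
#   import pykeops
#   pykeops.test_torch_bindings()
```

Note that this overrides an existing CUDA_PATH such as /usr/local/cuda, which is the point of the trick but may not suit every setup.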
@i404788 useful
That still doesn't work for me on all of my environments. The most recent version of pykeops and keopscore did work on the other of my environments that was failing though.
I'm in a fresh conda environment with the following versions:
Installing pykeops==1.5 works fine. However, on upgrading to pykeops==2.1, I am unable to import the package at all:
Thank you for the wonderful package and the continued support!