getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Installation error with pykeops==2.1 but works with pykeops==1.5 #265

Open albertfgu opened 1 year ago

albertfgu commented 1 year ago

I'm in a fresh conda environment with the following versions:

❯ python --version
Python 3.9.12

❯ pip list | grep torch
torch                   1.11.0
torchaudio              0.11.0
torchmetrics            0.9.3
torchtext               0.12.0
torchvision             0.12.0

❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Installing pykeops==1.5 works fine. However, on upgrading to pykeops==2.1, I am unable to import the package at all:

❯ python
>>> import pykeops
[pyKeOps] Compiling nvrtc binder for python ... /usr/bin/ld: warning: /home/albertgu/disk/miniconda3/envs/state-spaces/lib/libstdc++.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
/usr/bin/ld: warning: /home/albertgu/disk/miniconda3/envs/state-spaces/lib/libstdc++.so: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/albertgu/disk/miniconda3/envs/state-spaces/lib/python3.9/site-packages/pykeops/__init__.py", line 43, in <module>
    compile_jit_binary()
  File "/home/albertgu/disk/miniconda3/envs/state-spaces/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 82, in compile_jit_binary
    KeOps_OS_Run(compile_command)
  File "/home/albertgu/disk/miniconda3/envs/state-spaces/lib/python3.9/site-packages/keopscore/utils/misc_utils.py", line 41, in KeOps_OS_Run
    KeOps_Error("Error compiling formula.")
  File "/home/albertgu/disk/miniconda3/envs/state-spaces/lib/python3.9/site-packages/keopscore/utils/misc_utils.py", line 28, in KeOps_Error
    raise ValueError(message)
ValueError: [KeOps] Error : Error compiling formula. (error at line 41 in file /home/albertgu/disk/miniconda3/envs/state-spaces/lib/python3.9/site-packages/keopscore/utils/misc_utils.py)

Thank you for the wonderful package and the continued support!

albertfgu commented 1 year ago

In https://github.com/getkeops/keops/issues/238, I also mentioned some earlier issues with pykeops==2.x. Note that these two issues are with two different environments which are printing different error messages, but I wonder if there's an underlying issue that causes pykeops==1.5 to work fine but pykeops==2.x to be difficult to install.

I closed that other issue to consolidate everything into this thread. Below, I respond to some of the points raised there (each point is quoted first, followed by my reply):

I see several hypotheses for your issue, one of them being that your CUDA 10.2 folder does not contain the development headers. Notably, KeOps expects to find nvrtc.h and cuda.h in $CUDA_PATH/include. If this fails, we also try:

  • /opt/cuda/include/, /opt/cuda/targets/x86_64-linux/include/,
  • /usr/local/cuda/include/, /usr/local/cuda/targets/x86_64-linux/include/,
  • /usr/local/cuda-10.2/include/, /usr/local/cuda-10.2/targets/x86_64-linux/include/.

(The code for this is available here.) Does your CUDA 10.2 installation contain these files?

In both environments, I set export CUDA_PATH=/usr/local/cuda and confirmed that nvrtc.h and cuda.h are present under $CUDA_PATH/include. Both still produced the same errors.
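The header check described above can be scripted. The following is a minimal sketch, not the actual KeOps lookup code: the function name `find_cuda_headers` is hypothetical, and the candidate directories are simply the ones listed in the quoted reply, with `$CUDA_PATH/include` tried first.

```python
import os
from pathlib import Path

# Fallback include directories, copied from the list quoted above.
FALLBACK_DIRS = [
    "/opt/cuda/include/",
    "/opt/cuda/targets/x86_64-linux/include/",
    "/usr/local/cuda/include/",
    "/usr/local/cuda/targets/x86_64-linux/include/",
    "/usr/local/cuda-10.2/include/",
    "/usr/local/cuda-10.2/targets/x86_64-linux/include/",
]
REQUIRED_HEADERS = ("nvrtc.h", "cuda.h")


def find_cuda_headers(cuda_path=None, fallback_dirs=FALLBACK_DIRS):
    """Return the first directory that contains every required header, or None.

    Hypothetical helper for illustration; it is not part of pykeops.
    """
    candidates = []
    if cuda_path:
        candidates.append(Path(cuda_path) / "include")
    candidates.extend(Path(d) for d in fallback_dirs)
    for directory in candidates:
        if all((directory / header).is_file() for header in REQUIRED_HEADERS):
            return directory
    return None


if __name__ == "__main__":
    # Report which directory (if any) provides both headers on this machine.
    print("CUDA headers found in:", find_cuda_headers(os.environ.get("CUDA_PATH")))
```

A result of `None` would support the missing-headers hypothesis; any other result only shows that the files exist, not that KeOps uses that directory.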

Alternatively, there may be a mis-match between the concurrent versions of CUDA that are present on your system: the PTX in CUDA_ERROR_INVALID_PTX refers to the intermediate representation that is used by the CUDA compiler. What may be happening here is that KeOps somehow used your CUDA v11.1 compiler to produce the PTX, and then used CUDA v10.2 to compile or access it, resulting in this error.

I'm not quite sure what this means, to be honest. Is there a way to dig deeper into whether this is causing the issue?
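One way to start digging into the mismatch hypothesis is to compare the CUDA release reported by `nvcc` with the one PyTorch was built against. The sketch below is only a starting point: `parse_nvcc_release` and `report_cuda_versions` are hypothetical helpers, `torch.version.cuda` is the version string exposed by PyTorch, and nothing here inspects which toolkit KeOps itself picks up.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'release X.Y' number from `nvcc --version` output.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def report_cuda_versions():
    """Print the CUDA release seen by nvcc and by PyTorch, when available."""
    nvcc_release = None
    if shutil.which("nvcc"):
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=False
        )
        nvcc_release = parse_nvcc_release(result.stdout)
    print("nvcc release:", nvcc_release)

    try:
        import torch  # optional: only needed for the comparison
    except ImportError:
        print("torch is not installed; skipping the comparison")
        return
    print("torch built for CUDA:", torch.version.cuda)
    if nvcc_release and torch.version.cuda and nvcc_release != torch.version.cuda:
        print("Mismatch: more than one CUDA version is in play")
```

Calling `report_cuda_versions()` in each failing environment shows whether more than one CUDA release is visible, which is the situation the quoted hypothesis describes.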

jeanfeydy commented 1 year ago

Hi @albertfgu, Thanks for your kind words!

In the first environment, are you running import pykeops in a terminal or in a Jupyter notebook cell? In the latter case, the KeOps C++ backend may print additional error messages in the terminal that is running the Jupyter server (we should redirect stderr for this, but haven't done it yet...). These messages may be helpful.

In any case, I notice that you get warnings related to libstdc++.so (= the C++ standard library), which remind me of the compatibility bug that we fixed here by hand in our official Dockerfile. In a nutshell: when I last checked in early July 2022, conda was shipping a version of libstdc++ that is older than the one in Ubuntu 22.04, and this small mismatch causes a lot of compatibility bugs. To check this, could you run something like:

ls /usr/lib/x86_64-linux-gnu/libstdc++*
ls /path/to/conda/lib/libstdc++*

and let us know the result? For reference, in our official Docker image, these commands show that Ubuntu 22.04 currently ships libstdc++.so.6.0.30 while conda ships libstdc++.so.6.0.28. The manual fix:

rm /opt/conda/lib/libstdc++.so.6
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /opt/conda/lib/libstdc++.so.6

removes the link from /opt/conda/lib/libstdc++.so.6 to /opt/conda/lib/libstdc++.so.6.0.28 and replaces it with a link to /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (= /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30).
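Before replacing the link, it may be worth confirming that conda's copy really is the older one. The following is a minimal sketch of that comparison, reading versions from file names such as libstdc++.so.6.0.30; `newest_libstdcxx` and `conda_is_older` are hypothetical helper names, not part of KeOps or conda.

```python
import re
from pathlib import Path

_VERSION_RE = re.compile(r"^libstdc\+\+\.so\.(\d+(?:\.\d+)*)$")


def newest_libstdcxx(directory):
    """Return the highest libstdc++ version found in `directory` as a tuple.

    Versions are read from file names such as 'libstdc++.so.6.0.30'.
    Hypothetical helper for illustration; returns None if nothing matches.
    """
    versions = []
    for path in Path(directory).glob("libstdc++.so.*"):
        match = _VERSION_RE.match(path.name)
        if match:
            versions.append(tuple(int(part) for part in match.group(1).split(".")))
    return max(versions, default=None)


def conda_is_older(system_dir, conda_dir):
    """Return True if conda's newest libstdc++ is older than the system's."""
    system = newest_libstdcxx(system_dir)
    conda = newest_libstdcxx(conda_dir)
    if system is None or conda is None:
        return False  # nothing to compare
    return conda < system
```

For example, `conda_is_older("/usr/lib/x86_64-linux-gnu", "/opt/conda/lib")` compares the two directories used in the commands above. If it returns `False`, the symlink would point conda at an older or equal library, so the fix is unlikely to apply.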

What do you think? This may also be helpful for the second environment.

Best regards, Jean

albertfgu commented 1 year ago

I'm not running anything in a notebook; everything is in terminal.

On one of my environments, these are the outputs of the commands:

❯ ls /usr/lib/x86_64-linux-gnu/libstdc++*
/usr/lib/x86_64-linux-gnu/libstdc++.so.6  /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
❯ ls /dfs/scratch1/albertgu/anaconda3/lib/libstdc++*
/dfs/scratch1/albertgu/anaconda3/lib/libstdc++.so
/dfs/scratch1/albertgu/anaconda3/lib/libstdc++.so.6
/dfs/scratch1/albertgu/anaconda3/lib/libstdc++.so.6.0.26

I followed the symlink suggestion, upgraded from pykeops==1.5 to 2.1, and got the same error:

❯ python
Python 3.8.12 (default, Oct 12 2021, 13:49:34)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pykeops
>>> pykeops.clean_pykeops()
[KeOps] /dfs/scratch1/albertgu/.cache/keops2.1/build has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
>>> pykeops.test_torch_bindings()
[KeOps] Generating code for formula Sum_Reduction((Var(0,3,0)-Var(1,3,1))|(Var(0,3,0)-Var(1,3,1)),1) ... OK

[KeOps] error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_PTX

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/torch/test_install.py", line 21, in test_torch_bindings
    my_conv(x, y).view(-1), torch.tensor(expected_res).type(torch.float32)
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 624, in __call__
    out = GenredAutograd.apply(
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/torch/generic/generic_red.py", line 78, in forward
    myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/keopscore/utils/Cache.py", line 68, in __call__
    obj = self.cls(*args)
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 15, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 31, in __init__
    self.init_phase2()
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 23, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.
>>>

I didn't test the other environment that has issues but suspect it would still have the same problem.

Again, I only get these issues with pykeops 2.0 or later. As a sanity check, are these suggested fixes specific to pykeops==2.x? I don't know much about the internals here, but the suggestions seem quite general (about Linux/conda rather than KeOps), and I'm not sure they address whatever changed in version 2.0.

i404788 commented 1 year ago

For anyone struggling with this on conda: the trick was to set CUDA_PATH to the root of your conda env, assuming you have pytorch-gpu and cudatoolkit installed (and possibly the other CUDA packages from the nvidia channel).

Example:

$ export CUDA_PATH=/opt/mambaforge/envs/base
$ python -c 'import pykeops; pykeops.test_torch_bindings()'
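
The same workaround can be applied from inside Python. This is a minimal sketch under two assumptions: that the active env root is available in `CONDA_PREFIX` (which `conda activate` sets), and that pykeops reads `CUDA_PATH` when it is imported, so the variable must be set first. The helper name `point_cuda_path_at_conda_env` is hypothetical.

```python
import os


def point_cuda_path_at_conda_env(environ=os.environ):
    """Set CUDA_PATH to the active conda env root and return that root.

    Relies on CONDA_PREFIX, which `conda activate` sets. Returns None and
    leaves the environment untouched when no conda env is active.
    Hypothetical helper for illustration; it is not part of pykeops.
    """
    prefix = environ.get("CONDA_PREFIX")
    if prefix:
        environ["CUDA_PATH"] = prefix
    return prefix


# Usage (assumption: must run before pykeops is imported):
#   point_cuda_path_at_conda_env()
#   import pykeops
#   pykeops.test_torch_bindings()
```

Setting the variable in the shell, as in the example above, remains the simpler option; the Python version is mainly useful in scripts or notebooks where the shell environment is hard to control.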

miRemid commented 1 year ago

@i404788 useful

albertfgu commented 1 year ago

That still doesn't work for me in all of my environments. However, the most recent versions of pykeops and keopscore did fix the other environment that had been failing.