FalkonML / falkon

Large-scale, multi-GPU capable, kernel solver
https://falkonml.github.io/falkon/
MIT License
181 stars, 22 forks

RuntimeError: Not compiled with CUDA support #49

Closed: parthe closed this issue 9 months ago

parthe commented 2 years ago

    import falkon

    options = falkon.FalkonOptions(keops_active="force")
    kernel = falkon.kernels.GaussianKernel(sigma=1, opt=options)
    flk = falkon.Falkon(kernel=kernel, penalty=1e-5, M=5000, options=options)

The last line yields the following error message, which I am unable to debug. Please help.

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/models/falkon.py:132, in Falkon.__init__(self, kernel, penalty, M, center_selection, maxiter, seed, error_fn, error_every, weight_fn, options)
    130 self.maxiter = maxiter
    131 self.weight_fn = weight_fn
--> 132 self._init_cuda()
    133 self.beta_ = None

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/models/model_utils.py:70, in FalkonBase._init_cuda(self)
     68 if self.use_cuda_:
     69     torch.cuda.init()
---> 70     self.num_gpus = devices.num_gpus(self.options)

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/utils/devices.py:212, in num_gpus(opt)
    210 global __COMP_DATA
    211 if len(__COMP_DATA) == 0:
--> 212     get_device_info(opt)
    213 return len([c for c in __COMP_DATA.keys() if c >= 0])

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/utils/devices.py:200, in get_device_info(opt)
    197     return __COMP_DATA
    199 for g in range(0, tcd.device_count()):
--> 200     __COMP_DATA = _get_gpu_device_info(opt, g, __COMP_DATA)
    202 if len(__COMP_DATA) == 0:
    203     raise RuntimeError("No suitable device found. Enable option 'use_cpu' "
    204                        "if no GPU is available.")

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/utils/devices.py:92, in _get_gpu_device_info(opt, g, data_dict)
     83 # try:
     84 #     from ..cuda.cudart_gpu import cuda_meminfo
     85 # except Exception as e:
   (...)
     89 # Some of the CUDA calls in here may change the current device,
     90 # this ensures it gets reset at the end.
     91 with tcd.device(g):
---> 92     mem_free, mem_total = cuda_mem_get_info(g)
     93     mem_used = mem_total - mem_free
     94     # noinspection PyUnresolvedReferences

RuntimeError: Not compiled with CUDA support
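
The traceback ends in Falkon's own device probing (devices.py calling cuda_mem_get_info), not in KeOps. Until the extension is rebuilt with CUDA, one stop-gap is to catch exactly this error and construct the model on the CPU instead, via the use_cpu option named in devices.py's other error message. A minimal sketch: build_with_cpu_fallback and make_falkon are hypothetical helpers (not part of Falkon), and it assumes FalkonOptions accepts use_cpu as a keyword argument, the same way it accepts keops_active.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")

# Substring of the error raised in this thread when Falkon's compiled
# extension was built without CUDA.
NOT_COMPILED_MSG = "Not compiled with CUDA support"


def build_with_cpu_fallback(make_model: Callable[[bool], Model]) -> Tuple[Model, bool]:
    """Try a GPU model first; retry on the CPU only for the 'not compiled' error.

    `make_model(use_cpu)` is a caller-supplied factory (hypothetical helper).
    Returns the model and whether it ended up on the CPU.
    """
    try:
        return make_model(False), False
    except RuntimeError as err:
        if NOT_COMPILED_MSG not in str(err):
            raise  # a different problem: don't hide it behind a CPU fallback
        return make_model(True), True


def make_falkon(use_cpu: bool):
    """Hypothetical factory mirroring the snippet at the top of this issue."""
    import falkon  # imported lazily so this file loads without Falkon installed

    # Assumption: FalkonOptions(use_cpu=...) is accepted ('use_cpu' is the
    # option named in devices.py's error text).
    if use_cpu:
        options = falkon.FalkonOptions(use_cpu=True)
    else:
        options = falkon.FalkonOptions(keops_active="force")
    kernel = falkon.kernels.GaussianKernel(sigma=1, opt=options)
    return falkon.Falkon(kernel=kernel, penalty=1e-5, M=5000, options=options)
```

Usage would be `flk, on_cpu = build_with_cpu_fallback(make_falkon)`. Matching on the message text keeps unrelated RuntimeErrors (for example out-of-memory) visible instead of silently moving them to the CPU.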
parthe commented 2 years ago

pykeops was installed successfully on my machine (with CUDA) and tested with the command python -c "import pykeops; pykeops.test_numpy_bindings(); pykeops.test_torch_bindings()"

Giodiro commented 2 years ago

Hi @parthe, this error doesn't come from KeOps but from Falkon itself. Did you install it with python setup.py develop or pip install .?

parthe commented 2 years ago

I installed it with pip install git+https://github.com/falkonml/falkon.git, as instructed here.

parthe commented 2 years ago

When I install with python setup.py develop, the following log is printed:

No CUDA runtime is found, using CUDA_HOME='/home/$USER/.conda/envs/Falkon_ML'
running develop
running egg_info
writing falkon.egg-info/PKG-INFO
writing dependency_links to falkon.egg-info/dependency_links.txt
writing requirements to falkon.egg-info/requires.txt
writing top-level names to falkon.egg-info/top_level.txt
reading manifest file 'falkon.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'falkon.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.10/falkon/c_ext.so -> falkon
copying build/lib.linux-x86_64-3.10/falkon/la_helpers/cyblas.so -> falkon/la_helpers
Creating /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/falkon.egg-link (link to .)
Adding falkon 0.7.5 to easy-install.pth file

Installed /home/$USER/falkon_ml/falkon
Processing dependencies for falkon==0.7.5
Searching for pykeops@ git+https://github.com/getkeops/keops@ad044a671fdc3c2790b0321f6b9f9b5aa3d220df#subdirectory=pykeops
Reading https://pypi.org/simple/pykeops/
Downloading https://files.pythonhosted.org/packages/8c/9a/ae3931ca85e2a05707d07b0f1d34474939c85e2318335eadb92dd02be3b7/pykeops-2.1.tar.gz#sha256=770894e06b497d9640e04471752ee08e5d936809e571e12db1b4dea03c862457
Best match: pykeops 2.1
Processing pykeops-2.1.tar.gz
Writing /tmp/easy_install-2obkpcv5/pykeops-2.1/setup.cfg
Running pykeops-2.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-2obkpcv5/pykeops-2.1/egg-dist-tmp-_l8xu111
creating /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/pykeops-2.1-py3.10.egg
Extracting pykeops-2.1-py3.10.egg to /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Adding pykeops 2.1 to easy-install.pth file

Installed /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/pykeops-2.1-py3.10.egg
Searching for keopscore@ git+https://github.com/getkeops/keops@ad044a671fdc3c2790b0321f6b9f9b5aa3d220df#subdirectory=keopscore
Reading https://pypi.org/simple/keopscore/
Downloading https://files.pythonhosted.org/packages/e0/0b/fddeee9a4b5808e8f8bd084804d6a2996096f9a959cb0e54d9b61c5762b3/keopscore-2.1.tar.gz#sha256=15db70dda353fe6b00102b6a9043462bae89f6eea8a9be72426c089096d9d5f0
Best match: keopscore 2.1
Processing keopscore-2.1.tar.gz
Writing /tmp/easy_install-zqpfnlbl/keopscore-2.1/setup.cfg
Running keopscore-2.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-zqpfnlbl/keopscore-2.1/egg-dist-tmp-dk2cmvyg
creating /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/keopscore-2.1-py3.10.egg
Extracting keopscore-2.1-py3.10.egg to /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Adding keopscore 2.1 to easy-install.pth file

Installed /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/keopscore-2.1-py3.10.egg
Searching for psutil
Reading https://pypi.org/simple/psutil/
Downloading https://files.pythonhosted.org/packages/6d/c6/6a4e46802e8690d50ba6a56c7f79ac283e703fcfa0fdae8e41909c8cef1f/psutil-5.9.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=29a442e25fab1f4d05e2655bb1b8ab6887981838d22effa2396d584b740194de
Best match: psutil 5.9.1
Processing psutil-5.9.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing psutil-5.9.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl to /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Adding psutil 5.9.1 to easy-install.pth file

Installed /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/psutil-5.9.1-py3.10-linux-x86_64.egg
Searching for scikit-learn
Reading https://pypi.org/simple/scikit-learn/
Downloading https://files.pythonhosted.org/packages/43/bc/7130ffd49a1cf72659c61eb94d8f037bc5502c94866f407c0219d929e758/scikit_learn-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=47464c110eaa9ed9d1fe108cb403510878c3d3a40f110618d2a19b2190a3e35c
Best match: scikit-learn 1.1.1
Processing scikit_learn-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing scikit_learn-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl to /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Adding scikit-learn 1.1.1 to easy-install.pth file

Installed /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/scikit_learn-1.1.1-py3.10-linux-x86_64.egg
Searching for pybind11
Reading https://pypi.org/simple/pybind11/
Downloading https://files.pythonhosted.org/packages/9a/7f/855560aa568e50bea6012ed535e6b8c436e99394f3e5a649d44d2e557242/pybind11-2.10.0-py3-none-any.whl#sha256=6bbc7a2f79689307f0d8d240172851955fc214b33e4cbd7fdbc9cd7176a09260
Best match: pybind11 2.10.0
Processing pybind11-2.10.0-py3-none-any.whl
Installing pybind11-2.10.0-py3-none-any.whl to /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Adding pybind11 2.10.0 to easy-install.pth file
Installing pybind11-config script to /home/$USER/.conda/envs/Falkon_ML/bin

Installed /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/pybind11-2.10.0-py3.10.egg
Searching for threadpoolctl>=2.0.0
Reading https://pypi.org/simple/threadpoolctl/
Downloading https://files.pythonhosted.org/packages/61/cf/6e354304bcb9c6413c4e02a747b600061c21d38ba51e7e544ac7bc66aecc/threadpoolctl-3.1.0-py3-none-any.whl#sha256=8b99adda265feb6773280df41eece7b2e6561b772d21ffd52e372f999024907b
Best match: threadpoolctl 3.1.0
Processing threadpoolctl-3.1.0-py3-none-any.whl
Installing threadpoolctl-3.1.0-py3-none-any.whl to /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Adding threadpoolctl 3.1.0 to easy-install.pth file

Installed /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/threadpoolctl-3.1.0-py3.10.egg
Searching for joblib>=1.0.0
Reading https://pypi.org/simple/joblib/
Downloading https://files.pythonhosted.org/packages/3e/d5/0163eb0cfa0b673aa4fe1cd3ea9d8a81ea0f32e50807b0c295871e4aab2e/joblib-1.1.0-py2.py3-none-any.whl#sha256=f21f109b3c7ff9d95f8387f752d0d9c34a02aa2f7060c2135f465da0e5160ff6
Best match: joblib 1.1.0
Processing joblib-1.1.0-py2.py3-none-any.whl
Installing joblib-1.1.0-py2.py3-none-any.whl to /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Adding joblib 1.1.0 to easy-install.pth file

Installed /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages/joblib-1.1.0-py3.10.egg
Searching for numpy==1.22.3
Best match: numpy 1.22.3
Adding numpy 1.22.3 to easy-install.pth file
Installing f2py script to /home/$USER/.conda/envs/Falkon_ML/bin
Installing f2py3 script to /home/$USER/.conda/envs/Falkon_ML/bin
Installing f2py3.10 script to /home/$USER/.conda/envs/Falkon_ML/bin

Using /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Searching for scipy==1.8.1
Best match: scipy 1.8.1
Processing scipy-1.8.1-py3.10-linux-x86_64.egg
Adding scipy 1.8.1 to easy-install.pth file

Using /home/$USER/falkon_ml/falkon/.eggs/scipy-1.8.1-py3.10-linux-x86_64.egg
Searching for torch==1.12.0
Best match: torch 1.12.0
Adding torch 1.12.0 to easy-install.pth file
Installing convert-caffe2-to-onnx script to /home/$USER/.conda/envs/Falkon_ML/bin
Installing convert-onnx-to-caffe2 script to /home/$USER/.conda/envs/Falkon_ML/bin
Installing torchrun script to /home/$USER/.conda/envs/Falkon_ML/bin

Using /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Searching for typing-extensions==4.1.1
Best match: typing-extensions 4.1.1
Adding typing-extensions 4.1.1 to easy-install.pth file

Using /home/$USER/.conda/envs/Falkon_ML/lib/python3.10/site-packages
Finished processing dependencies for falkon==0.7.5

I don't know why no CUDA runtime is found in that folder. I have installed CUDA there, and the file /home/$USER/.conda/envs/Falkon_ML/include/cuda.h exists.

Giodiro commented 2 years ago

Usually you need to install CUDA system-wide; the default location of the runtime is /usr/local/cuda. To verify that CUDA is installed properly, the easiest check is PyTorch's own detection: python -c 'import torch; print(torch.cuda.is_available())'
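
A slightly broader version of that one-liner can help on a shared cluster, where the answer may differ between the node that runs the build and the node that runs the model. A minimal sketch that only reads standard PyTorch attributes (torch.version.cuda, torch.cuda.is_available, torch.cuda.device_count, torch.utils.cpp_extension.CUDA_HOME); cuda_report is a hypothetical helper name. Running it in the same shell as the build shows what the build saw.

```python
import importlib.util


def cuda_report():
    """Summarise what PyTorch reports about CUDA in this process."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch
    from torch.utils import cpp_extension

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was compiled against (None for CPU-only builds)
        "torch_built_with_cuda": torch.version.cuda,
        # Whether PyTorch can use a GPU from this process right now
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        # Toolkit prefix that PyTorch's extension builder would use
        "cuda_home": cpp_extension.CUDA_HOME,
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key:>22}: {value}")
```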

I usually install CUDA system-wide, so I'm not sure what your last comment means. You'll need both the CUDA driver and the CUDA toolkit, and the toolkit version should match the one PyTorch was compiled with. You may also need to add /usr/local/cuda/bin to your PATH environment variable.
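
The toolkit/PyTorch match mentioned above can be checked by comparing the release printed by `nvcc --version` ("Cuda compilation tools, release X.Y, ...") with torch.version.cuda. A minimal sketch; requiring the same major.minor is a deliberately conservative assumption, and nvcc_release / same_major_minor are hypothetical helper names.

```python
import re
import shutil
import subprocess


def nvcc_release(nvcc_version_text):
    """Extract 'major.minor' from `nvcc --version` output (None if absent)."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_version_text)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_major_minor(a, b):
    """Conservative match: both versions agree on major.minor."""
    if a is None or b is None:
        return False
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc is not on PATH (e.g. add the toolkit's bin/ directory)")
    else:
        text = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
        toolkit = nvcc_release(text)
        try:
            import torch
            torch_cuda = torch.version.cuda
        except ImportError:
            torch_cuda = None
        print(f"nvcc toolkit: {toolkit}, PyTorch built with CUDA: {torch_cuda}")
        print("match" if same_major_minor(toolkit, torch_cuda) else "MISMATCH (or unknown)")
```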

parthe commented 2 years ago

I'm working on a Slurm cluster.

CUDA is installed, but not in this location: /usr/local/bin/

torch.cuda.is_available() returns True.

parthe commented 2 years ago

Here is my install script for falkon:

    yes | conda create -n Falkon_ML python=3.10 ipython
    conda activate Falkon_ML
    yes | conda install -c nvidia/label/cuda-11.3.1 cuda-toolkit
    yes | conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
    module load cmake
    module load gpu cuda
    CUDA_PATH='/home/$USER/.conda/envs/Falkon_ML/'
    git clone https://github.com/FalkonML/falkon.git falkon_ml
    python falkon_ml/falkon/setup.py develop

ahabedsoltan commented 2 years ago

I am having exactly the same problem. pykeops and PyTorch work properly. However, when I run flk = falkon.Falkon(kernel=kernel, penalty=1e-6, M=1000, options=options), I get 'RuntimeError: Not compiled with CUDA support', the same as in @parthe's first post.

  1. I first created a new Conda environment
  2. I installed CUDA 11.6: conda install -c "nvidia/label/cuda-11.6.2" cuda-toolkit
  3. I installed pytorch: conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
  4. I tried to install FALKON with python setup.py develop, but got the following error:

       /home/amirhesam/.conda/envs/flk2/bin/nvcc -DTORCH_VERSION_MAJOR=1 -DTORCH_VERSION_MINOR=12 -DTORCH_VERSION_PATCH=1 -DWITH_CUDA -I./falkon/csrc -I/home/amirhesam/.conda/envs/flk2/lib/python3.10/site-packages/torch/include -I/home/amirhesam/.conda/envs/flk2/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/amirhesam/.conda/envs/flk2/lib/python3.10/site-packages/torch/include/TH -I/home/amirhesam/.conda/envs/flk2/lib/python3.10/site-packages/torch/include/THC -I/home/amirhesam/.conda/envs/flk2/include -I/home/amirhesam/.conda/envs/flk2/include/python3.10 -c ./falkon/csrc/cuda/square_norm_cuda.cu -o build/temp.linux-x86_64-cpython-310/./falkon/csrc/cuda/square_norm_cuda.o -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr --compiler-options '-fPIC' --expt-relaxed-constexpr --expt-extended-lambda -O2 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=c_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++14
       In file included from /home/amirhesam/.conda/envs/flk2/lib/python3.10/site-packages/torch/include/ATen/native/cuda/Reduce.cuh:21,
                        from ./falkon/csrc/cuda/square_norm_cuda.cu:4:
       /home/amirhesam/.conda/envs/flk2/lib/python3.10/site-packages/torch/include/ATen/native/cuda/jit_utils.h:11:10: fatal error: ATen/cuda/nvrtc_stub/ATenNVRTC.h: No such file or directory
       #include <ATen/cuda/nvrtc_stub/ATenNVRTC.h>
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       compilation terminated.
       error: command '/home/amirhesam/.conda/envs/flk2/bin/nvcc' failed with exit code 1

  5. So I installed FALKON with pip install git+https://github.com/falkonml/falkon.git instead. This works on the CPU but not on the GPU, where it raises 'RuntimeError: Not compiled with CUDA support'.

I would appreciate it if you could help me with this.

Best,

Giodiro commented 2 years ago

Hi @ahabedsoltan, yes, this is a known issue with PyTorch 1.12. It seems to have been fixed upstream in PyTorch's master branch, but until a new PyTorch version is released, your best option is to downgrade to PyTorch 1.11.

Sorry for this, Giacomo
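
A build script could refuse the affected release up front instead of failing deep inside nvcc. A minimal sketch: the 1.12 check only encodes what is stated in the comment above and is not an official compatibility table; torch_release and build_blocked_by_known_issue are hypothetical helpers.

```python
import re


def torch_release(version):
    """Return (major, minor) from a torch.__version__ string such as '1.12.1+cu116'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def build_blocked_by_known_issue(version):
    """True for the PyTorch release line reported in this thread (1.12.x)."""
    return torch_release(version) == (1, 12)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if build_blocked_by_known_issue(torch.__version__):
            print(f"torch {torch.__version__}: downgrade to 1.11 before building Falkon")
        else:
            print(f"torch {torch.__version__}: not the 1.12 line reported here")
```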

ahabedsoltan commented 2 years ago

Thank you for your prompt answer. The problem was both the PyTorch version and the KeOps library: I had to switch to the pykeops beta version.