Ruyi-Zha / naf_cbct

MIT License

fatal error: cuda_runtime.h: No such file or directory #8

Closed jiayangshi closed 9 months ago

jiayangshi commented 1 year ago

Thank you for your great work. When I tried to run python train.py --config ./config/chest_50.yaml, I encountered this error message:

  File "/home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/home/shij3/anaconda3/envs/naf/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/shij3/naf_cbct/train.py", line 14, in <module>
    from src.trainer import Trainer
  File "/home/shij3/naf_cbct/src/trainer.py", line 12, in <module>
    from .encoder import get_encoder
  File "/home/shij3/naf_cbct/src/encoder/__init__.py", line 1, in <module>
    from .hashencoder import HashEncoder
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/__init__.py", line 1, in <module>
    from .hashgrid import HashEncoder
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/hashgrid.py", line 8, in <module>
    from .backend import _backend
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/backend.py", line 6, in <module>
    _backend = load(name='_hash_encoder',
  File "/home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension '_hash_encoder': [1/2] /home/shij3/anaconda3/envs/naf/bin/nvcc  -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include/TH -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include/THC -isystem /home/shij3/anaconda3/envs/naf/include -isystem /home/shij3/anaconda3/envs/naf/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -std=c++14 -c /home/shij3/naf_cbct/src/encoder/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o 
FAILED: hashencoder.cuda.o 
/home/shij3/anaconda3/envs/naf/bin/nvcc  -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include/TH -isystem /home/shij3/anaconda3/envs/naf/lib/python3.9/site-packages/torch/include/THC -isystem /home/shij3/anaconda3/envs/naf/include -isystem /home/shij3/anaconda3/envs/naf/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -std=c++14 -c /home/shij3/naf_cbct/src/encoder/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o 
<command-line>: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed.

nvcc was installed through conda with conda install -c "nvidia/label/cuda-11.3.0" cuda-nvcc, and the environment variable was set with export CUDA_HOME=$CONDA_PREFIX. nvcc --version shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

Do you know of a potential solution to this? Did you use a locally installed nvcc? Is it possible to use the nvcc from conda? Thank you.
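For anyone hitting the same error, a minimal sketch of a layout check may help narrow it down: it reports whether the directory used as CUDA_HOME contains both bin/nvcc and cuda_runtime.h. The function name check_cuda_home is hypothetical, and the second header location (targets/x86_64-linux/include) is an assumption about how some conda CUDA packages lay out their files, not something confirmed in this thread.

```python
import os
from pathlib import Path


def check_cuda_home(cuda_home):
    """Report whether a CUDA root contains the compiler and the runtime header."""
    root = Path(cuda_home)
    # Candidate header locations. The second entry is an assumption about
    # the layout used by some conda CUDA packages.
    header_candidates = [
        root / "include" / "cuda_runtime.h",
        root / "targets" / "x86_64-linux" / "include" / "cuda_runtime.h",
    ]
    return {
        "nvcc": (root / "bin" / "nvcc").is_file(),
        "cuda_runtime.h": any(p.is_file() for p in header_candidates),
    }


if __name__ == "__main__":
    # Fall back to CONDA_PREFIX because the thread sets CUDA_HOME=$CONDA_PREFIX.
    home = os.environ.get("CUDA_HOME") or os.environ.get("CONDA_PREFIX")
    if home is None:
        print("Neither CUDA_HOME nor CONDA_PREFIX is set")
    else:
        print(home, check_cuda_home(home))
```

If nvcc is reported as present but cuda_runtime.h is not, that would be consistent with the "No such file or directory" error above: the compiler is there, but the runtime headers are not under the root it searches.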

Ruyi-Zha commented 1 year ago

Hi,

Thanks for your interest. It seems that the issue is due to a missing cuda_runtime.h. I suggest using the precompiled CUDA in the PyTorch package instead of the locally installed one.

I have updated the setup instructions in README.md. Please give it a try and see if it works for you.

Ruyi

jiayangshi commented 1 year ago

Hi,

Thank you for your reply. I tried again and still couldn't sort it out. Do you know how I can point the build to the precompiled CUDA in the PyTorch package?

After installing as described in your README.md, I ran python train.py --config ./config/chest_50.yaml, and it reports:

Traceback (most recent call last):
  File "/home/shij3/naf_cbct/train.py", line 10, in <module>
    from src.trainer import Trainer
  File "/home/shij3/naf_cbct/src/trainer.py", line 12, in <module>
    from .encoder import get_encoder
  File "/home/shij3/naf_cbct/src/encoder/__init__.py", line 1, in <module>
    from .hashencoder import HashEncoder
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/__init__.py", line 1, in <module>
    from .hashgrid import HashEncoder
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/hashgrid.py", line 8, in <module>
    from .backend import _backend
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/backend.py", line 6, in <module>
    _backend = load(name='_hash_encoder',
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1446, in _write_ninja_file_and_build_library
    extra_ldflags = _prepare_ldflags(
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1554, in _prepare_ldflags
    extra_ldflags.append(f'-L{_join_cuda_home("lib64")}')
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2058, in _join_cuda_home
    raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

Because, as you mentioned, we should use the CUDA that comes with PyTorch, I set the environment variable to point at the conda environment with export CUDA_HOME=$CONDA_PREFIX. The error then becomes nvcc not found:

Traceback (most recent call last):
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/shij3/naf_cbct/train.py", line 10, in <module>
    from src.trainer import Trainer
  File "/home/shij3/naf_cbct/src/trainer.py", line 12, in <module>
    from .encoder import get_encoder
  File "/home/shij3/naf_cbct/src/encoder/__init__.py", line 1, in <module>
    from .hashencoder import HashEncoder
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/__init__.py", line 1, in <module>
    from .hashgrid import HashEncoder
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/hashgrid.py", line 8, in <module>
    from .backend import _backend
  File "/home/shij3/naf_cbct/src/encoder/hashencoder/backend.py", line 6, in <module>
    _backend = load(name='_hash_encoder',
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension '_hash_encoder': [1/3] /home/shij3/anaconda3/envs/naf_test/bin/nvcc  -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/TH -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/THC -isystem /home/shij3/anaconda3/envs/naf_test/include -isystem /home/shij3/anaconda3/envs/naf_test/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -std=c++14 -c /home/shij3/naf_cbct/src/encoder/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o 
FAILED: hashencoder.cuda.o 
/home/shij3/anaconda3/envs/naf_test/bin/nvcc  -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/TH -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/THC -isystem /home/shij3/anaconda3/envs/naf_test/include -isystem /home/shij3/anaconda3/envs/naf_test/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -std=c++14 -c /home/shij3/naf_cbct/src/encoder/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o 
/bin/sh: 1: /home/shij3/anaconda3/envs/naf_test/bin/nvcc: not found
[2/3] c++ -MMD -MF bindings.o.d -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/TH -isystem /home/shij3/anaconda3/envs/naf_test/lib/python3.9/site-packages/torch/include/THC -isystem /home/shij3/anaconda3/envs/naf_test/include -isystem /home/shij3/anaconda3/envs/naf_test/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -c /home/shij3/naf_cbct/src/encoder/hashencoder/src/bindings.cpp -o bindings.o 
ninja: build stopped: subcommand failed.

This is where the attempt from my original question came in: installing nvcc from conda with conda install -c "nvidia/label/cuda-11.3.0" cuda-nvcc. But if you mean that we can actually use the nvcc that comes with the installed PyTorch, how can I point the build to the precompiled CUDA in the PyTorch package?
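To make the two tracebacks above easier to reason about, here is a rough sketch of the lookup order that, as far as I understand it, torch.utils.cpp_extension follows when deciding on a CUDA root: the CUDA_HOME or CUDA_PATH environment variable first, then the parent of the bin directory of an nvcc found on PATH, then /usr/local/cuda if it exists. The function guess_cuda_home and its parameters are hypothetical and only imitate that order; this is not PyTorch's actual code.

```python
import os
import shutil


def guess_cuda_home(env=None, which=shutil.which, default="/usr/local/cuda"):
    """Imitate (as an assumption) the order in which a CUDA root is chosen."""
    env = os.environ if env is None else env
    # 1. An explicit environment variable wins.
    home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if home:
        return home
    # 2. Otherwise derive the root from an nvcc on PATH:
    #    <root>/bin/nvcc -> <root>
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))
    # 3. Fall back to the conventional system location, if it exists.
    return default if os.path.isdir(default) else None


if __name__ == "__main__":
    print("CUDA root guess:", guess_cuda_home())
```

Read this way, the first traceback (CUDA_HOME is not set) corresponds to all three steps finding nothing, and the second one (nvcc: not found) corresponds to step 1 succeeding with a directory that has no bin/nvcc in it.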

Ruyi-Zha commented 1 year ago

Hi, yes, we use the nvcc/CUDA that comes with the installed PyTorch. PyTorch should automatically point to the precompiled CUDA if it is correctly installed. I didn't manually specify the CUDA variable. I tried my code on different 30-series GPU systems (even one without a locally installed CUDA) and they all worked fine.

I suggest removing all installed nvcc/CUDA from your system and conda environment, then following README.md to create and set up a new environment. Note that nvcc/CUDA is already included by the PyTorch command pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113. You do not need to install it again with the conda command conda install -c "nvidia/label/cuda-11.3.0" cuda-nvcc. Hope this helps.

Ruyi
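After recreating the environment, one way to see what the installed PyTorch itself reports is a small script like the sketch below. It only reads values PyTorch exposes (torch.version.cuda, torch.cuda.is_available(), and torch.utils.cpp_extension.CUDA_HOME); the helper name describe_torch_cuda is hypothetical, and the script makes no claim about what the output should be on any particular machine.

```python
def describe_torch_cuda():
    """Collect what the installed PyTorch reports about its CUDA setup."""
    try:
        import torch
        from torch.utils import cpp_extension
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # Whether a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
        # CUDA root the extension builder will use (None if it found none).
        "cpp_extension_CUDA_HOME": cpp_extension.CUDA_HOME,
    }


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If cpp_extension_CUDA_HOME prints None, the JIT build of _hash_encoder would be expected to stop with the same "CUDA_HOME environment variable is not set" error shown earlier in the thread.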