ClementPinard / Pytorch-Correlation-extension

Custom implementation of Correlation Module
MIT License
411 stars 77 forks

OSError: CUDA_HOME environment variable not set when python setup.py in Dockerfile #95

Open stevezkw1998 opened 1 year ago

stevezkw1998 commented 1 year ago

My Dockerfile

FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

RUN apt-get update && apt-get install -y git gcc build-essential

RUN mkdir /app
WORKDIR /app

# Install Pytorch Correlation
RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
RUN cd Pytorch-Correlation-extension && python setup.py install
RUN cd -

EXPOSE 5252

CMD ["python", "app.py"]

The build then raises an error: OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root. The full error log:

 => ERROR [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install                                                                      2.2s
------
 > [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install:
#0 1.843 Traceback (most recent call last):
#0 1.843   File "/app/Pytorch-Correlation-extension/setup.py", line 57, in <module>
#0 1.843     launch_setup()
#0 1.844   File "/app/Pytorch-Correlation-extension/setup.py", line 36, in launch_setup
#0 1.844     Extension('spatial_correlation_sampler_backend',
#0 1.844   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
#0 1.844     library_dirs += library_paths(cuda=True)
#0 1.844   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
#0 1.845     if (not os.path.exists(_join_cuda_home(lib_dir)) and
#0 1.845   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
#0 1.845     raise EnvironmentError('CUDA_HOME environment variable is not set. '
#0 1.845 OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
------
Dockerfile:33
--------------------
  31 |     # Install Pytorch Correlation
  32 |     RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
  33 | >>> RUN cd Pytorch-Correlation-extension && python setup.py install
  34 |     RUN cd -
  35 |
--------------------
ERROR: failed to solve: process "/bin/sh -c cd Pytorch-Correlation-extension && python setup.py install" did not complete successfully: exit code: 1

ClementPinard commented 1 year ago

Hi, it looks to me like you need to use the devel image rather than the runtime one, since you need to be able to compile against torch and CUDA. So I would try changing the Docker image name from pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime to pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
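
To check which kind of image you are in before starting the build, a small diagnostic along these lines may help. It is only a sketch: `find_cuda_home` is a hypothetical helper, and the lookup order (environment variable, then `nvcc` on the PATH, then `/usr/local/cuda`) is an assumption that approximates what PyTorch's extension builder does, not a copy of it.

```python
import os
import shutil


def find_cuda_home(environ=None):
    """Best-effort guess at the CUDA toolkit root, or None if nothing is found.

    Hypothetical helper: the lookup order is an approximation of what
    PyTorch's extension builder does, not its actual implementation.
    """
    env = os.environ if environ is None else environ

    # 1. An explicitly set CUDA_HOME (or CUDA_PATH) wins.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]

    # 2. Otherwise derive the root from nvcc: <root>/bin/nvcc -> <root>.
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))

    # 3. Fall back to the conventional install location, if it exists.
    default = "/usr/local/cuda"
    return default if os.path.isdir(default) else None


root = find_cuda_home()
if root is None:
    print("No CUDA toolkit found; this is probably a runtime image.")
else:
    print(f"CUDA toolkit root: {root}")
```

If this reports that no toolkit was found, the CUDA_HOME error above is the expected outcome of `python setup.py install`.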

stevezkw1998 commented 1 year ago

Hi @ClementPinard, thank you for your advice. After I changed the Docker image name from pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime to pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel, the former issue was fixed, but I have a new issue:

 => ERROR [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install                                                                                                      15.9s 
------
 > [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install:
#0 1.665 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#0 1.689 running install
#0 1.689 /opt/conda/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
#0 1.689   warnings.warn(
#0 1.752 /opt/conda/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
#0 1.752   warnings.warn(
#0 1.818 running bdist_egg
#0 1.830 running egg_info
#0 1.830 creating Correlation_Module/spatial_correlation_sampler.egg-info
#0 1.835 writing Correlation_Module/spatial_correlation_sampler.egg-info/PKG-INFO
#0 1.836 writing dependency_links to Correlation_Module/spatial_correlation_sampler.egg-info/dependency_links.txt
#0 1.836 writing requirements to Correlation_Module/spatial_correlation_sampler.egg-info/requires.txt
#0 1.836 writing top-level names to Correlation_Module/spatial_correlation_sampler.egg-info/top_level.txt
#0 1.836 writing manifest file 'Correlation_Module/spatial_correlation_sampler.egg-info/SOURCES.txt'
#0 1.842 /opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
#0 1.842   warnings.warn(msg.format('we could not find ninja.'))
#0 1.846 reading manifest file 'Correlation_Module/spatial_correlation_sampler.egg-info/SOURCES.txt'
#0 1.847 adding license file 'LICENSE'
#0 1.847 writing manifest file 'Correlation_Module/spatial_correlation_sampler.egg-info/SOURCES.txt'
#0 1.848 installing library code to build/bdist.linux-x86_64/egg
#0 1.848 running install_lib
#0 1.848 running build_py
#0 1.849 creating build
#0 1.849 creating build/lib.linux-x86_64-cpython-310
#0 1.849 creating build/lib.linux-x86_64-cpython-310/spatial_correlation_sampler
#0 1.849 copying Correlation_Module/spatial_correlation_sampler/spatial_correlation_sampler.py -> build/lib.linux-x86_64-cpython-310/spatial_correlation_sampler
#0 1.850 copying Correlation_Module/spatial_correlation_sampler/__init__.py -> build/lib.linux-x86_64-cpython-310/spatial_correlation_sampler
#0 1.850 running build_ext
#0 1.868 building 'spatial_correlation_sampler_backend' extension
#0 1.868 creating build/temp.linux-x86_64-cpython-310
#0 1.868 creating build/temp.linux-x86_64-cpython-310/Correlation_Module
#0 1.869 gcc -pthread -B /opt/conda/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/include -fPIC -O2 -isystem /opt/conda/include -fPIC -DUSE_CUDA -I/opt/conda/lib/python3.10/site-packages/torch/include -I/opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.10 -c Correlation_Module/correlation.cpp -o build/temp.linux-x86_64-cpython-310/Correlation_Module/correlation.o -std=c++14 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=spatial_correlation_sampler_backend -D_GLIBCXX_USE_CXX11_ABI=0
#0 15.65 Traceback (most recent call last):
#0 15.65   File "/app/Pytorch-Correlation-extension/setup.py", line 69, in <module>
#0 15.65     launch_setup()
#0 15.65   File "/app/Pytorch-Correlation-extension/setup.py", line 37, in launch_setup
#0 15.65     setup(
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
#0 15.65     return distutils.core.setup(**attrs)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
#0 15.65     return run_commands(dist)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
#0 15.65     dist.run_commands()
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
#0 15.65     self.run_command(cmd)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 1208, in run_command
#0 15.65     super().run_command(command)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
#0 15.65     cmd_obj.run()
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/install.py", line 74, in run
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
#0 15.66     _build_ext.build_extension(self, ext)
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 549, in build_extension
#0 15.66     objects = self.compiler.compile(
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/ccompiler.py", line 599, in compile
#0 15.66     self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 581, in unix_wrap_single_compile
#0 15.66     cflags = unix_cuda_flags(cflags)
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 548, in unix_cuda_flags
#0 15.66     cflags + _get_cuda_arch_flags(cflags))
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1773, in _get_cuda_arch_flags
#0 15.66     arch_list[-1] += '+PTX'
#0 15.66 IndexError: list index out of range
------
Dockerfile:33
--------------------
  31 |     # Install Pytorch Correlation
  32 |     RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
  33 | >>> RUN cd Pytorch-Correlation-extension && python setup.py install
  34 |     RUN cd -
  35 |
--------------------
ERROR: failed to solve: process "/bin/sh -c cd Pytorch-Correlation-extension && python setup.py install" did not complete successfully: exit code: 1
Docker build failed with error: Command 'docker build -t sam-track:1.0.0 ..' returned non-zero exit status 1.

ClementPinard commented 1 year ago

See this related issue: https://github.com/ClementPinard/Pytorch-Correlation-extension/issues/90

The GPU is not available during docker build, so you need to figure out your compute capabilities beforehand and set the TORCH_CUDA_ARCH_LIST environment variable accordingly.
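
As a rough sketch of how that value can be put together: on the target machine, where the GPU is visible, `torch.cuda.get_device_capability()` returns a `(major, minor)` pair, and TORCH_CUDA_ARCH_LIST expects those pairs written as `major.minor` and separated by semicolons. The helper name `arch_list_from_capabilities` and the capability values below are made up for illustration; they do not come from this issue.

```python
import os


def arch_list_from_capabilities(capabilities):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))  # drop duplicates, lowest arch first
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# On a machine with a visible GPU, each pair can be read with
#   torch.cuda.get_device_capability(device_index)
# The pairs below are placeholders, not values taken from this issue.
target_gpus = [(8, 6), (7, 5), (8, 6)]

arch_list = arch_list_from_capabilities(target_gpus)
print(arch_list)  # prints 7.5;8.6

# The variable has to be set in the environment of the process that builds.
os.environ["TORCH_CUDA_ARCH_LIST"] = arch_list
```

In a Dockerfile, the same string would go into an `ENV TORCH_CUDA_ARCH_LIST=...` instruction placed before the `RUN` line that calls `setup.py`.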

stevezkw1998 commented 1 year ago

Hi @ClementPinard, thank you for your solution. But I may need to deploy my Docker image to different computers. Is there a general way to solve the TORCH_CUDA_ARCH_LIST environment variable issue?

ClementPinard commented 5 months ago

If you don't know what the CUDA compute capabilities of your machine's GPU will be, your best bet is to compile for as many architectures as possible, or to wait until the container is launched to compile the library. Compiled code cannot be generic.
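
Both options can be sketched roughly as follows. Everything here is an assumption for illustration: the architecture list is only an example for a CUDA 11.7 toolkit, `ensure_extension` is a hypothetical helper, and the source path mirrors the Dockerfile above. For the second option, the GPU is normally visible once the container is running, so PyTorch should be able to detect the architecture by itself and the variable can stay unset.

```python
import importlib.util
import subprocess
import sys

# Option 1: a broad list to set at image build time. Example values for a
# CUDA 11.7 toolkit; adjust to the GPUs you expect. "+PTX" on the last entry
# also embeds PTX, which newer GPUs can JIT-compile at load time.
BROAD_ARCH_LIST = "6.0;6.1;7.0;7.5;8.0;8.6+PTX"


def ensure_extension(module_name, source_dir, install=None):
    """Option 2: build at container start if the module is not importable.

    Returns True if an install was triggered, False if nothing was needed.
    `install` is injectable so that the check can be exercised without
    actually compiling anything.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    if install is None:
        def install(path):
            # Same command as in the Dockerfile, run inside the checkout.
            subprocess.run(
                [sys.executable, "setup.py", "install"], cwd=path, check=True
            )
    install(source_dir)
    return True


# Hypothetical use at the top of an entrypoint script, before app.py starts:
#   ensure_extension("spatial_correlation_sampler",
#                    "/app/Pytorch-Correlation-extension")
```

The trade-off is image size and build time for the broad list, against a slower first start (and a devel image at runtime) for the deferred build.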