NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

ninja: error: '/app/csrc/amp_C_frontend.cpp', needed by '/app/build/temp.linux-x86_64-cpython-310/csrc/amp_C_frontend.o', missing and no known rule to make it #1741

Closed Tolga-Karahan closed 8 months ago

Tolga-Karahan commented 8 months ago

Describe the Bug

I used the command suggested in the README for pip >= 23.1: pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./. When I try to use adamw_apex_fused as the optimizer for a Hugging Face model, it reports that the extensions are not installed. I'm building Apex in a Dockerfile, so I also added this command to build the extensions: python ./apex/setup.py install --cuda_ext, as suggested here, but it fails with the error below.
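One way to confirm the "extensions are not installed" symptom independently of the Hugging Face trainer is to check whether the compiled modules can be located at all. The following is a minimal sketch, not part of Apex: the helper name find_missing_extensions is hypothetical, amp_C is the extension name that appears in the build log, and apex_C is assumed to be the module produced by --cpp_ext.

```python
import importlib.util


def find_missing_extensions(names=("amp_C", "apex_C")):
    """Return the module names that cannot be located on sys.path.

    The default names are assumptions: amp_C appears in the build log,
    and apex_C is assumed to be the module built by --cpp_ext.
    """
    # find_spec only locates a top-level module; it does not import it,
    # so this check does not need a GPU or a loaded CUDA runtime.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_extensions()
    print("missing extensions:", missing if missing else "none")
```

Running this inside the built image after the pip install step would show whether the install produced the compiled modules or only the Python-only package.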

The command in the Dockerfile:

  RUN git clone https://github.com/NVIDIA/apex && cd apex && git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
  RUN export PATH=/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/cuda:$PATH && \
    export C_INCLUDE_PATH=${C_INCLUDE_PATH}:/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/include/ && \
    conda install -c "nvidia/label/cuda-11.7.0" cuda-nvcc && \
    conda install -c "nvidia/label/cuda-11.7.0" cuda-cudart && \
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./apex && \
    python ./apex/setup.py install --cuda_ext

Environment

Base image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
Python: 3.10.9
CUDA: 11.7
PyTorch: 2.0

Error:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#19 34.66 /app/./apex/setup.py:46: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
#19 34.66   warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
#19 35.00 No CUDA runtime is found, using CUDA_HOME='/opt/conda'
#19 35.06 
#19 35.06 Warning: Torch did not find available GPUs on this system.
#19 35.06  If your intention is to cross-compile, this is not an error.
#19 35.06 By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
#19 35.06 Volta (compute capability 7.0), and Turing (compute capability 7.5).
#19 35.06 If you wish to cross-compile for a single specific architecture,
#19 35.06 export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
#19 35.06 
#19 35.06 torch.__version__  =  2.0.0
#19 35.06 
#19 35.06 Compiling cuda extensions with
#19 35.06 nvcc: NVIDIA (R) Cuda compiler driver
#19 35.06 Copyright (c) 2005-2022 NVIDIA Corporation
#19 35.06 Built on Tue_May__3_18:49:52_PDT_2022
#19 35.06 Cuda compilation tools, release 11.7, V11.7.64
#19 35.06 Build cuda_11.7.r11.7/compiler.31294372_0
#19 35.06 from /opt/conda/bin
#19 35.06 
#19 35.06 running install
#19 35.06 /opt/conda/lib/python3.10/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
#19 35.06 !!
#19 35.06 
#19 35.06         ********************************************************************************
#19 35.06         Please avoid running ``setup.py`` directly.
#19 35.06         Instead, use pypa/build, pypa/installer or other
#19 35.06         standards-based tools.
#19 35.06 
#19 35.06         See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
#19 35.06         ********************************************************************************
#19 35.06 
#19 35.06 !!
#19 35.06   self.initialize_options()
#19 35.18 /opt/conda/lib/python3.10/site-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
#19 35.18 !!
#19 35.18 
#19 35.18         ********************************************************************************
#19 35.18         Please avoid running ``setup.py`` and ``easy_install``.
#19 35.18         Instead, use pypa/build, pypa/installer or other
#19 35.18         standards-based tools.
#19 35.18 
#19 35.18         See https://github.com/pypa/setuptools/issues/917 for details.
#19 35.18         ********************************************************************************
#19 35.18 
#19 35.18 !!
#19 35.18   self.initialize_options()
#19 35.29 running bdist_egg
#19 35.32 running egg_info
#19 35.32 creating apex.egg-info
#19 35.33 writing apex.egg-info/PKG-INFO
#19 35.33 writing dependency_links to apex.egg-info/dependency_links.txt
#19 35.34 writing top-level names to apex.egg-info/top_level.txt
#19 35.34 writing manifest file 'apex.egg-info/SOURCES.txt'
#19 35.45 ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
#19 35.45 reading manifest file 'apex.egg-info/SOURCES.txt'
#19 35.45 writing manifest file 'apex.egg-info/SOURCES.txt'
#19 35.45 installing library code to build/bdist.linux-x86_64/egg
#19 35.45 running install_lib
#19 35.45 running build_ext
#19 35.47 building 'amp_C' extension
#19 35.47 creating /app/build
#19 35.47 creating /app/build/temp.linux-x86_64-cpython-310
#19 35.47 creating /app/build/temp.linux-x86_64-cpython-310/csrc
#19 35.52 Emitting ninja build file /app/build/temp.linux-x86_64-cpython-310/build.ninja...
#19 35.52 Compiling objects...
#19 35.52 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
#19 35.56 ninja: error: '/app/csrc/amp_C_frontend.cpp', needed by '/app/build/temp.linux-x86_64-cpython-310/csrc/amp_C_frontend.o', missing and no known rule to make it
#19 35.57 Traceback (most recent call last):
#19 35.57   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
#19 35.57     subprocess.run(
#19 35.57   File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
#19 35.57     raise CalledProcessError(retcode, process.args,
#19 35.57 subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
#19 35.57 
#19 35.57 The above exception was the direct cause of the following exception:
#19 35.57 
#19 35.57 Traceback (most recent call last):
#19 35.57   File "/app/./apex/setup.py", line 285, in <module>
#19 35.57     setup(
#19 35.57   File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
#19 35.57     return distutils.core.setup(**attrs)
#19 35.57   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
#19 35.57     return run_commands(dist)
#19 35.57   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
#19 35.57     dist.run_commands()
#19 35.57   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
#19 35.58     self.run_command(cmd)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
#19 35.58     super().run_command(command)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
#19 35.58     cmd_obj.run()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/install.py", line 84, in run
#19 35.58     self.do_egg_install()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/install.py", line 132, in do_egg_install
#19 35.58     self.run_command('bdist_egg')
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
#19 35.58     self.distribution.run_command(command)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
#19 35.58     super().run_command(command)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
#19 35.58     cmd_obj.run()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/bdist_egg.py", line 167, in run
#19 35.58     cmd = self.call_command('install_lib', warn_dir=0)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/bdist_egg.py", line 153, in call_command
#19 35.58     self.run_command(cmdname)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
#19 35.58     self.distribution.run_command(command)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
#19 35.58     super().run_command(command)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
#19 35.58     cmd_obj.run()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/install_lib.py", line 11, in run
#19 35.58     self.build()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/command/install_lib.py", line 111, in build
#19 35.58     self.run_command('build_ext')
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
#19 35.58     self.distribution.run_command(command)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
#19 35.58     super().run_command(command)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
#19 35.58     cmd_obj.run()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 88, in run
#19 35.58     _build_ext.run(self)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
#19 35.58     self.build_extensions()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
#19 35.58     build_ext.build_extensions(self)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
#19 35.58     self._build_extensions_serial()
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
#19 35.58     self.build_extension(ext)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
#19 35.58     _build_ext.build_extension(self, ext)
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
#19 35.58     objects = self.compiler.compile(
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
#19 35.58     _write_ninja_file_and_compile_objects(
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
#19 35.58     _run_ninja_build(
#19 35.58   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
#19 35.58     raise RuntimeError(message) from e
#19 35.58 RuntimeError: Error compiling objects for extension
#19 ERROR: process "/bin/sh -c export PATH=/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/cuda:$PATH &&     export C_INCLUDE_PATH=${C_INCLUDE_PATH}:/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/include/ &&     conda install -c \"nvidia/label/cuda-11.7.0\" cuda-nvcc &&     conda install -c \"nvidia/label/cuda-11.7.0\" cuda-cudart &&     pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings \"--build-option=--cpp_ext\" --config-settings \"--build-option=--cuda_ext\" ./apex &&     python ./apex/setup.py install --cuda_ext" did not complete successfully: exit code: 1
------
 > [10/17] RUN export PATH=/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/cuda:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin &&     export C_INCLUDE_PATH=${C_INCLUDE_PATH}:/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/include/ &&     conda install -c "nvidia/label/cuda-11.7.0" cuda-nvcc &&     conda install -c "nvidia/label/cuda-11.7.0" cuda-cudart &&     pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./apex &&     python ./apex/setup.py install --cuda_ext:
------
Dockerfile:20
--------------------
  19 |     RUN git clone https://github.com/NVIDIA/apex && cd apex && git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
  20 | >>> RUN export PATH=/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/cuda:$PATH && \
  21 | >>>     export C_INCLUDE_PATH=${C_INCLUDE_PATH}:/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/include/ && \
  22 | >>>     conda install -c "nvidia/label/cuda-11.7.0" cuda-nvcc && \
  23 | >>>     conda install -c "nvidia/label/cuda-11.7.0" cuda-cudart && \
  24 | >>>     pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./apex && \
  25 | >>>     python ./apex/setup.py install --cuda_ext
  26 |     
--------------------
error: failed to solve: rpc error: code = Unknown desc = process "/bin/sh -c export PATH=/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/cuda:$PATH &&     export C_INCLUDE_PATH=${C_INCLUDE_PATH}:/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/include/ &&     conda install -c \"nvidia/label/cuda-11.7.0\" cuda-nvcc &&     conda install -c \"nvidia/label/cuda-11.7.0\" cuda-cudart &&     pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings \"--build-option=--cpp_ext\" --config-settings \"--build-option=--cuda_ext\" ./apex &&     python ./apex/setup.py install --cuda_ext" did not complete successfully: exit code: 1
Error: buildx failed with: error: failed to solve: rpc error: code = Unknown desc = process "/bin/sh -c export PATH=/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/cuda:$PATH &&     export C_INCLUDE_PATH=${C_INCLUDE_PATH}:/opt/conda/pkgs/pytorch-2.0.1-py3.10_cuda11.7_cudnn8.5.0_0/lib/python3.10/site-packages/torch/include/ &&     conda install -c \"nvidia/label/cuda-11.7.0\" cuda-nvcc &&     conda install -c \"nvidia/label/cuda-11.7.0\" cuda-cudart &&     pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings \"--build-option=--cpp_ext\" --config-settings \"--build-option=--cuda_ext\" ./apex &&     python ./apex/setup.py install --cuda_ext" did not complete successfully: exit code: 1
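
The ninja error in the log refers to /app/csrc/amp_C_frontend.cpp, while the clone lives in /app/apex. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that setup.py lists its sources as relative paths, so they are resolved against the directory the command is run from rather than the directory that contains setup.py. The sketch below only illustrates that path arithmetic; resolve_source is a hypothetical helper, and the relative source path is assumed.

```python
from pathlib import PurePosixPath


def resolve_source(cwd, relative_source):
    """Anchor a relative source path to a working directory, as a build tool would."""
    return str(PurePosixPath(cwd) / relative_source)


# Assumed values: /app is the working directory seen in the log, and
# csrc/amp_C_frontend.cpp is assumed to be a relative entry in setup.py.
from_app = resolve_source("/app", "csrc/amp_C_frontend.cpp")
from_clone = resolve_source("/app/apex", "csrc/amp_C_frontend.cpp")

print(from_app)    # the path ninja reported as missing
print(from_clone)  # the path inside the cloned repository
```

If this is the cause, running the build step from inside the apex directory would be expected to make the relative paths resolve to the clone. That has not been verified here.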