NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

TransformerEngine build fails with Conda #954

Closed TeddLi closed 3 months ago

TeddLi commented 3 months ago

`Requirement already satisfied: mpmath>=0.19 in /workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages (from sympy->torch->transformer_engine==1.6.0+c81733f) (1.3.0)
Installing collected packages: transformer_engine
  Running setup.py develop for transformer_engine
    error: subprocess-exited-with-error

× python setup.py develop did not run successfully.
│ exit code: 1
╰─> [98 lines of output]
    running develop
    /workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/command/develop.py:40: EasyInstallDeprecationWarning: easy_install command is deprecated.
    !!

            ********************************************************************************
            Please avoid running ``setup.py`` and ``easy_install``.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.

            See https://github.com/pypa/setuptools/issues/917 for details.
            ********************************************************************************

    !!
      easy_install.initialize_options(self)
    /workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
    !!

            ********************************************************************************
            Please avoid running ``setup.py`` directly.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.

            See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
            ********************************************************************************

    !!
      self.initialize_options()
    running egg_info
    writing transformer_engine.egg-info/PKG-INFO
    writing dependency_links to transformer_engine.egg-info/dependency_links.txt
    writing requirements to transformer_engine.egg-info/requires.txt
    writing top-level names to transformer_engine.egg-info/top_level.txt
    reading manifest file 'transformer_engine.egg-info/SOURCES.txt'
    adding license file 'LICENSE'
    writing manifest file 'transformer_engine.egg-info/SOURCES.txt'
    running build_ext
    CMake Error at cmake/FindCUDNN.cmake:14 (file):
      file failed to open for reading (Not a directory):

        /workspace/miniconda3/envs/megatron_neo/include/cudnn.h/cudnn_version.h
    Call Stack (most recent call first):
      CMakeLists.txt:24 (find_package)

    CMake Error at cmake/FindCUDNN.cmake:19 (find_library):
      Could not find cudnn_LIBRARY using the following names: cudnn, libcudnn.so.
    Call Stack (most recent call first):
      cmake/FindCUDNN.cmake:41 (find_cudnn_library)
      CMakeLists.txt:24 (find_package)

    -- Configuring incomplete, errors occurred!
    See also "/home/slurm/TransformerEngine/build/cmake/CMakeFiles/CMakeOutput.log".
    Traceback (most recent call last):
      File "/home/slurm/TransformerEngine/setup.py", line 336, in _build_cmake
        subprocess.run(command, cwd=build_dir, check=True)
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/subprocess.py", line 526, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['/usr/bin/cmake', '-S', '/home/slurm/TransformerEngine/transformer_engine', '-B', '/home/slurm/TransformerEngine/build/cmake', '-DPython_EXECUTABLE=/workspace/miniconda3/envs/megatron_neo/bin/python', '-DPython_INCLUDE_DIR=/workspace/miniconda3/envs/megatron_neo/include/python3.10', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/home/slurm/TransformerEngine', '-GNinja']' returned non-zero exit status 1.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/home/slurm/TransformerEngine/setup.py", line 617, in <module>
        main()
      File "/home/slurm/TransformerEngine/setup.py", line 602, in main
        setuptools.setup(
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/__init__.py", line 104, in setup
        return distutils.core.setup(**attrs)
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 184, in setup
        return run_commands(dist)
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
        dist.run_commands()
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
        self.run_command(cmd)
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/dist.py", line 967, in run_command
        super().run_command(command)
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/command/develop.py", line 34, in run
        self.install_for_development()
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/command/develop.py", line 111, in install_for_development
        self.run_command('build_ext')
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
        self.distribution.run_command(command)
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/dist.py", line 967, in run_command
        super().run_command(command)
      File "/workspace/miniconda3/envs/megatron_neo/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/home/slurm/TransformerEngine/setup.py", line 368, in run
        ext._build_cmake(
      File "/home/slurm/TransformerEngine/setup.py", line 338, in _build_cmake
        raise RuntimeError(f"Error when running CMake: {e}")
    RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/home/slurm/TransformerEngine/transformer_engine', '-B', '/home/slurm/TransformerEngine/build/cmake', '-DPython_EXECUTABLE=/workspace/miniconda3/envs/megatron_neo/bin/python', '-DPython_INCLUDE_DIR=/workspace/miniconda3/envs/megatron_neo/include/python3.10', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/home/slurm/TransformerEngine', '-GNinja']' returned non-zero exit status 1.
    Building CMake extension transformer_engine
    Running command /usr/bin/cmake -S /home/slurm/TransformerEngine/transformer_engine -B /home/slurm/TransformerEngine/build/cmake -DPython_EXECUTABLE=/workspace/miniconda3/envs/megatron_neo/bin/python -DPython_INCLUDE_DIR=/workspace/miniconda3/envs/megatron_neo/include/python3.10 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/slurm/TransformerEngine -GNinja
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.

`

TeddLi commented 3 months ago

Conda didn't have cudnn.h at the beginning. I installed it manually after PyTorch using conda install -c conda-forge cudnn

ywb2018 commented 3 months ago

> Conda didn't have cudnn.h at the beginning. I installed it manually after PyTorch using conda install -c conda-forge cudnn

Does it work?

TeddLi commented 3 months ago

> > Conda didn't have cudnn.h at the beginning. I installed it manually after PyTorch using conda install -c conda-forge cudnn
>
> Does it work?

I solved it by installing cuDNN manually with sudo apt-get -y install cudnn9-cuda-12

timmoon10 commented 3 months ago

When installing cuDNN, you should make sure CUDNN_PATH is set in the environment so that the TE build system can find it. It's a bit inconsistent what environment variables are checked (PyTorch itself is inconsistent between CUDNN_PATH and CUDNN_ROOT, see https://github.com/NVIDIA/TransformerEngine/issues/918#issuecomment-2166583402), so I speculate that Conda isn't setting CUDNN_PATH while APT is.
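
For anyone hitting the same error, here is a minimal sketch of a pre-build check that looks for the cuDNN headers and prints a CUDNN_PATH to export before running pip install. It is not part of TransformerEngine: the helper names (find_cudnn_root, default_candidates) and the list of candidate prefixes are assumptions for illustration, and the actual header location depends on how cuDNN was installed.

```python
import os
from pathlib import Path
from typing import Iterable, List, Optional


def find_cudnn_root(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first prefix whose include/ directory holds a cuDNN header."""
    for prefix in candidates:
        if not prefix:
            # Skip unset environment variables (empty strings).
            continue
        root = Path(prefix)
        include = root / "include"
        # Newer cuDNN releases ship cudnn_version.h alongside cudnn.h.
        if (include / "cudnn_version.h").is_file() or (include / "cudnn.h").is_file():
            return root
    return None


def default_candidates() -> List[str]:
    """Prefixes worth checking. This list is an assumption, adjust as needed."""
    return [
        os.environ.get("CUDNN_PATH", ""),
        os.environ.get("CUDNN_ROOT", ""),
        os.environ.get("CONDA_PREFIX", ""),  # set when a Conda env is active
        "/usr",
        "/usr/local/cuda",
    ]


if __name__ == "__main__":
    root = find_cudnn_root(default_candidates())
    if root is None:
        print("cuDNN headers not found in any candidate prefix")
    else:
        # Copy this line into the shell before building.
        print(f"export CUDNN_PATH={root}")
```

Note that the value should be the install prefix (the directory containing include/), not the path to cudnn.h itself. The "Not a directory" error for .../include/cudnn.h/cudnn_version.h in the log above looks like a header file path being treated as a directory.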