NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Installation failed with cmake error #355

Open RuiWang1998 opened 1 year ago

RuiWang1998 commented 1 year ago

Hi,

We are testing our new Hopper machines (H800/H100) and trying to use FP8 for training for the first time, but we are having trouble installing TransformerEngine. The build fails with:

RuntimeError: Error when running CMake: Command '['/usr/local/bin/cmake', '-S', '/tmp/pip-req-build-p6kjladj/transformer_engine', '-B', '/tmp/tmps08o01xi', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-p6kjladj/build/lib.linux-x86_64-cpython-310', '-GNinja']' returned non-zero exit status 1.

We tried to invoke the command outside of pip, and it just reports that there is no source directory.

We are trying Docker right now, but our network configuration makes using Docker inconvenient, so we would prefer not to use it. Could you show us where we might find clues on how to proceed? Much appreciated.
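For reference, here is a minimal sketch of reproducing the CMake step from a local clone instead of pip's temporary source directory (which is already gone by the time the command is re-run); the build directory name is an arbitrary choice:

# clone with submodules so the bundled cudnn-frontend headers are present
git clone --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
# configure and build the C++/CUDA library the same way pip does,
# so the real compiler error is printed directly
cmake -S transformer_engine -B build -DCMAKE_BUILD_TYPE=Release -GNinja
cmake --build build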

ptrendx commented 1 year ago

Hi @RuiWang1998, could you share the command you use for installation and a full error message that you are getting? Thank you!

RuiWang1998 commented 1 year ago

Hi @ptrendx, we used both pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable and pip install git+https://github.com/NVIDIA/TransformerEngine.git@main, and tried Python versions from 3.9 to 3.11. Each time we simply installed pytorch==2.0.1 and packaging and then ran the two commands. They both returned the same error.
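For completeness, the sequence was essentially the following (a sketch of the steps described above, using the PyPI package name torch for PyTorch 2.0.1):

pip install torch==2.0.1 packaging
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
# or, for the main branch
pip install git+https://github.com/NVIDIA/TransformerEngine.git@main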

RuiWang1998 commented 1 year ago

Hi @ptrendx, after a little digging, we think we have located the problem, but we are not sure what the solution is:

/usr/bin/c++ -Dtransformer_engine_EXPORTS -I/home/rui/TransformerEngine/transformer_engine -I/home/rui/TransformerEngine/transformer_engine/common/include -I/usr/local/cuda-11.8/targets/x86_64-linux/include -I/home/rui/TransformerEngine/transformer_engine/../3rdparty/cudnn-frontend/include -I/tmp/tmp9cj2vyni/common/string_headers -isystem /usr/local/cuda-11.8/include -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -MF common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o.d -o common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -c /home/rui/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp
In file included from /usr/local/cuda-11.8/include/cuda_fp8.h:350,
                 from /home/rui/TransformerEngine/transformer_engine/common/fused_attn/../common.h:14,
                 from /home/rui/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp:8:
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator short unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:735:16: error: ‘__half2ushort_rz’ was not declared in this scope
  735 |         return __half2ushort_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:744:16: error: ‘__half2uint_rz’ was not declared in this scope
  744 |         return __half2uint_rz(__half(*this));
      |                ^~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator long long unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:753:16: error: ‘__half2ull_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
  753 |         return __half2ull_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator short int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:791:16: error: ‘__half2short_rz’ was not declared in this scope
  791 |         return __half2short_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:800:16: error: ‘__half2int_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
  800 |         return __half2int_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator long long int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:809:16: error: ‘__half2ll_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
  809 |         return __half2ll_rz(__half(*this));
      |                ^~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator short unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1248:16: error: ‘__half2ushort_rz’ was not declared in this scope
 1248 |         return __half2ushort_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1257:16: error: ‘__half2uint_rz’ was not declared in this scope
 1257 |         return __half2uint_rz(__half(*this));
      |                ^~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator long long unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1266:16: error: ‘__half2ull_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
 1266 |         return __half2ull_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator short int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1303:16: error: ‘__half2short_rz’ was not declared in this scope
 1303 |         return __half2short_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1311:16: error: ‘__half2int_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
 1311 |         return __half2int_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator long long int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1319:16: error: ‘__half2ll_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
 1319 |         return __half2ll_rz(__half(*this));
      |                ^~~~~~~~~~~~
      |                __half2_raw
ninja: build stopped: subcommand failed.

It seems we are missing some headers; where should we include them?

We have machines with CUDA 11.8 and machines with CUDA 12, and we believe the failure has the same cause on both.

RuiWang1998 commented 1 year ago

Hi,

Some updates: our H800 machines can now install successfully, but our A100 machines still cannot. The H800 machines just needed cuDNN, while the A100 machines still hit the error above even after installing cuDNN.

ptrendx commented 1 year ago

Hi, this is a pretty strange error - functions like __half2ushort_rz are declared inside the cuda_fp16.hpp file, which should be in the include directory of your CUDA installation (in this case /usr/local/cuda-11.8/include or /usr/local/cuda-11.8/targets/x86_64-linux/include). Could you confirm that this file exists there?
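A quick sketch of how to check, using the CUDA 11.8 paths above:

ls -l /usr/local/cuda-11.8/include/cuda_fp16.hpp
ls -l /usr/local/cuda-11.8/targets/x86_64-linux/include/cuda_fp16.hpp
# confirm the missing conversion intrinsics are declared in that header
grep -n "__half2ushort_rz" /usr/local/cuda-11.8/include/cuda_fp16.hpp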

RuiWang1998 commented 1 year ago

Hi, yes it is in /usr/local/cuda-11.8/include and it seems that __half2ushort_rz is declared there.

MicPie commented 1 year ago

Any update on this issue?

RuiWang1998 commented 1 year ago

Hi, @MicPie ,

We have been able to install this with newer commits now. Were you trying on stable releases?

mahdip72 commented 10 months ago

I have the same problem on my workstation with an A6000 Ada.

raise RuntimeError(f"Error when running CMake: {e}")
      RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/tmp/pip-req-build-hnl1xnl7/transformer_engine', '-B', '/tmp/tmp6vkf06mc', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-hnl1xnl7/build/lib.linux-x86_64-cpython-311']' returned non-zero exit status 1.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for transformer-engine

@RuiWang1998 Could you help me figure out what I should do? Install cuDNN? CUDA 11.8, PyTorch 2.1.0, Python 3.11, Ubuntu 22.04.

RuiWang1998 commented 10 months ago

Hi,

You would have to modify setup.py to make it print the actual error message (or run the commands manually in a terminal) so that we can see exactly what is going on.
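One way to capture the full output without modifying setup.py is to run pip verbosely and save the log (a sketch; the log filename is arbitrary):

pip install -v git+https://github.com/NVIDIA/TransformerEngine.git@stable 2>&1 | tee te_build.log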

Best,
Rui

liuchangdm commented 7 months ago

@RuiWang1998 Could you show which release version you used? I had the same problem. Thanks.

hellangleZ commented 6 months ago

Same issue

File "/aml2/TransformerEngine/setup.py", line 338, in _build_cmake raise RuntimeError(f"Error when running CMake: {e}") RuntimeError: Error when running CMake: Command '['/aml/conda/bin/cmake', '-S', '/aml2/TransformerEngine/transformer_engine', '-B', '/aml2/TransformerEngine/build/cmake', '-DPython_EXECUTABLE=/aml2/ds2/bin/python', '-DPython_INCLUDE_DIR=/aml2/ds2/include/python3.10', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/aml2/TransformerEngine/build/lib.linux-x86_64-cpython-310', '-GNinja', '-Dpybind11_DIR=/aml2/ds2/lib/python3.10/site-packages/pybind11/share/cmake/pybind11']' returned non-zero exit status 1. [end of output]

timmoon10 commented 6 months ago

The CMake error message should already be printed to stderr, although it is somewhat buried within the Python stacktrace from setup.py. It may be helpful to search for "Building CMake extension transformer_engine" within your build logs.
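For example, assuming the build output was saved to a file such as te_build.log (a hypothetical name), something like this jumps to the relevant part of the log:

grep -n -A 20 "Building CMake extension transformer_engine" te_build.log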

If the error is happening during CMake configuration, it's probably because CUDA or cuDNN are not properly installed. See CUDA instructions at https://github.com/NVIDIA/TransformerEngine/issues/700#issuecomment-1979377899. For cuDNN, make sure CUDNN_PATH is set in your environment.
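As a sketch, assuming CUDA is installed under /usr/local/cuda and cuDNN under /usr/local/cudnn (both paths are examples; adjust them to your system), the environment could be set up like this before installing:

# make nvcc visible to the build
export PATH=/usr/local/cuda/bin:$PATH
# point the build at the cuDNN installation (hypothetical location)
export CUDNN_PATH=/usr/local/cudnn
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable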

BrunoFANG1 commented 5 months ago

I solved this issue by simply running this command under the TransformerEngine directory:

git submodule update --init --recursive

I hope this helps you.
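For anyone building from an existing clone, the full sequence might look like this (a sketch, run from the repository root):

cd TransformerEngine
# fetch the bundled third-party sources (e.g. cudnn-frontend) that the CMake build expects
git submodule update --init --recursive
pip install .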

sfdeggb commented 2 months ago

I am also hitting this issue. The detailed error information is:

raise RuntimeError(f"Error when running CMake: {e}") RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/tmp/pip-req-build-yvwm9h7r/transformer_engine', '-B', '/tmp/pip-req-build-yvwm9h7r/build/cmake', DPython_EXECUTABLE=/home/ubuntu/train/aconconda/acondada/envs/yuxunlian/bin/python3.1', '-DPython_INCLUDE_DIR=/home/ubuntu/train/aconconda/acondada/envs/yuxunlian/include/python3.11', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-yvwm9h7r/build/lib.linux-x86_64-cpython-311', '-GNinja']' returned non-zero exit status 1.

My environment is: Ubuntu 22.04, CUDA 11.7, Python 3.11, torch 2.3.1, NVIDIA driver 535.183.06. Looking forward to a solution!

wplf commented 2 months ago

Hello, my friend! You can check whether nvcc is on your PATH:

nvcc --version

If that errors out, you may be able to fix it with something like:

export PATH=/usr/local/cuda/bin:$PATH

sfdeggb commented 2 months ago

@wplf Yeah! My nvcc seems OK! The information is below:

ubuntu@ip-172-31-38-93:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

Are there any other solutions?

wplf commented 2 months ago

Can you check your CMake version?
You can install CMake with pip install cmake.

sfdeggb commented 2 months ago

@wplf the cmake version is below:

(yuxunlian) ubuntu@ip-172-31-38-93:~$ cmake --version
cmake version 3.22.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).

Is this version appropriate?

wplf commented 2 months ago

Yes, that version is fine. Sorry, I can't help you any further.

sfdeggb commented 2 months ago

@wplf
It does not matter! Thank you for your reply!