NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Error in installing #887

Closed ziyang-arch closed 2 weeks ago

ziyang-arch commented 1 month ago

Two errors occurred when I tried to install this library:

  1. cmake.__file__ in line 81 of setup.py is None, which causes a TypeError when it is used as the cmake path. I can bypass this part and use the "search in path" step to find the cmake on my system.
  2. After fixing the cmake path, the build fails with fatal error: filesystem: No such file or directory.

My system spec is: Linux 3.10.0-1160.102.1.x86_64 x86_64 x86_64 GNU/Linux; CUDA 12.3; GPU: Tesla T4; cmake 3.28.0; GCC 11.2.1 (devtoolset-11).

timmoon10 commented 1 month ago

I think this error is happening because you do not have the cmake module installed but do have a directory called cmake in your path. Calling import cmake then loads the wrong thing, and cmake.__file__ is None: https://github.com/NVIDIA/TransformerEngine/blob/868c7d301bc2f61ec077884895999569b258f867/setup.py#L77

https://github.com/NVIDIA/TransformerEngine/pull/888 should fix this error.

If you are still running into errors after this, it could be that the build system can't find the CMake executable. In that case you should add the executable to your PATH or install CMake with pip install cmake.
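
The failure mode described above can be reproduced in isolation. The sketch below is not the code from setup.py or from the linked pull request: `module_dir` and `find_cmake` are hypothetical helpers, and the `data/bin/cmake` layout of the pip-installed package is an assumption. It shows that a bare directory imports as a namespace package with no usable `__file__`, and it falls back to searching PATH with `shutil.which`.

```python
import importlib
import shutil
from pathlib import Path
from typing import Optional


def module_dir(name: str) -> Optional[Path]:
    """Return the directory of a regular importable module, or None."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # A bare directory on sys.path imports as a namespace package whose
    # __file__ is None (or absent), so Path(module.__file__) would raise
    # TypeError. Treat that case the same as "module not installed".
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        return None
    return Path(module_file).resolve().parent


def find_cmake() -> Optional[str]:
    """Hypothetical helper: prefer the pip-installed CMake, then PATH."""
    pkg_dir = module_dir("cmake")
    if pkg_dir is not None:
        # Assumed layout of the pip "cmake" package.
        candidate = pkg_dir / "data" / "bin" / "cmake"
        if candidate.is_file():
            return str(candidate)
    # Fall back to whatever "cmake" resolves to on PATH (None if missing).
    return shutil.which("cmake")


if __name__ == "__main__":
    print(find_cmake())
```

Checking for a missing `__file__` before building a path is what avoids the TypeError; whether the fallback finds anything still depends on CMake being on PATH.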

ziyang-arch commented 1 month ago

Thank you! The first error can be fixed in various ways. My major difficulty is the second error. According to #459, GCC versions after 8.1 should provide the filesystem header. In my case, the compiler cannot find this header even with GCC 11.2.1:

`[1/9] Building CXX object common/CMakeFiles/transformer_engine.dir/util/cuda_runtime.cpp.o
  FAILED: common/CMakeFiles/transformer_engine.dir/util/cuda_runtime.cpp.o
  /usr/bin/c++ -Dtransformer_engine_EXPORTS -I/**/TransformerEngine/transformer_engine -I/**/TransformerEngine/transformer_engine/common/include -I/**/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/**/TransformerEngine/build/cmake/common/string_headers -isystem /usr/local/cuda-12.3/targets/x86_64-linux/include -O3 -DNDEBUG -std=gnu++1y -fPIC -MD -MT common/CMakeFiles/transformer_engine.dir/util/cuda_runtime.cpp.o -MF common/CMakeFiles/transformer_engine.dir/util/cuda_runtime.cpp.o.d -o common/CMakeFiles/transformer_engine.dir/util/cuda_runtime.cpp.o -c /**/TransformerEngine/transformer_engine/common/util/cuda_runtime.cpp
  /**/TransformerEngine/transformer_engine/common/util/cuda_runtime.cpp:7:22: fatal error: filesystem: No such file or directory
   #include <filesystem>
                        ^
  compilation terminated.
  [2/9] Building CXX object common/CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o
  FAILED: common/CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o
  /usr/bin/c++ -Dtransformer_engine_EXPORTS -I/**/TransformerEngine/transformer_engine -I/**/TransformerEngine/transformer_engine/common/include -I/**/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/**/TransformerEngine/build/cmake/common/string_headers -isystem /usr/local/cuda-12.3/targets/x86_64-linux/include -O3 -DNDEBUG -std=gnu++1y -fPIC -MD -MT common/CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o -MF common/CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o.d -o common/CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o -c /**/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp
  /**/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:8:22: fatal error: filesystem: No such file or directory
   #include <filesystem>
                        ^
  compilation terminated.
  [3/9] Building CXX object common/CMakeFiles/transformer_engine.dir/util/system.cpp.o
  FAILED: common/CMakeFiles/transformer_engine.dir/util/system.cpp.o
  /usr/bin/c++ -Dtransformer_engine_EXPORTS -I/**/TransformerEngine/transformer_engine -I/**/TransformerEngine/transformer_engine/common/include -I/**/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/**/TransformerEngine/build/cmake/common/string_headers -isystem /usr/local/cuda-12.3/targets/x86_64-linux/include -O3 -DNDEBUG -std=gnu++1y -fPIC -MD -MT common/CMakeFiles/transformer_engine.dir/util/system.cpp.o -MF common/CMakeFiles/transformer_engine.dir/util/system.cpp.o.d -o common/CMakeFiles/transformer_engine.dir/util/system.cpp.o -c /**/TransformerEngine/transformer_engine/common/util/system.cpp
  /**/TransformerEngine/transformer_engine/common/util/system.cpp:9:22: fatal error: filesystem: No such file or directory
   #include <filesystem>
                        ^
  compilation terminated.
  [4/9] Building CXX object common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
  FAILED: common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
  /usr/bin/c++ -Dtransformer_engine_EXPORTS -I/**/TransformerEngine/transformer_engine -I/**/TransformerEngine/transformer_engine/common/include -I/**/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/**/TransformerEngine/build/cmake/common/string_headers -isystem /usr/local/cuda-12.3/targets/x86_64-linux/include -O3 -DNDEBUG -std=gnu++1y -fPIC -MD -MT common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -MF common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o.d -o common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -c /**/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp
  In file included from /**/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include/cudnn_frontend_ConvDesc.h:32:0,
                   from /**/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include/cudnn_frontend.h:102,
                   from /**/TransformerEngine/transformer_engine/common/fused_attn/utils.h:14,
                   from /**/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp:9:
  /**/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include/cudnn_frontend_utils.h:25:20: fatal error: optional: No such file or directory
   #include <optional>
                      ^`

timmoon10 commented 1 month ago

Strange, filesystem and optional are both in the C++17 standard library, and we explicitly specify C++17 support: https://github.com/NVIDIA/TransformerEngine/blob/868c7d301bc2f61ec077884895999569b258f867/transformer_engine/CMakeLists.txt#L11

However, I see C++14 flags (-std=gnu++1y) in your build logs. I wonder if CMake is misconfiguring something.
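
One thing that can be checked here is which compiler the failing command actually runs. The log invokes /usr/bin/c++, which is not necessarily the devtoolset-11 GCC 11.2.1 listed in the system spec; that the two differ is only a possibility, not something this thread confirms. A minimal sketch, assuming the compilers accept GCC's -dumpversion option; `compiler_version` and `major` are hypothetical helper names:

```python
import shutil
import subprocess
from typing import Optional


def compiler_version(compiler: str) -> Optional[str]:
    """Return the output of `<compiler> -dumpversion`, or None on failure."""
    try:
        result = subprocess.run(
            [compiler, "-dumpversion"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


def major(version: str) -> int:
    """Extract the major version number from a string such as '11.2.1'."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    # Compare the compiler named in the build log with the ones on PATH.
    for compiler in ("/usr/bin/c++", shutil.which("c++"), shutil.which("g++")):
        if compiler is None:
            continue
        print(compiler, "->", compiler_version(compiler))
```

If the versions differ, CMake can usually be pointed at a specific compiler through the CXX environment variable or the CMAKE_CXX_COMPILER variable on a fresh configure.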

ziyang-arch commented 4 weeks ago

Yes. In TransformerEngine/transformer_engine/CMakeLists.txt the only standard setting is set(CMAKE_CXX_STANDARD 17), yet in TransformerEngine/build/cmake/build.ninja I noticed that the flag becomes -std=gnu++1y for some files while others still use C++17. I cannot find what causes -std=gnu++1y to be used in the Ninja file.
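
To narrow down which objects get the older flag, the -std= options in build.ninja can be grouped by output file. This is a rough sketch that assumes the usual layout of CMake's Ninja generator output (a `build <output>: ...` line followed by indented `FLAGS = ...` lines). `std_flags_by_output` is a hypothetical helper, and the sample text is illustrative rather than taken from the actual build directory.

```python
import re
from collections import defaultdict
from typing import Dict, List

STD_FLAG = re.compile(r"-std=\S+")


def std_flags_by_output(ninja_text: str) -> Dict[str, List[str]]:
    """Map each -std= flag to the build outputs whose FLAGS contain it."""
    outputs_by_flag: Dict[str, List[str]] = defaultdict(list)
    current_output = None
    for line in ninja_text.splitlines():
        if line.startswith("build "):
            # "build <output>: <rule> <inputs>" starts a new build statement.
            current_output = line[len("build "):].split(":", 1)[0].strip()
        elif current_output and line.strip().startswith("FLAGS ="):
            for flag in STD_FLAG.findall(line):
                outputs_by_flag[flag].append(current_output)
    return dict(outputs_by_flag)


if __name__ == "__main__":
    # Illustrative input only; read the real file with Path(...).read_text().
    sample = """\
build common/a.cpp.o: CXX_COMPILER a.cpp
  FLAGS = -O3 -std=gnu++1y -fPIC
build common/b.cu.o: CUDA_COMPILER b.cu
  FLAGS = -O3 -std=c++17
"""
    for flag, outputs in std_flags_by_output(sample).items():
        print(flag, outputs)
```

Grouping by flag makes it easy to see whether the split follows the source language (for example C++ versus CUDA objects) or something else.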

timmoon10 commented 2 weeks ago

If you are still debugging this, it may be helpful to pass the --verbose flag to pip install. This should print out the CMake build logs, which may give us clues as to why it's building with C++14 instead of C++17.

ziyang-arch commented 2 weeks ago

Thank you. I solved this problem by switching from cmake 3.28 to 3.29 and re-cloning the latest repo.