NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[PyTorch] Fix whl build for v1.8 #1050

Closed: viclzhu closed this 2 months ago

viclzhu commented 2 months ago

Description

Hi, when installing TransformerEngine v1.8, the build fails with an invalid redeclaration error. I do not see this error when installing v1.7 or main. Removing the half definition in userbuffers.cu appears to resolve the compilation issue, but I'm not sure what the root cause is: main and v1.7 contain the same definition, and both install fine for me with the same cuDNN version (8.9.7.29).

Any ideas what could be happening here? Thanks!

Install command: MAX_JOBS=16 pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8

Error:

 /opt/conda/include/cuda_fp16.hpp(2723): error: invalid redeclaration of type name "nv_bfloat16" (declared at line 2837 of /opt/conda/include/cuda_bf16.hpp)
        typedef __half nv_bfloat16;
                       ^

      1 error detected in the compilation of "/tmp/pip-req-build-_zyeql3z/transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu".
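
One plausible reading of the message above, stated as an assumption rather than a confirmed diagnosis: the typedef the compiler prints, `typedef __half nv_bfloat16;`, looks like a header line that originally ended in `half` and was rewritten by a preprocessor macro mapping `half` to `nv_bfloat16`. The sketch below reproduces that mechanism with stand-in names only. `fake_half`, `fake_bfloat16`, `demo_half`, `demo_bfloat16`, and `SHOW_REDECLARATION_ERROR` are all hypothetical and are not taken from the CUDA headers or from userbuffers.cu.

```cpp
#include <type_traits>

// Stand-in types; these are NOT the real CUDA definitions.
struct fake_half { unsigned short bits; };      // plays the role of __half
struct fake_bfloat16 { unsigned short bits; };  // plays the role of the bfloat16 type

// "Header A" (stand-in for the bfloat16 header): declares the bfloat16 alias.
typedef fake_bfloat16 demo_bfloat16;

// "Header B" (stand-in for the fp16 header) processed BEFORE the macro exists:
// the alias keeps its intended name and nothing conflicts.
typedef fake_half demo_half;

// The source file then defines a macro that renames one type to the other.
#define demo_half demo_bfloat16

#ifdef SHOW_REDECLARATION_ERROR
// If "Header B" is processed again AFTER the macro, the preprocessor rewrites
//   typedef fake_half demo_half;
// into
//   typedef fake_half demo_bfloat16;
// which conflicts with the alias already declared by "Header A".
typedef fake_half demo_half;
#endif

// From this point on, the token demo_half is replaced by demo_bfloat16.
constexpr bool half_macro_is_bfloat16() {
  return std::is_same<demo_half, fake_bfloat16>::value;
}
```

Compiled as-is with any C++11 compiler, the snippet builds cleanly. Adding `-DSHOW_REDECLARATION_ERROR` should make the compiler reject the second typedef as a conflicting redeclaration, which has the same shape as the error reported here. If this reading is right, whether the build fails depends on whether the fp16 header is first processed before or after the macro is defined, which could explain why the same definition compiles in one version and not in another.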

timmoon10 commented 2 months ago

Thanks for the bugfix, but this should already be fixed with https://github.com/NVIDIA/TransformerEngine/pull/949 (included in the main and release_v1.9 branches). See https://github.com/NVIDIA/TransformerEngine/pull/560 for a more detailed description of the bug.