Description

Hi, when installing TransformerEngine v1.8, I hit an invalid redeclaration error that does not occur with v1.7 or main. Removing the half definition in userbuffers.cu resolves the compilation failure, but I'm not sure what the root cause is: main and v1.7 contain the same definition, and both install fine for me with the same cuDNN version (8.9.7.29).

Any ideas what could be happening here? Thanks!

Install command: MAX_JOBS=16 pip install git+https://github.com/NVIDIA/TransformerEngine.git@v1.8

Error:

 /opt/conda/include/cuda_fp16.hpp(2723): error: invalid redeclaration of type name "nv_bfloat16" (declared at line 2837 of /opt/conda/include/cuda_bf16.hpp)
        typedef __half nv_bfloat16;
                       ^

      1 error detected in the compilation of "/tmp/pip-req-build-_zyeql3z/transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu".
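One way an error of this shape can arise, sketched here under assumptions rather than confirmed for this build: if the half definition in userbuffers.cu is a preprocessor macro mapping half to nv_bfloat16 (the report does not show that line), and it is active when cuda_fp16.hpp is read, then a header line of the form typedef __half half; would be rewritten into the line quoted in the error. The helper names TO_TEXT and expanded_header_line below are hypothetical and only make the expansion visible as a string.

```cpp
#include <cassert>
#include <cstring>

// Two-step stringification: the argument is macro-expanded first,
// then turned into a string literal.
#define TO_TEXT_IMPL(...) #__VA_ARGS__
#define TO_TEXT(...) TO_TEXT_IMPL(__VA_ARGS__)

// Assumption: a definition of this form is active before the header is read.
#define half nv_bfloat16

// Returns the text the preprocessor produces from `typedef __half half;`
// while the macro above is in effect.
inline const char* expanded_header_line() {
    return TO_TEXT(typedef __half half;);
}

// Keep the macro from leaking into code that follows.
#undef half
```

If that is what happens, the order in which the CUDA headers are first included relative to the macro would decide whether the build fails, which might explain why the same definition behaves differently across versions.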

Environment:

PyTorch==2.3.1
CUDA==12.1
cuDNN==8.9.7.29

Type of change

[ ] Documentation change (change only to the documentation, either a fix or new content)
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Infra/Build change
[ ] Code refactor

Changes

Please list the changes introduced in this PR:

Removed half definition in userbuffers.cu.

Checklist:

[x] I have read and followed the contributing guidelines
[ ] The functionality is complete
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] My changes generate no new warnings
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] New and existing unit tests pass locally with my changes

NVIDIA / TransformerEngine

[PyTorch] Fix whl build for v1.8 #1050