NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Compiling on Slurm cluster: fatal error: cudnn.h: No such file or directory #918

windprak commented 3 weeks ago

I'm trying to compile TE on a Slurm cluster because containers aren't fully supported there (MPI issues). My setup looks like this:


module load cuda/12.4.1
module load cmake/3.23.1 
module load git/2.35.2 
module load gcc/12.1.0
module load cudnn/9.1.0.70-12.x

source $WORK/venvs/megatron/bin/activate
python -m pip install --force-reinstall setuptools==69.5.1
python -m pip install nltk sentencepiece einops mpmath packaging numpy ninja wheel
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install wheel
MAX_JOBS=4 pip install flash-attn==2.4.2 --no-build-isolation
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

export CXXFLAGS=-isystem\ $CUDNN_ROOT/include
pip install git+https://github.com/NVIDIA/TransformerEngine.git@main  # or stable, it doesn't matter

All the environment variables echo correctly, and I can build Megatron-LM and Apex in this environment without problems. But not TE.

Error:

conda/envs/megatron/lib/python3.10/site-packages/torch/include/ATen/cudnn/cudnn-wrapper.h:3:10: fatal error: cudnn.h: No such file or directory
          3 | #include <cudnn.h>
            |          ^~~~~~~~~
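
A quick way to confirm the header is actually where the module environment points (a sketch, assuming the cudnn module sets CUDNN_ROOT, as the CXXFLAGS line in the script implies):

ls "$CUDNN_ROOT/include/cudnn.h"   # should print the path, not "No such file or directory"
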
timmoon10 commented 3 weeks ago

It looks like PyTorch's C++ extensions are configured with CUDNN_HOME or CUDNN_PATH: https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/torch/utils/cpp_extension.py#L209

PyTorch's own build is configured with CUDNN_ROOT: https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/cmake/Modules_CUDA_fix/FindCUDNN.cmake#L4
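
In other words, the CXXFLAGS export in the report is not what torch.utils.cpp_extension consults when building TE's extensions; per the line linked above, it reads CUDNN_HOME, falling back to CUDNN_PATH. A minimal sketch of the fix, assuming the cluster's cudnn module sets CUDNN_ROOT as the original script suggests:

# Point PyTorch's extension builder at the cuDNN tree the module loaded.
# cpp_extension reads CUDNN_HOME first, then falls back to CUDNN_PATH.
export CUDNN_HOME="$CUDNN_ROOT"
export CUDNN_PATH="$CUDNN_ROOT"
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable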

ywb2018 commented 1 week ago

So what can I do to handle this issue? Please give a clear and simple answer, thanks!

timmoon10 commented 1 week ago

export CUDNN_PATH=/path/to/cudnn
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
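
After a successful install, a quick smoke test (a hedged check; transformer_engine.pytorch is TE's PyTorch integration module) confirms the extensions built against cuDNN actually load:

python -c "import transformer_engine.pytorch; print('TE OK')"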