NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

How to install with CuDNN 9.0+ ? #930

Open tianyan01 opened 2 weeks ago

tianyan01 commented 2 weeks ago

I found that FP8 DPA needs cuDNN 9.0.1+, so I installed PyTorch 2.5.0+cu121. I had modified some source code, so I installed TransformerEngine from source, but the build failed. The log shows:

```
-- cudnn_adv_infer found at /usr/lib/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/lib/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/lib/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/lib/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/lib/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/lib/libcudnn_ops_train.so.
```

cuDNN 9.0+ no longer ships these *.so files. How can I fix this?
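[Editor's note] For context: cuDNN 9 merged the separate per-phase sub-libraries of cuDNN 8 (`*_infer.so` / `*_train.so`) into single components, which is why the names the build script looks for no longer exist. A minimal sketch of that renaming, assuming the merged cuDNN 9 layout (the mapping below is an illustration, not taken from TransformerEngine's build code):

```python
# Assumed mapping: cuDNN 8 split each component into infer/train
# shared objects; cuDNN 9 merges each pair into one library.
CUDNN8_TO_CUDNN9 = {
    "libcudnn_adv_infer.so": "libcudnn_adv.so",
    "libcudnn_adv_train.so": "libcudnn_adv.so",
    "libcudnn_cnn_infer.so": "libcudnn_cnn.so",
    "libcudnn_cnn_train.so": "libcudnn_cnn.so",
    "libcudnn_ops_infer.so": "libcudnn_ops.so",
    "libcudnn_ops_train.so": "libcudnn_ops.so",
}

def cudnn9_name(cudnn8_lib: str) -> str:
    """Return the cuDNN 9 library a cuDNN 8 sub-library was merged into.

    Names not in the mapping (e.g. libcudnn.so itself) pass through
    unchanged.
    """
    return CUDNN8_TO_CUDNN9.get(cudnn8_lib, cudnn8_lib)
```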

cyanguwa commented 2 weeks ago

Hi @tianyan01 , I think the easiest way to try out cuDNN might be the NGC PyTorch containers. For example, nvcr.io/nvidia/pytorch:24.03-py3 has cuDNN 9.0.0, and nvcr.io/nvidia/pytorch:24.04-py3 and nvcr.io/nvidia/pytorch:24.05-py3 have cuDNN 9.1.0. Otherwise, you can follow the installation instructions at https://developer.nvidia.com/cudnn.

When re-installing TransformerEngine, please clear the `build/` or `build_tools/build/` directory left over from the previous compilation before re-installing. Also, you can set environment variables for custom CUDA or cuDNN paths, like this: `CUDA_HOME=/path/to/cuda CUDNN_PATH=/path/to/cudnn pip -v install .`
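The clean-rebuild steps above can be sketched as a short shell recipe; the CUDA and cuDNN paths here are placeholders to adjust for your system:

```shell
# Run from the TransformerEngine source checkout.
# Remove stale build artifacts from any previous compilation attempt,
# so CMake re-detects CUDA and cuDNN from scratch.
rm -rf build/ build_tools/build/

# Point the build at custom CUDA and cuDNN installs, then reinstall.
# The paths below are placeholders, not known-good locations.
CUDA_HOME=/usr/local/cuda CUDNN_PATH=/usr/local/cudnn pip -v install .
```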

tianyan01 commented 2 weeks ago

> When re-installing TransformerEngine, please clear the build/ or build_tools/build/ directory from previous compilation first before trying to re-install. Also, you can try environment variables like this for custom CUDA or cuDNN paths: CUDA_HOME=/path to cuda/ CUDNN_PATH=/path to cudnn/ pip -v install .

Thanks! `CUDNN_PATH=/path/to/cudnn pip -v install .` worked. While installing from the main branch, I found a bug in transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu that was fixed in https://github.com/NVIDIA/TransformerEngine/pull/560, but reverted in https://github.com/NVIDIA/TransformerEngine/pull/757.