Open tianyan01 opened 2 weeks ago
Hi @tianyan01, I think the easiest way to try out cuDNN might be the NGC PyTorch containers. For example, nvcr.io/nvidia/pytorch:24.03-py3 has cuDNN 9.0.0, and nvcr.io/nvidia/pytorch:24.04-py3 and nvcr.io/nvidia/pytorch:24.05-py3 have cuDNN 9.1.0. Alternatively, you can follow the installation instructions at https://developer.nvidia.com/cudnn.
When re-installing TransformerEngine, please clear the build/ or build_tools/build/ directory left over from the previous compilation first. You can also set environment variables for custom CUDA or cuDNN paths, for example: CUDA_HOME=/path/to/cuda/ CUDNN_PATH=/path/to/cudnn/ pip -v install .
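In case it helps, here is a minimal Python sketch of the clean-then-reinstall steps described above. The helper names (clean_build_dirs, install_env, INSTALL_CMD) are made up for illustration and are not part of TransformerEngine; the directory names and environment variables are the ones mentioned in this reply.

```python
import os
import shutil
from pathlib import Path

# Directories named in the reply above; stale copies can carry over old CMake results.
STALE_DIRS = ("build", "build_tools/build")

# Same command as in the reply, split into arguments for subprocess.
INSTALL_CMD = ["pip", "-v", "install", "."]


def clean_build_dirs(repo_root):
    """Delete leftover build directories under repo_root; return the ones removed."""
    removed = []
    for rel in STALE_DIRS:
        path = Path(repo_root) / rel
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(rel)
    return removed


def install_env(cuda_home=None, cudnn_path=None, base_env=None):
    """Return a copy of the environment with CUDA_HOME / CUDNN_PATH set if given."""
    env = dict(os.environ if base_env is None else base_env)
    if cuda_home:
        env["CUDA_HOME"] = str(cuda_home)
    if cudnn_path:
        env["CUDNN_PATH"] = str(cudnn_path)
    return env
```

To actually run the install from a TransformerEngine checkout, something like subprocess.run(INSTALL_CMD, cwd=repo_root, env=install_env(cudnn_path="/path/to/cudnn"), check=True) after calling clean_build_dirs(repo_root) should be equivalent to the one-line shell command above.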
Thanks! "CUDNN_PATH=/path/to/cudnn/ pip -v install ." worked. While installing from the main branch, I found a bug in transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu that was fixed in https://github.com/NVIDIA/TransformerEngine/pull/560 but then changed back in https://github.com/NVIDIA/TransformerEngine/pull/757.
I found that FP8 DPA needs cuDNN 9.0.1+, so I installed PyTorch 2.5.0+cu121. I also modified some source code, so I installed TransformerEngine from source, but the build failed. I found this in the log:
-- cudnn_adv_infer found at /usr/lib/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/lib/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/lib/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/lib/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/lib/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/lib/libcudnn_ops_train.so.
The cuDNN 9.0+ libraries don't include these *.so files. How can I fix it?
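A quick way to check whether the build is picking up an older cuDNN from a system directory is to look for the split *_infer / *_train library names from the log. The sketch below is only a diagnostic aid under that assumption: the function names are hypothetical, and the pattern covers just the six library names shown in the log (optionally followed by a version suffix).

```python
import re
from pathlib import Path

# Matches the split library names from the log, e.g. libcudnn_adv_infer.so
# or a versioned variant such as libcudnn_ops_train.so.8.9.2.
LEGACY_CUDNN = re.compile(r"^libcudnn_(adv|cnn|ops)_(infer|train)\.so(\.[\d.]+)?$")


def legacy_cudnn_libs(names):
    """Return the sorted file names that use the split infer/train naming."""
    return sorted(n for n in names if LEGACY_CUDNN.match(n))


def scan_dir(lib_dir):
    """List split-style cuDNN libraries in lib_dir (empty list if it is missing)."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    return legacy_cudnn_libs(p.name for p in directory.iterdir())
```

If scan_dir("/usr/lib") returns a non-empty list, the build is probably finding that older install rather than the cuDNN 9 one, and it may be worth pointing the build at the intended cuDNN location explicitly.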