Open sirutBuasai opened 8 months ago
Hi @sirutBuasai, what is the cuDNN version you are using?
The cuDNN version installed with torch==2.1.2 is 8.9.2:
(megatron_bench) ubuntu@ip-10-0-0-88:~$ python -c "import torch;print(torch.backends.cudnn.version())"
8902
Hi @sirutBuasai , could you try upgrading to cuDNN 8.9.7+ please?
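For reference, a quick way to check whether the installed cuDNN meets that requirement is to decode the integer that torch reports (e.g. 8902 above means 8.9.2 under the cuDNN 8.x encoding). The helper below is a hypothetical sketch, not part of torch or TE:

```python
# Hypothetical helper: decode the integer returned by
# torch.backends.cudnn.version() under the cuDNN 8.x scheme,
# where e.g. 8902 encodes major=8, minor=9, patch=2.
# (cuDNN 9.x uses a different, larger encoding.)
def decode_cudnn_version(v):
    major = v // 1000
    minor = (v % 1000) // 100
    patch = v % 100
    return (major, minor, patch)

# 8902 from the output above decodes to 8.9.2, which is
# below the 8.9.7 minimum requested in this thread.
print(decode_cudnn_version(8902))            # (8, 9, 2)
print(decode_cudnn_version(8902) >= (8, 9, 7))  # False -> upgrade needed
```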
Will do, in the meantime, is there a TE version that is built with CuDNN 8.9.2?
I think it's probably v0.10, but I'd rather you roll forward with cuDNN than backward with TE. There's been a lot of development in the last year or so. If it's easier, you can use the NGC pytorch container, which has the latest TE (1.3) and cuDNN (9.0): nvcr.io/nvidia/pytorch:24.01-py3
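Using the suggested NGC container might look like the following sketch; the exact flags (mounts, shared memory, etc.) depend on your setup:

```shell
# Pull the NGC PyTorch container mentioned above (TE 1.3, cuDNN 9.0)
docker pull nvcr.io/nvidia/pytorch:24.01-py3

# Run it interactively with GPU access; --ipc=host is commonly
# needed for PyTorch dataloaders with multiple workers
docker run --gpus all --ipc=host -it --rm nvcr.io/nvidia/pytorch:24.01-py3
```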
@cyanguwa I think we still should catch this error from cuDNN Frontend and just disable cuDNN's implementation of attention in this case.
@sirutBuasai Was your problem solved? Could you tell me the solution. I meet the same problem.
@liu21yd, we ended up pinning TE to v0.10, but it is pretty old. I haven't tried upgrading cuDNN and TE together, but that would be the place to start.
Recently we observed similar issues with every combination of TE 1.4/1.7 and cuDNN 8.9.4/8.9.7. In our case, the fused_attn test in this repository also fails, and the frontend toolkit (Megatron-LM) doesn't work either.
Note that our operating system is Rocky Linux, not a Debian derivative.
As a workaround we eventually set NVTE_FUSED_ATTN=0 to disable the fused attention kernels, and the issue went away.
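That workaround can be applied in the shell (`export NVTE_FUSED_ATTN=0`) or from Python. A minimal sketch; to be safe, set the variable before transformer_engine is imported:

```python
import os

# Workaround reported in this thread: tell Transformer Engine to
# skip its cuDNN fused attention path and fall back to the
# unfused implementation. Set this before importing transformer_engine.
os.environ["NVTE_FUSED_ATTN"] = "0"

# import transformer_engine.pytorch as te  # TE now avoids fused attention
print(os.environ["NVTE_FUSED_ATTN"])
```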
Hi, we are currently running into a TransformerEngine-related error when running a GPT model on H100 GPUs (AWS p5.48xlarge). The error log is below.
Error:
Steps to reproduce:
conda env create -f megatron_bench.yml
conda activate megatron_bench
./install_deps.sh
./train.sh