cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph. #73

Open jlest01 opened 1 month ago

jlest01 commented 1 month ago

I am getting the error below right after starting the training.

[2024-09-14 18:24:18] [INFO] RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph.
[2024-09-14 18:24:19] [INFO] Traceback (most recent call last):
[2024-09-14 18:24:19] [INFO] File "/home/user/fluxgym/venv/bin/accelerate", line 8, in <module>
[2024-09-14 18:24:19] [INFO] sys.exit(main())

I have an NVIDIA RTX 4070 Ti (12 GB) GPU and followed all of the manual installation steps correctly. The script runs with the venv environment activated, and both the required dependencies and the PyTorch nightly build are installed.

Other settings:

$ nvidia-smi --version
NVIDIA-SMI version  : 550.107.02
NVML version        : 550.107
DRIVER version      : 550.107.02
CUDA Version        : 12.4

Torch version: torch==2.6.0.dev20240914+cu121

OS: Ubuntu 24.04
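
In case it helps, here is a quick check (just a sketch, assuming torch imports cleanly inside the activated venv) that prints the versions the training process actually sees:

# Versions as seen from inside the venv that launches the training.
import torch

print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("GPU: not visible to this process")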

jlest01 commented 1 month ago

I also tried Torch version 2.6.0.dev20240914+cu124 and got the same error.

jlest01 commented 1 month ago

With the latest stable version of Torch that error is gone, but the log has been stuck on the lines below for hours:

[2024-09-14 19:21:39] [INFO] epoch 1/16
[2024-09-14 19:21:39] [INFO] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2024-09-14 19:21:39] [INFO] To disable this warning, you can either:
[2024-09-14 19:21:39] [INFO] - Avoid using `tokenizers` before the fork if possible
[2024-09-14 19:21:39] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[2024-09-14 19:21:39] [INFO] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2024-09-14 19:21:39] [INFO] To disable this warning, you can either:
[2024-09-14 19:21:39] [INFO] - Avoid using `tokenizers` before the fork if possible
[2024-09-14 19:21:39] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[2024-09-14 19:21:39] [INFO] INFO     epoch is incremented. current_epoch: 0, epoch: 1                                                                                          train_util.py:672
[2024-09-14 19:21:39] [INFO] INFO     epoch is incremented. current_epoch: 0, epoch: 1                                                                                          train_util.py:672
[2024-09-14 19:21:46] [INFO] /home/user/fluxgym/env/lib/python3.12/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
[2024-09-14 19:21:46] [INFO] with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
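
The tokenizers warning itself is informational (parallelism is simply disabled after the fork) and should not be what stalls the run. If you want to silence it, one option (a sketch, assuming the variable can be set before anything touches the tokenizers library) is to set TOKENIZERS_PARALLELISM at the very top of the launching script:

# Disable tokenizers parallelism before any huggingface/tokenizers usage
# so the fork warning goes away.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

The same thing can be done by exporting TOKENIZERS_PARALLELISM=false in the shell before starting the UI, as the warning message itself suggests.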
adeerkhan commented 1 month ago

I also had this problem but solved it. I am on CUDA 12.1 and installed all the requirements as listed, except that for PyTorch I only ran: pip install --pre torch torchvision torchaudio
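
If anyone wants to confirm a reinstall actually fixed things without waiting on a full training run, a minimal check along these lines (my own sketch; the "No execution plans support the graph" message is often reported from the scaled-dot-product attention path in recent PyTorch builds) should either run cleanly or reproduce the same cuDNN Frontend error:

# Tiny bf16 attention call on the GPU; SDPA may dispatch to the cuDNN
# backend, which is where this error is often reported.
import torch
import torch.nn.functional as F

assert torch.cuda.is_available(), "CUDA is not visible to this interpreter"

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

out = F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()
print("SDPA OK:", out.shape, out.dtype)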

joecummings commented 1 month ago

@adeerkhan would you like to share how you fixed it? Or is your second comment the solution you found?

timlenardo commented 2 weeks ago

In an admittedly very different use case (I am training Dreambooth models with the diffusers example scripts), I was able to resolve this by updating to CUDA 12.6 and the "devel" variant of cuDNN, as described in this Stack Overflow post.