oracle9i88 opened this issue 2 months ago
This is a PyTorch error; are you perhaps using a 2080 Ti or lower GPU? It would also help to post the full log from the terminal window, as well as a brief description of what type of training you are attempting (LoRA, finetune) and for which model (SD1.5, SDXL, FLUX).
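For reference, a minimal sketch for pulling that information out of the training environment (this assumes a CUDA-enabled PyTorch install and uses device index 0, i.e. the first GPU):

```python
import torch

# Versions of the pieces that matter for this error
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)

# GPU model and CUDA compute capability.
# Ampere/Hopper cards report (8, x) or (9, x); Turing cards such as a
# 2080 Ti or 2060 Super report (7, 5), i.e. sm_75.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
```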
```
6ffb08b95e23 FLUX: Gradient checkpointing enabled.
6ffb08b95e23 prepare optimizer, data loader etc.
6ffb08b95e23 enable fp8 training for U-Net.
6ffb08b95e23 enable fp8 training for Text Encoder.
6ffb08b95e23 running training / 学習開始
6ffb08b95e23 num train images * repeats / 学習画像の数×繰り返し回数: 720
6ffb08b95e23 num reg images / 正則化画像の数: 0
6ffb08b95e23 num batches per epoch / 1epochのバッチ数: 720
6ffb08b95e23 num epochs / epoch数: 3
6ffb08b95e23 batch size per device / バッチサイズ: 1
6ffb08b95e23 gradient accumulation steps / 勾配を合計するステップ数 = 1
6ffb08b95e23 total optimization steps / 学習ステップ数: 1600
steps: 0%| | 0/1600 [00:00<?, ?it/s]
6ffb08b95e23 2024-09-15 19:50:08 INFO text_encoder is not needed for training. deleting to save memory. train_network.py:1033
6ffb08b95e23 unet dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1053
6ffb08b95e23 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:672
6ffb08b95e23
6ffb08b95e23 epoch 1/3
6ffb08b95e23 Traceback (most recent call last):
6ffb08b95e23 File "/app/sd-scripts/flux_train_network.py", line 520, in <module>
6ffb08b95e23 trainer.train(args)
6ffb08b95e23 File "/app/sd-scripts/train_network.py", line 1178, in train
6ffb08b95e23 accelerator.backward(loss)
6ffb08b95e23 File "/home/1000/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2155, in backward
6ffb08b95e23 self.scaler.scale(loss).backward(**kwargs)
6ffb08b95e23 File "/home/1000/.local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
6ffb08b95e23 torch.autograd.backward(
6ffb08b95e23 File "/home/1000/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
6ffb08b95e23 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
6ffb08b95e23 RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
steps: 0%| | 0/1600 [00:04<?, ?it/s]
6ffb08b95e23 Traceback (most recent call last):
6ffb08b95e23 File "/home/1000/.local/bin/accelerate", line 8, in <module>
6ffb08b95e23 sys.exit(main())
6ffb08b95e23 File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
6ffb08b95e23 args.func(args)
6ffb08b95e23 File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
6ffb08b95e23 simple_launcher(args)
6ffb08b95e23 File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
6ffb08b95e23 raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
6ffb08b95e23 subprocess.CalledProcessError: Command '['/usr/local/bin/python', '/app/sd-scripts/flux_train_network.py', '--config_file', '/app/data/MaxTraining/model/config_lora-20240915-194903.toml']' returned non-zero exit status
```
I am facing the same issue (at least based on the initial description). I am trying to train a FLUX LoRA on an RTX 2060 Super on Arch Linux (via Docker). So far I have managed not to run out of memory, but training ends with that error. My config: flux_lora_4.json
@eftSharptooth Are 2080 Ti and lower GPUs not supported by PyTorch? My compute capability is supposed to be 7.5, but I don't know whether that is enough.
I was able to train SDXL, though. For that it was using torch==2.1.2+cu118 instead of torch==2.4.0+cu124, so either FLUX training really does require more features, or the new torch version changed something internally here.
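For anyone comparing the two environments, a quick way to see which SM targets a given torch wheel ships kernels for versus what the card reports (a small sketch, not taken from the kohya scripts):

```python
import torch

# SM architectures compiled into this torch build,
# e.g. ['sm_50', ..., 'sm_75', 'sm_80', 'sm_90']
print(torch.cuda.get_arch_list())

# Compute capability of the GPU actually in use
# (prints sm_75 on a 2060 Super / 2080 Ti)
major, minor = torch.cuda.get_device_capability(0)
print(f"sm_{major}{minor}")
```

Note that `get_arch_list()` only tells you what the wheel was built for; it does not say which kernel raised the runtime check.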
nvidia-smi (Sun Sep 15 22:07:06 2024): NVIDIA-SMI 550.107.02, Driver Version: 550.107.02, CUDA Version: 12.4
Edit: From https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ it seems the Turing architecture corresponds to SM75. So yes, it is not SM80 or SM90.
Also happens with torch==2.1.2+cu118 torchvision==0.16.2+cu118 xformers==0.0.23.post1+cu118
Workaround by @chenxluo here: https://github.com/bmaltais/kohya_ss/issues/2717#issuecomment-2366769178. It works for me on a 2060 Super, although training ultimately has no effect; I don't yet know what is causing that.
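I have not checked what exactly that workaround patches, but as far as I can tell the check behind this error lives in the fused scaled_dot_product_attention kernels, and PyTorch can be told to fall back to the plain math backend, which has no sm_80/sm_90 requirement. A rough sketch of that idea only (not the linked patch; it will be slower and use more VRAM, and it would have to run inside the training process, e.g. near the top of flux_train_network.py):

```python
import torch
import torch.nn.functional as F

# Disable the fused SDPA backends that are restricted to newer GPUs
# and keep only the math fallback, which runs on sm_75.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)

# Tiny smoke test that the fallback path works, including backward.
q = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16, requires_grad=True)
out = F.scaled_dot_product_attention(q, q, q)
out.sum().backward()
print("SDPA math fallback OK")
```

Whether this is equivalent to what the linked comment does I can't say; it is just the generic way to steer PyTorch away from the sm_80/sm_90-only attention paths.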