bmaltais / kohya_ss


flux training on a 2080ti failed #2717

Open chenxluo opened 2 months ago

chenxluo commented 2 months ago

I tried flux training on a 2080ti with 22GB of VRAM, but I keep getting an error:

Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

It seems to be related to the fp8 or sdpa optimization; the 2080 Ti is sm75. I want to know whether this means the 2080 Ti is not supported, or if it's just a current bug?
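For reference, here is a quick standalone check of the compute capability plus an experiment that forces PyTorch's SDPA to skip the flash kernel. This is only a sketch: whether torch.backends.cuda.sdp_kernel is available and whether disabling the flash backend avoids the is_sm80 || is_sm90 assertion depends on the installed PyTorch version.

    import torch

    # The 2080 Ti reports compute capability (7, 5), i.e. sm75.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: sm{major}{minor}")

    # Try SDPA with the flash kernel disabled, so only backends that can run
    # on sm75 are eligible. Shapes below are arbitrary test values.
    with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                        enable_mem_efficient=True,
                                        enable_math=True):
        q = torch.randn(1, 8, 16, 64, device="cuda", dtype=torch.float16,
                        requires_grad=True)
        k = torch.randn_like(q, requires_grad=True)
        v = torch.randn_like(q, requires_grad=True)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        out.sum().backward()  # the original error surfaced in the backward pass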

chenxluo commented 2 months ago

The training configuration and log are as follows:

ae = "D:\\ComfyUI-aki-v1.3\\models\\vae\\ae.safetensors"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_extension = ".txt"
clip_l = "D:\\ComfyUI-aki-v1.3\\models\\clip\\clip_l.safetensors"
clip_skip = 1
discrete_flow_shift = 3.0
dynamo_backend = "no"
enable_bucket = true
epoch = 20
fp8_base = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 3.5
huber_c = 0.1
huber_schedule = "snr"
logging_dir = "D:/kohya_ss/logs"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_train_epochs = 20
max_train_steps = 7920
min_bucket_reso = 256
min_snr_gamma = 5
mixed_precision = "fp8"
model_prediction_type = "raw"
network_alpha = 4
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset = 0.1
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "Adafactor"
output_dir = "D:/kohya_ss/outputs"
output_name = "fxT"
pretrained_model_name_or_path = "D:/ComfyUI-aki-v1.3/models/unet/flux1DevFp8_v10.safetensors"
prior_loss_weight = 1
resolution = "640,640"
sample_every_n_epochs = 2
sample_prompts = "D:/kohya_ss/outputs\\sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 2
save_model_as = "safetensors"
save_precision = "fp16"
sdpa = true
t5xxl = "D:\\ComfyUI-aki-v1.3\\models\\clip\\t5xxl_fp8_e4m3fn.safetensors"
t5xxl_max_token_length = 512
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "single"
train_data_dir = "D:\\lora训练界面\\lora-scripts\\train\\redtea"
unet_lr = 0.002
wandb_run_name = "fxT"

epoch 1/20
INFO epoch is incremented. current_epoch: 0, epoch: 1    train_util.py:668
Traceback (most recent call last):
  File "D:\kohya_ss\sd-scripts\flux_train_network.py", line 408, in <module>
    trainer.train(args)
  File "D:\kohya_ss\sd-scripts\train_network.py", line 1129, in train
    accelerator.backward(loss)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "D:\kohya_ss\venv\lib\site-packages\torch\_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "D:\kohya_ss\venv\lib\site-packages\torch\autograd\__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
steps: 0%| | 0/7920 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\momery\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\momery\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\\kohya_ss\\venv\\Scripts\\python.exe', 'D:/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', 'D:/kohya_ss/outputs/config_lora-20240819-115051.toml', '--mixed_precision', 'fp16', '--offload_optimizer_device', 'cpu', '--offload_param_device', 'cpu']' returned non-zero exit status 1.

chenxluo commented 1 month ago

I modified the SDP-related functions to get past this error, but the training then only produced loss=NaN. In the current version the original error no longer appears, but on the 2080 Ti I still only get loss=NaN. I tried deleting the cached latents and using no_half_vae, and still got loss=NaN. For comparison, I trained with the same dataset and parameters on a 4070tis and it succeeded. Maybe the 2080 Ti simply can't train Flux? That is the frustrating conclusion I have reached for now. If anyone makes progress training Flux on a 2080 Ti, please let me know.
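For anyone trying to narrow this down, here is a minimal probe (my own sketch, not part of sd-scripts) that can be dropped into the training step to see where the NaNs first appear, e.g. in the cached latents, the model prediction, or only in the loss:

    import torch

    def check_finite(name: str, tensor: torch.Tensor) -> None:
        # Warn as soon as a tensor contains NaN or Inf values.
        if not torch.isfinite(tensor).all():
            print(f"[nan-probe] non-finite values in {name}: "
                  f"nan={torch.isnan(tensor).any().item()}, "
                  f"inf={torch.isinf(tensor).any().item()}")

    # Hypothetical placement inside the training loop:
    # check_finite("latents", latents)
    # check_finite("model_pred", model_pred)
    # check_finite("loss", loss)

    # Alternatively, torch.autograd.set_detect_anomaly(True) makes the backward
    # pass report the first operation that produced NaN gradients (slow, debug only).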

maxanier commented 1 month ago

Can you share your modification of the related functions?

I am having the same issue: https://github.com/bmaltais/kohya_ss/issues/2720

chenxluo commented 1 month ago

> Can you share your modification of the related functions?
>
> I am having the same issue: https://github.com/bmaltais/kohya_ss/issues/2720

Sure. [Screenshot 2024-09-22 203129] The file path is kohya_ss\sd-scripts\library\flux_models.py

The added code is as follows:

    # The import can live at the top of flux_models.py; an inline import also works.
    import xformers.ops

    # xformers expects (batch, seq_len, heads, head_dim), so swap the head and
    # sequence axes before the call and swap them back afterwards.
    # rearrange is einops.rearrange, already used in this file.
    q, k, v = map(lambda t: rearrange(t, "b h n d -> b n h d"), (q, k, v))
    x = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None)
    x = rearrange(x, "b n h d -> b h n d")

    # This final reshape is the line that already exists in the original function.
    x = rearrange(x, "B H L D -> B L (H D)")

    return x

When you copy the code, pay attention to the indentation.
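As a standalone sanity check that the xformers kernel used by this patch actually runs on the card, something like the following should work (a sketch assuming xformers and einops are installed; shapes are arbitrary test values):

    import torch
    import xformers.ops
    from einops import rearrange

    # q/k/v in the (batch, heads, seq_len, head_dim) layout used by flux_models.py.
    q = torch.randn(1, 8, 16, 64, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # xformers wants (batch, seq_len, heads, head_dim), so swap axes around the call.
    q_, k_, v_ = (rearrange(t, "b h n d -> b n h d") for t in (q, k, v))
    x = xformers.ops.memory_efficient_attention(q_, k_, v_, attn_bias=None)
    x = rearrange(x, "b n h d -> b h n d")
    print(x.shape)  # expected: torch.Size([1, 8, 16, 64])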

maxanier commented 1 month ago

Thank you. Training is running for me with this fix. However, I also get 480/1600 [43:23<1:41:14, 5.42s/it, avr_loss=nan]; not sure if this is the same issue you are having. The modification suggested in https://github.com/kohya-ss/sd-scripts/issues/293#issuecomment-1537365038 for Turing cards does not help, unfortunately.

I am not sure whether this affects training, though. I still wasn't able to train a full LoRA, as I now ran out of memory after 480 steps. I am not sure whether the required VRAM grows with the number of training steps or whether there was an unrelated spike in demand/availability. I will keep trying.

Edit: After stopping plasma and kwin and running from Lora, I successfully ran through 3 epochs. However, it seems the training did not succeed: the resulting LoRA has no effect when applied. So maybe the avr_loss=nan is indeed an issue.

maxanier commented 1 month ago

The second attempt was also unsuccessful. I increased rank and alpha to 32 and increased the learning rate, but the produced LoRA (after epoch 3) still has no effect on the output (the same picture with the trigger word and the same seed, whether at strength 0.1 or strength 10). The settings are derived from https://github.com/kohya-ss/sd-scripts/issues/1595#issuecomment-2374086581, which otherwise produces LoRAs that do have an effect.

So I assume the avr_loss=nan is indeed an issue. @chenxluo, do you think your modification may be related to this? Did you test whether it works on your 4070Ti?
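One way to confirm this would be to check whether the saved LoRA actually contains NaN weights, which would match the avr_loss=nan symptom. A minimal sketch using safetensors, assuming the output file is fxT.safetensors as named in the config above (the path below is a placeholder):

    import torch
    from safetensors.torch import load_file

    # Point this at the LoRA written to output_dir.
    state = load_file("D:/kohya_ss/outputs/fxT.safetensors")

    bad = 0
    for name, tensor in state.items():
        if not torch.isfinite(tensor).all():
            bad += 1
            print(f"non-finite weights in {name}")
    print(f"{bad} of {len(state)} tensors contain NaN/Inf")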