bmaltais / kohya_ss

Apache License 2.0

I'm using a 4090 to fine-tune the large SDXL model, but it keeps reporting that CUDA memory is insufficient even with the batch size set to 1. #2644

Open lixida123 opened 1 month ago

lixida123 commented 1 month ago

```
running training / 学習開始
  num examples / サンプル数: 6420
  num batches per epoch / 1epochのバッチ数: 6420
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 3000
steps:   0%|          | 0/3000 [00:00<?, ?it/s]
epoch 1/1
Traceback (most recent call last):
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/sd-scripts/sdxl_train.py", line 818, in <module>
    train(args)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/sd-scripts/sdxl_train.py", line 628, in train
    optimizer.step()
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/accelerate/optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 416, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 315, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/accelerate/optimizer.py", line 185, in patched_step
    return method(*args, **kwargs)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 297, in step
    self.init_state(group, p, gindex, pindex)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 478, in init_state
    state["state1"] = self.get_state_buffer(p, dtype=torch.uint8)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 337, in get_state_buffer
    return torch.zeros_like(p, dtype=dtype, device=p.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 14.81 MiB is free. Process 378976 has 23.62 GiB memory in use. Of the allocated memory 22.68 GiB is allocated by PyTorch, and 464.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|          | 0/3000 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/lanyun-tmp/webui/kohya/kohya_ss/myenv/bin/python3', '/root/lanyun-tmp/webui/kohya/kohya_ss/sd-scripts/sdxl_train.py', '--config_file', '/root/lanyun-tmp/webui/config_dreambooth-20240714-125801.toml']' returned non-zero exit status 1.
12:58:59-972829 INFO Training has ended.
```
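The OOM message itself suggests setting `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` to reduce allocator fragmentation. A minimal sketch of that, set before relaunching training (the 512 MiB value is an example, not taken from the log):

```shell
# Ask PyTorch's caching allocator to split large blocks, which can reduce
# "reserved but unallocated" fragmentation. 512 is an assumed example value.
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"

# Then rerun the same launch command, e.g.:
# accelerate launch sd-scripts/sdxl_train.py --config_file <your_config>.toml
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Note this only mitigates fragmentation; it does not shrink the ~22 GiB actually allocated, so it may not be sufficient on its own.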

b-fission commented 1 month ago

You'll need to enable "Gradient checkpointing" and "Full fp16 training" to train SDXL on 24 GB of VRAM.
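For reference, those two GUI checkboxes should correspond to options in the generated training TOML roughly as below. This is a sketch based on the sd-scripts option names; check the config the GUI actually writes:

```toml
# Assumed excerpt of the kohya_ss training config TOML.
gradient_checkpointing = true   # recompute activations in backward instead of storing them
full_fp16 = true                # keep gradients in fp16 as well, saving more VRAM
mixed_precision = "fp16"        # full_fp16 requires fp16 mixed precision
```

Both options trade some speed and numerical precision for substantially lower VRAM use.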

lixida123 commented 1 month ago

> You'll need to enable "Gradient checkpointing" and "Full fp16 training" to train SDXL on 24 GB of VRAM.

Thanks for answering. It's working normally now.