Closed oliverban closed 8 months ago
Yep, something is definitely up with the new commit. Reverting to 1e395ed285385a17b39f3190b330220d29bde0ba makes it work like before: training takes about 21 GB of VRAM again (the latest commit with the same settings wants over 56 GB) and it trains like it used to. Just wanted to let people know in case you are having issues.
To revert, just run:
git reset 1e395ed285385a17b39f3190b330220d29bde0ba
Hello!
The latest commit breaks all my previous SDXL settings, which worked like a charm. Even with the settings turned WAY down, it tries to allocate too much VRAM for training and goes OOM. I've trained about 10 SDXL LoRAs at this point, so it is NOT user error.
I get the following error:
Traceback (most recent call last):
File "C:\Users\Oliver\Documents\Github\kohya_ss\sdxl_train.py", line 753, in <module>
train(args)
File "C:\Users\Oliver\Documents\Github\kohya_ss\sdxl_train.py", line 574, in train
optimizer.step()
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 132, in step
self.scaler.step(self.optimizer, closure)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 374, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 290, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 185, in patched_step
return method(*args, **kwargs)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\optim\lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\prodigyopt\prodigy.py", line 166, in step
state['p0'] = p.detach().clone()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 24.00 GiB total capacity; 51.84 GiB already allocated; 0 bytes free; 53.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0%| | 0/2800 [00:42<?, ?it/s]
Traceback (most recent call last):
File "c:\python\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\python\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
simple_launcher(args)
File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Oliver\Documents\Github\kohya_ss\venv\Scripts\python.exe', './sdxl_train.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=C:/Users/Oliver/Documents/Github/stable-diffusion-webui/models/Stable-diffusion/RealitiesEdge Art XL.safetensors', '--train_data_dir=C:/Users/Oliver/Pictures/LoRA Outputs/MoebiusFinished/img', '--resolution=1024,1024', '--output_dir=C:/Users/Oliver/Pictures/LoRA Outputs/MoebiusFinished/model', '--logging_dir=C:/Users/Oliver/Pictures/LoRA Outputs/MoebiusFinished/log', '--save_model_as=safetensors', '--output_name=MoebiusXL', '--lr_scheduler_num_cycles=10', '--max_data_loader_n_workers=0', '--learning_rate=1.0', '--lr_scheduler=cosine', '--train_batch_size=1', '--max_train_steps=2800', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=12345', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Prodigy', '--optimizer_args', 'decouple=True', 'weight_decay=0.5', 'betas=0.9,0.99', 'use_bias_correction=False', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--gradient_checkpointing', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0', '--max_grad_norm=0']' returned non-zero exit status 1.
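For anyone reading the numbers in the OOM message from the first traceback above, here is a rough sketch in plain Python (no torch needed) that pulls out the GiB figures and compares them. The helper name parse_oom is made up for illustration and is not part of kohya_ss or PyTorch. The gap between reserved and allocated is small, so the max_split_size_mb fragmentation hint probably isn't the fix here. Allocated memory is well above the card's 24 GiB, which may mean (this is an assumption, not something the log confirms) that the driver is spilling into shared system memory.

```python
import re

# OOM message copied from the traceback above (progress-bar text omitted).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 58.00 MiB "
    "(GPU 0; 24.00 GiB total capacity; 51.84 GiB already allocated; "
    "0 bytes free; 53.20 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Hypothetical helper: extract the GiB figures from the OOM message."""
    fields = {
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "reserved": r"([\d.]+) GiB reserved in total",
    }
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in fields.items()
    }


stats = parse_oom(OOM_MESSAGE)

# Memory PyTorch has reserved but not handed out to tensors.
slack = stats["reserved"] - stats["allocated"]
# How far the allocated memory exceeds the card's reported capacity.
over_capacity = stats["allocated"] - stats["total"]

print(f"reserved - allocated: {slack:.2f} GiB")          # 1.36 GiB
print(f"allocated - capacity: {over_capacity:.2f} GiB")  # 27.84 GiB
```

The 27.84 GiB overshoot is in the same ballpark as the difference between the 21 GB and 56 GB figures reported at the top of this thread.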