bmaltais / kohya_ss


Something has changed, destroying SDXL training on the last commit #1583

Closed oliverban closed 8 months ago

oliverban commented 1 year ago

Hello!

The latest commit breaks all my previous SDXL settings that worked like a charm. Even with the settings turned WAY down it tries to allocate far too much VRAM for training and goes OOM. I've trained about 10 SDXL LoRAs at this point, so it is NOT user error.

I get the following error:

Traceback (most recent call last):
  File "C:\Users\Oliver\Documents\Github\kohya_ss\sdxl_train.py", line 753, in <module>
    train(args)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\sdxl_train.py", line 574, in train
    optimizer.step()
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 374, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 290, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 185, in patched_step
    return method(*args, **kwargs)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\optim\lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\prodigyopt\prodigy.py", line 166, in step
    state['p0'] = p.detach().clone()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 24.00 GiB total capacity; 51.84 GiB already allocated; 0 bytes free; 53.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|          | 0/2800 [00:42<?, ?it/s]

Traceback (most recent call last):
  File "c:\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "C:\Users\Oliver\Documents\Github\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Oliver\Documents\Github\kohya_ss\venv\Scripts\python.exe', './sdxl_train.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=C:/Users/Oliver/Documents/Github/stable-diffusion-webui/models/Stable-diffusion/RealitiesEdge Art XL.safetensors', '--train_data_dir=C:/Users/Oliver/Pictures/LoRA Outputs/MoebiusFinished/img', '--resolution=1024,1024', '--output_dir=C:/Users/Oliver/Pictures/LoRA Outputs/MoebiusFinished/model', '--logging_dir=C:/Users/Oliver/Pictures/LoRA Outputs/MoebiusFinished/log', '--save_model_as=safetensors', '--output_name=MoebiusXL', '--lr_scheduler_num_cycles=10', '--max_data_loader_n_workers=0', '--learning_rate=1.0', '--lr_scheduler=cosine', '--train_batch_size=1', '--max_train_steps=2800', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=12345', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Prodigy', '--optimizer_args', 'decouple=True', 'weight_decay=0.5', 'betas=0.9,0.99', 'use_bias_correction=False', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--gradient_checkpointing', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0', '--max_grad_norm=0']' returned non-zero exit status 1.
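For what it's worth, the allocator hint from that OOM message is set through an environment variable in the shell (Windows cmd here) that launches training. The 512 below is just an example value, and it only helps with fragmentation, so it won't save a run that genuinely wants ~52 GiB on a 24 GiB card:

set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512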

oliverban commented 1 year ago

Yep, something is definitely up with the new commit. Reverting back to 1e395ed285385a17b39f3190b330220d29bde0ba makes it work like before: training now takes about 21 GB of VRAM (the latest commit with the same settings wants over 56 GB) and it trains like it used to. Just wanted to let people know in case you are having issues.

To revert, run this from the kohya_ss folder (--hard is what actually moves the working tree back to that commit; a plain git reset only moves the branch pointer and index):

git reset --hard 1e395ed285385a17b39f3190b330220d29bde0ba
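
To get back onto the newest code afterwards (assuming the default master branch and no local changes you want to keep), something like this should do it:

git fetch origin
git reset --hard origin/master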