bmaltais / kohya_ss

Apache License 2.0

After update, I get OutOfMemoryError (CUDA out of memory) on 4090 #1048

Closed: dalcefo closed this issue 8 months ago

dalcefo commented 1 year ago

Hello. I don't know much about coding, but so far I have made many models with kohya_ss.

This is the first time I've ever seen something like this, so I'm at a loss. My GPU is an RTX 4090, and I don't think the rest of the PC's performance is lacking either.

Of course, I've never had this problem when training with kohya_ss until now.

Since this update, the problem occurs when training based on SD 2.1, i.e. training with the v2 parameter. SD 1.5 training still works fine.

And even though I've made so many models, this is the first time I've encountered this problem.

I'm asking for help because this is setting back my work. Can someone help me by looking at this log output, which reads like an alien language to me?

Here is my current virtual environment:

Version: 4275c12 Fri Jun 23 13:23:25 2023 -0400
nVidia toolkit detected
Torch 2.0.1+cu118
Torch backend: nVidia CUDA 11.8 cuDNN 8700
Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128
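For reference, the Torch-related lines in that report can be reproduced with a short snippet. This is a minimal sketch that assumes only that PyTorch is installed; the Version hash at the top comes from git, not from Torch.

# Minimal sketch: print the Torch/CUDA/cuDNN versions and GPU properties
# that appear in the startup report above. Assumes only that torch is importable.
import torch

print("Torch", torch.__version__)
print("CUDA", torch.version.cuda, "cuDNN", torch.backends.cudnn.version())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name} "
          f"VRAM {props.total_memory // 2**20} "
          f"Arch ({props.major}, {props.minor}) "
          f"Cores {props.multi_processor_count}")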


A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
prepare tokenizer
update token length: 225
prepare images.
found directory C:\train\photo-ricepaper\img\50_dal 1girl contains 159 image files
found directory C:\train\photo-ricepaper\img\50_dalcefo watermark contains 1 image files
8000 train images with repeating.
0 reg images.
no regularization images / no regularization images were found
[Dataset 0]
  batch_size: 1
  resolution: (768, 768)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: False

[Subset 0 of Dataset 0]
  image_dir: "C:\train\photo-ricepaper\img\50_dal 1girl"
  image_count: 159
  num_repeats: 50
  shuffle_caption: False
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1, token_warmup_step: 0
  is_reg: False
  class_tokens: dal 1girl
  caption_extension: .txt

[Subset 1 of Dataset 0]
  image_dir: "C:\train\photo-ricepaper\img\50_dalcefo watermark"
  image_count: 1
  num_repeats: 50
  shuffle_caption: False
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1, token_warmup_step: 0
  is_reg: False
  class_tokens: dalcefo watermark
  caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████| 160/160 [00:00<00:00, 9999.83it/s]
make buckets
number of images (including repeats) / number of images per bucket (including repeats)
bucket 0: resolution (640, 896), count: 6050
bucket 1: resolution (768, 768), count: 50
bucket 2: resolution (896, 640), count: 1900
mean ar error (without repeats): 0.059480368275293204
prepare accelerator
C:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:258: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint: C:/sd/webui/models/Stable-diffusion/2.1/zPhyrInsanity21768_zphyrinsanity.safetensors
C:\kohya_ss\venv\lib\site-packages\safetensors\torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
loading u-net:
loading vae:
loading text encoder:
CrossAttention.forward has been replaced to enable xformers.
[Dataset 0]
caching latents.
100%|██████████| 160/160 [00:23<00:00, 6.71it/s]
prepare optimizer, data loader etc.
use AdamW optimizer | {}
running training / starting training
  num train images * repeats / number of training images x repeat count: 8000
  num reg images / number of regularization images: 0
  num batches per epoch / number of batches per epoch: 8000
  num epochs / number of epochs: 2
  batch size per device / batch size: 1
  total train batch size (with parallel & distributed & accumulation) / total batch size (including parallel training and gradient accumulation): 1
  gradient accumulation steps / number of gradient accumulation steps = 1
  total optimization steps / number of training steps: 16000
steps:   0%|          | 0/16000 [00:00<?, ?it/s]
epoch 1/2
Traceback (most recent call last):
  C:\kohya_ss\train_db.py:486 in <module>
    train(args)
  C:\kohya_ss\train_db.py:350 in train
    optimizer.step()
  C:\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py:134 in step
    self.scaler.step(self.optimizer, closure)
  C:\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py:374 in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  C:\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py:290 in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  C:\kohya_ss\venv\lib\site-packages\torch\optim\lr_scheduler.py:69 in wrapper
    return wrapped(*args, **kwargs)
  C:\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py:280 in wrapper
    out = func(*args, **kwargs)
  C:\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py:33 in _use_grad
    ret = func(self, *args, **kwargs)
  C:\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py:171 in step
    adamw(
  C:\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py:321 in adamw
    func(
  C:\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py:566 in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.99 GiB total capacity; 22.81 GiB already allocated; 0 bytes free; 22.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|          | 0/16000 [00:01<?, ?it/s]
Traceback (most recent call last):
  C:\Users\AE230003003\AppData\Local\Programs\Python\Python310\lib\runpy.py:196 in _run_module_as_main
    return _run_code(code, main_globals, None, "__main__", mod_spec)
  C:\Users\AE230003003\AppData\Local\Programs\Python\Python310\lib\runpy.py:86 in _run_code
    exec(code, run_globals)
  in <module>:7
    sys.exit(main())
  C:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py:45 in main
    args.func(args)
  C:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:918 in launch_command
    simple_launcher(args)
  C:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:580 in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
CalledProcessError: Command '['C:\kohya_ss\venv\Scripts\python.exe', 'train_db.py', '--v2', '--v_parameterization', '--enable_bucket', '--pretrained_model_name_or_path=C:/sd/webui/models/Stable-diffusion/2.1/zPhyrInsanity21768_zphyrinsanity.safetensors', '--train_data_dir=C:/train/photo-ricepaper/img', '--resolution=768,768', '--output_dir=C:/train/photo-ricepaper', '--logging_dir=C:/train/photo-ricepaper', '--save_model_as=safetensors', '--output_name=dalcefo_photo-ricepaper-21', '--max_token_length=225', '--max_data_loader_n_workers=0', '--learning_rate=3e-06', '--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=1600', '--train_batch_size=1', '--max_train_steps=16000', '--save_every_n_epochs=2', '--mixed_precision=fp16', '--save_precision=fp16', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW', '--max_data_loader_n_workers=0', '--max_token_length=225', '--bucket_reso_steps=64', '--gradient_checkpointing', '--xformers']' returned non-zero exit status 1.
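The OutOfMemoryError above already hints at one mitigation: when reserved memory is much larger than allocated memory, allocator fragmentation can be reduced by setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF. Below is a minimal sketch of what that looks like; it assumes the variable is set before CUDA is initialized, and the value 512 MiB is only an illustrative choice, not a recommendation from this thread.

# Minimal sketch: apply the allocator hint from the error message.
# Assumption: this runs before any CUDA tensor is created; when launching
# kohya_ss through accelerate, the variable would instead be set in the shell
# that starts the process. The value 512 (MiB) is only an example.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch

if torch.cuda.is_available():
    # Reserved >> allocated is the fragmentation pattern the message refers to.
    allocated_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"allocated: {allocated_mib:.0f} MiB, reserved: {reserved_mib:.0f} MiB")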

bmaltais commented 1 year ago

You might want to revert to the previous release that worked for you... It's possible some new code is causing this unexpected issue. At least you can stay on the previous release for a while until a new release fixes it.

dalcefo commented 1 year ago

OK, thanks. But which previous release would work well in my environment? It's been a while since I last updated, and I'm really ignorant about these code issues. I know it's a hassle, but could you give me a recommendation? The version I had before this one had an anomaly where training ran many times slower than normal. As always, thank you for your work.

bmaltais commented 1 year ago

Hard to tell... but you could try with:

git checkout v20.8.2

moxSedai commented 1 year ago

There's no gui.sh in v20.8.2

dalcefo commented 1 year ago

git checkout v20.8.2

Thank you. There's nothing else I can do anyway, so I'll give it a try. I appreciate everything you do.

ruiboy1919 commented 1 year ago

Run the command taskkill /im cmd.exe in CMD to try to free GPU memory. This problem happens often. If you're unsure what the command does, look it up first.
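Before killing processes wholesale, it can help to check what is actually holding VRAM. The following is a minimal sketch, assuming only that nvidia-smi is available on the PATH; it only lists the offending processes and leaves the decision of what to terminate to you.

# Minimal sketch: list the processes currently holding GPU memory so you can
# decide what to terminate, instead of killing every cmd.exe.
# Assumes nvidia-smi is on PATH.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip() or "no compute processes are using the GPU")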

moxSedai commented 1 year ago

Still a major issue on Linux.

dfghsderftgerdf commented 1 year ago

Try this, it helped me. I had the same issue; download the latest CUDA toolkit:

https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local

dalcefo commented 1 year ago

Try this, it helped me. I had the same issue; download the latest CUDA toolkit:

https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local

You are great! It works!