Linaqruf / kohya-trainer

Adapted from https://note.com/kohya_ss/n/nbf7ce8d80f29 for easier cloning
Apache License 2.0

Error OutOfMemoryError: CUDA out of memory. on 5.5: Start Training #199

Closed lambertlu closed 1 year ago

lambertlu commented 1 year ago

Got this error message today: OutOfMemoryError: CUDA out of memory. on 5.5: Start Training.

It was fine yesterday. Resolution: (768, 768).

Loading settings from /content/LoRA/config/config_file.toml...
/content/LoRA/config/config_file
prepare tokenizer
Downloading (…)olve/main/vocab.json: 100% 961k/961k [00:00<00:00, 1.14MB/s]
Downloading (…)olve/main/merges.txt: 100% 525k/525k [00:00<00:00, 758kB/s]
Downloading (…)cial_tokens_map.json: 100% 389/389 [00:00<00:00, 84.3kB/s]
Downloading (…)okenizer_config.json: 100% 905/905 [00:00<00:00, 225kB/s]
update token length: 225
Load dataset config from /content/LoRA/config/dataset_config.toml
prepare images.
found directory /content/LoRA/train_data contains 24 image files
240 train images with repeating.
0 reg images.
no regularization images
[Dataset 0]
  batch_size: 6
  resolution: (768, 768)
  enable_bucket: True
  min_bucket_reso: 320
  max_bucket_reso: 1280
  bucket_reso_steps: 64
  bucket_no_upscale: False

[Subset 0 of Dataset 0]
  image_dir: "/content/LoRA/train_data"
  image_count: 24
  num_repeats: 10
  shuffle_caption: True
  keep_tokens: 0
  caption_dropout_rate: 0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1
  token_warmup_step: 0
  is_reg: False
  class_tokens: mksks style
  caption_extension: .txt

[Dataset 0]
loading image sizes.
100% 24/24 [00:00<00:00, 2216.52it/s]
make buckets
number of images per bucket (including repeats)
bucket 0: resolution (768, 768), count: 240
mean ar error (without repeats): 0.0
prepare accelerator
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint
loading u-net:
loading vae:
Downloading (…)lve/main/config.json: 100% 4.52k/4.52k [00:00<00:00, 1.32MB/s]
Downloading pytorch_model.bin: 100% 1.71G/1.71G [00:51<00:00, 33.1MB/s]
loading text encoder:
Replace CrossAttention.forward to use xformers
[Dataset 0]
caching latents.
100% 6/6 [00:12<00:00, 2.00s/it]
import network module: networks.lora
create LoRA network. base dim (rank): 32, alpha: 16
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
use 8-bit AdamW optimizer | {}
override steps. steps for 20 epochs: 800
running training
num train images * repeats: 240
num reg images: 0
num batches per epoch: 40
num epochs: 20
batch size per device: 6
gradient accumulation steps: 1
total optimization steps: 800
steps: 0% 0/800 [00:00<?, ?it/s]
epoch 1/20
Traceback (most recent call last):
  File "/content/kohya-trainer/train_network.py", line 752, in <module>
    train(args)
  File "/content/kohya-trainer/train_network.py", line 583, in train
    noise_pred = unet(noisy_latents, timesteps, encode…
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/operations.py", line 490, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.9/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/unet_2d_condition.py", line 407, in forward
    sample = upsample_block(hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, …
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/unet_2d_blocks.py", line 1202, in forward
    hidden_states = resnet(hidden_states, temb)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/diffusers/models/resnet.py", line 450, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/normalization.py", line 273, in forward
    return F.group_norm(input, self.num_groups, self.weight, self.bias, self.eps)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py", line 2530, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, tor…
OutOfMemoryError: CUDA out of memory. Tried to allocate 204.00 MiB (GPU 0; 14.75 GiB total capacity; 13.09 GiB already allocated; 30.81 MiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0% 0/800 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.return…
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_network.py', '--sample_prompts=/content/LoRA/config/sample_prompt.txt', '--dataset_config=/content/LoRA/config/dataset_config.toml', '--config_file=/content/LoRA/config/config_file.toml']' returned non-zero exit status 1.
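
For what it's worth, the error text itself points at a second lever besides batch size: allocator fragmentation via max_split_size_mb / PYTORCH_CUDA_ALLOC_CONF. Below is a minimal sketch of acting on that hint from the notebook, assuming the variable is set before accelerate launches train_network.py (child processes inherit the environment); the 128 MiB value is an illustrative starting point, not a tuned recommendation.

```python
import os
import torch

# Suggested by the OOM message: cap allocator block splitting to reduce
# fragmentation. Must be exported before the training process creates its
# CUDA context; 128 is illustrative, not a tuned value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Optional sanity check: how much VRAM is actually free before retrying?
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB")
```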

tommyjohn81 commented 1 year ago

Reduce your batch size down to 1, then work your way up.
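
A minimal sketch of that change, assuming the Colab paths from the log above and that the per-dataset batch size lives in dataset_config.toml (the log prints "[Dataset 0] batch_size: 6"); the regex edit is only illustrative, editing the file by hand works just as well:

```python
import re

# Hypothetical helper: drop the dataset batch_size to 1 before relaunching,
# then raise it step by step until the OOM reappears. Path and key name are
# taken from the log in this issue.
config_path = "/content/LoRA/config/dataset_config.toml"

with open(config_path) as f:
    text = f.read()

text = re.sub(r"batch_size\s*=\s*\d+", "batch_size = 1", text)

with open(config_path, "w") as f:
    f.write(text)
```

If the smaller batch trains too noisily, gradient accumulation (the log shows gradient accumulation steps: 1) can restore the effective batch size at roughly the memory cost of the smaller per-device batch.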

nothelloearth1 commented 1 year ago

Reduce your batch size to either 4 or 5

lambertlu commented 1 year ago

Thanks for the reply. I turned it down to 3 and it works now. Have a nice day!