bmaltais / kohya_ss

Apache License 2.0

Some kind of error when attempting to train LORA. #2217

Open NigaKniga opened 7 months ago

NigaKniga commented 7 months ago

04:42:47-667199 INFO Kohya_ss GUI version: v23.0.15
04:42:47-674189 ERROR [WinError 2] The system cannot find the file specified
04:42:47-677169 INFO nVidia toolkit detected
04:42:49-121313 INFO Torch 2.1.2+cu118
04:42:49-135252 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
04:42:49-137275 INFO Torch detected GPU: NVIDIA GeForce GTX 1080 Ti VRAM 11264 Arch (6, 1) Cores 28
04:42:49-141264 INFO Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
04:42:49-144228 INFO Verifying modules installation status from requirements_windows_torch2.txt...
04:42:49-148237 INFO Verifying modules installation status from requirements.txt...
04:42:51-364290 INFO headless: False
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
04:43:07-021434 INFO Loading config...
04:44:04-488717 INFO Start training Dreambooth...
04:44:04-489695 INFO Validating model file or folder path C:/Users/user/Desktop/Tools/SD auto forge/webui/models/Stable-diffusion/ponyDiffusionV6XL_v6StartWithThisOne.safetensors existence...
04:44:04-491690 INFO ...valid
04:44:04-492687 INFO Validating output_dir path /workspace/kohya_ss/output/SDXL1.0-LoRa_Zeitgeist-Photographic-Style_by-AI_Characters-v2.0 existence...
04:44:04-493685 INFO ...valid
04:44:04-494698 INFO Validating train_data_dir path C:\Users\user\Desktop\LORAimg\Image existence...
04:44:04-495679 INFO ...valid
04:44:04-496677 INFO reg_data_dir not specified, skipping validation
04:44:04-497674 INFO Validating logging_dir path /workspace/kohya_ss/output/SDXL1.0-LoRa_Zeitgeist-Photographic-Style_by-AI_Characters-v2.0 existence...
04:44:04-498671 INFO ...valid
04:44:04-499670 INFO log_tracker_config not specified, skipping validation
04:44:04-500666 INFO resume not specified, skipping validation
04:44:04-501663 INFO vae not specified, skipping validation
04:44:04-502661 INFO dataset_config not specified, skipping validation
04:44:04-505652 INFO Folder 100_subject: steps 132800
04:44:04-506650 INFO max_train_steps (132800 / 3 / 1 * 50 * 1) = 2213334
04:44:04-507648 INFO stop_text_encoder_training = 0
04:44:04-508645 INFO lr_warmup_steps = 0
04:44:04-509642 INFO Saving training config to /workspace/kohya_ss/output/SDXL1.0-LoRa_Zeitgeist-Photographic-Style_by-AI_Characters-v2.0\Mint yRyik:Gasai_Yuno_20240407-044404.json...
04:44:04-511638 INFO accelerate launch --num_cpu_threads_per_process=2 "C:\Users\user\kohya_ss/sd-scripts/sdxl_train.py" --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_dropout_rate="0.05" --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --gradient_checkpointing --learning_rate="3e-05" --learning_rate_te1="1e-05" --learning_rate_te2="1e-05" --logging_dir="/workspace/kohya_ss/output/SDXL1.0-LoRa_Zeitgeist-Photographic-Style_by-AI_Characters-v2.0" --lr_scheduler="constant" --lr_scheduler_num_cycles="50" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_epochs=50 --max_train_steps="2213334" --min_snr_gamma=5 --mixed_precision="fp16" --optimizer_type="AdamW" --output_dir="/workspace/kohya_ss/output/SDXL1.0-LoRa_Zeitgeist-Photographic-Style_by-AI_Characters-v2.0" --output_name="LORAtest" --pretrained_model_name_or_path="C:/Users/user/Desktop/Tools/SD auto forge/webui/models/Stable-diffusion/ponyDiffusionV6XL_v6StartWithThisOne.safetensors" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --train_batch_size="3" --train_data_dir="C:\Users\user\Desktop\LORAimg\Image" --xformers
The following values were not passed to accelerate launch and had defaults used instead:
        --num_processes was set to a value of 1
        --num_machines was set to a value of 1
        --mixed_precision was set to a value of 'no'
        --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
2024-04-07 04:44:12 INFO prepare tokenizers sdxl_train_util.py:135
INFO Using DreamBooth method. sdxl_train.py:140
2024-04-07 04:44:13 INFO prepare images.
train_util.py:1469
INFO found directory C:\Users\user\Desktop\LORAimg\Image\100_subject contains 1328 image files train_util.py:1432
INFO 132800 train images with repeating. train_util.py:1508
INFO 0 reg images. train_util.py:1511
WARNING no regularization images / no regularization images were found train_util.py:1516
INFO [Dataset 0] config_util.py:544
                             batch_size: 3
                             resolution: (1024, 1024)
                             enable_bucket: True
                             network_multiplier: 1.0
                             min_bucket_reso: 256
                             max_bucket_reso: 2048
                             bucket_reso_steps: 64
                             bucket_no_upscale: False

                           [Subset 0 of Dataset 0]
                             image_dir: "C:\Users\user\Desktop\LORAimg\Image\100_Gasai Yuno"
                             image_count: 1328
                             num_repeats: 100
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             caption_dropout_rate: 0.05
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             is_reg: False
                             class_tokens: Gasai Yuno
                             caption_extension: .txt

                INFO     [Dataset 0]                                                              config_util.py:550
                INFO     loading image sizes.                                                      train_util.py:794

100%|███████████████████████████████████████████████████████████████████████████| 1328/1328 [00:00<00:00, 10823.84it/s]
INFO make buckets train_util.py:800
INFO number of images (including repeats) per bucket / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:846
INFO bucket 0: resolution (896, 1152), count: 100 train_util.py:851
INFO bucket 1: resolution (1024, 1024), count: 200 train_util.py:851
INFO bucket 2: resolution (1088, 960), count: 500 train_util.py:851
INFO bucket 3: resolution (1152, 896), count: 100 train_util.py:851
INFO bucket 4: resolution (1344, 768), count: 131900 train_util.py:851
INFO mean ar error (without repeats): 0.02777957066360671 train_util.py:856
INFO prepare accelerator sdxl_train.py:197
accelerator device: cuda
INFO loading model for process 0/1 sdxl_train_util.py:31
INFO load StableDiffusion checkpoint: C:/Users/user/Desktop/Tools/SD auto forge/webui/models/Stable-diffusion/ponyDiffusionV6XL_v6StartWithThisOne.safetensors sdxl_train_util.py:71
INFO building U-Net sdxl_model_util.py:192
INFO loading U-Net from checkpoint sdxl_model_util.py:196
2024-04-07 04:44:24 INFO U-Net: sdxl_model_util.py:202
2024-04-07 04:44:25 INFO building text encoders sdxl_model_util.py:205
2024-04-07 04:44:27 INFO loading text encoders from checkpoint sdxl_model_util.py:258
2024-04-07 04:44:28 INFO text encoder 1: sdxl_model_util.py:272
2024-04-07 04:44:32 INFO text encoder 2: sdxl_model_util.py:276
INFO building VAE sdxl_model_util.py:279
INFO loading VAE from checkpoint sdxl_model_util.py:284
2024-04-07 04:44:33 INFO VAE: sdxl_model_util.py:287
Disable Diffusers' xformers
INFO Enable xformers for U-Net train_util.py:2529
2024-04-07 04:44:34 INFO [Dataset 0] train_util.py:1948
INFO caching latents. train_util.py:915
INFO checking cache validity... train_util.py:925
100%|████████████████████████████████████████████████████████████████████████████| 1328/1328 [00:00<00:00, 4393.60it/s]
INFO caching latents...
train_util.py:962
100%|██████████████████████████████████████████████████████████████████████████████| 1328/1328 [20:02<00:00, 1.10it/s]
train unet: True, text_encoder1: False, text_encoder2: False
number of models: 1
number of trainable parameters: 2567463684
prepare optimizer, data loader etc.
2024-04-07 05:04:37 INFO use AdamW optimizer | {} train_util.py:3819
override steps. steps for 50 epochs is / 指定エポックまでのステップ数: 2213450
running training / 学習開始
  num examples / サンプル数: 132800
  num batches per epoch / 1epochのバッチ数: 44269
  num epochs / epoch数: 50
  batch size per device / バッチサイズ: 3
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 2213450
steps: 0%| | 0/2213450 [00:00<?, ?it/s]
epoch 1/50
Traceback (most recent call last):
  File "C:\Users\user\kohya_ss\sd-scripts\sdxl_train.py", line 792, in <module>
    train(args)
  File "C:\Users\user\kohya_ss\sd-scripts\sdxl_train.py", line 570, in train
    noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\utils\operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\utils\operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\Users\user\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 1111, in forward
    h = call_module(module, h, emb, context)
  File "C:\Users\user\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 1095, in call_module
    x = layer(x, context)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 750, in forward
    hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 669, in forward
    output = torch.utils.checkpoint.checkpoint(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\_dynamo\eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\_dynamo\external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\autograd\function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "C:\Users\user\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 665, in custom_forward
    return func(*inputs)
  File "C:\Users\user\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 651, in forward_body
    norm_hidden_states = self.norm2(hidden_states)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\modules\normalization.py", line 196, in forward
    return F.layer_norm(
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\torch\nn\functional.py", line 2543, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 0 has a total capacty of 11.00 GiB of which 0 bytes is free. Of the allocated memory 16.98 GiB is allocated by PyTorch, and 696.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0%| | 0/2213450 [04:54<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\user\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\user\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\user\kohya_ss\venv\Scripts\python.exe', 'C:\Users\user\kohya_ss/sd-scripts/sdxl_train.py', '--bucket_reso_steps=64', '--cache_latents', '--cache_latents_to_disk', '--caption_dropout_rate=0.05', '--caption_extension=.txt', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--gradient_checkpointing', '--learning_rate=3e-05', '--learning_rate_te1=1e-05', '--learning_rate_te2=1e-05', '--logging_dir=/workspace/kohya_ss/output/SDXL1.0-LoRa_Zeitgeist-Photographic-Style_by-AI_Characters-v2.0', '--lr_scheduler=constant', '--lr_scheduler_num_cycles=50', '--max_data_loader_n_workers=0', '--resolution=1024,1024', '--max_train_epochs=50', '--max_train_steps=2213334', '--min_snr_gamma=5', '--mixed_precision=fp16', '--optimizer_type=AdamW', '--output_dir=/workspace/kohya_ss/output/SDXL1.0-LoRa_Zeitgeist-Photographic-Style_by-AI_Characters-v2.0', '--output_name=LORAtest', '--pretrained_model_name_or_path=C:/Users/user/Desktop/Tools/SD auto forge/webui/models/Stable-diffusion/ponyDiffusionV6XL_v6StartWithThisOne.safetensors', '--save_every_n_epochs=1', '--save_model_as=safetensors', '--save_precision=fp16', '--train_batch_size=3', '--train_data_dir=C:\Users\user\Desktop\LORAimg\Image', '--xformers']' returned non-zero exit status 1.
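For reference, the two different step totals in the log above can both be reproduced from the numbers it reports. This is a sketch of the arithmetic, not kohya-ss code: the GUI's `max_train_steps` estimate divides the repeated image count by the batch size, while the training script later "overrides" it by batching each aspect-ratio bucket separately.

```python
import math

# Values reported in the log.
images = 1328        # image files found in the dataset folder
repeats = 100        # folder prefix 100_subject
batch_size = 3       # --train_batch_size
epochs = 50          # --max_train_epochs
grad_accum = 1       # gradient accumulation steps

# GUI estimate: "max_train_steps (132800 / 3 / 1 * 50 * 1) = 2213334"
train_images = images * repeats  # 132800 "train images with repeating"
gui_steps = math.ceil(train_images / batch_size / grad_accum * epochs)
print(gui_steps)  # 2213334

# The script batches per bucket, so partial batches round up per bucket
# (bucket counts 100, 200, 500, 100, 131900 from the log).
buckets = [100, 200, 500, 100, 131900]
batches_per_epoch = sum(math.ceil(c / batch_size) for c in buckets)
print(batches_per_epoch)           # 44269 "num batches per epoch"
print(batches_per_epoch * epochs)  # 2213450 "override steps"
```

Either way, roughly 2.2 million optimization steps is far beyond what a LoRA normally needs; the 100-repeat folder prefix combined with 50 epochs multiplies the dataset 5000-fold.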

bmaltais commented 7 months ago

The error log you've provided indicates a CUDA out-of-memory error during training with PyTorch: the process tried to allocate more memory on the GPU than was available. Specifically, the error message states:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 0 has a total capacity of 11.00 GiB of which 0 bytes is free. Of the allocated memory 16.98 GiB is allocated by PyTorch, and 696.75 MiB is reserved by PyTorch but unallocated.

This error is a common issue in deep learning tasks, especially when working with large models or large batches of data.
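To put rough numbers on it: the command in the log runs `sdxl_train.py` with `--optimizer_type=AdamW` and full U-Net training ("train unet: True", 2,567,463,684 trainable parameters), i.e. a full fine-tune rather than a LoRA network. A back-of-envelope estimate, assuming fp32 weights, gradients, and AdamW's two moment buffers and ignoring activations entirely, already dwarfs the 11 GiB on a GTX 1080 Ti:

```python
# Rough lower-bound VRAM estimate for the run in the log above
# (a sketch; real usage also includes activations and buffers).
params = 2_567_463_684  # "number of trainable parameters" from the log
fp32 = 4                # bytes per element

weights = params * fp32   # model weights
grads   = params * fp32   # gradients
adamw_m = params * fp32   # AdamW first moment (exp_avg)
adamw_v = params * fp32   # AdamW second moment (exp_avg_sq)

total_gib = (weights + grads + adamw_m + adamw_v) / 2**30
print(f"{total_gib:.1f} GiB")  # 38.3 GiB vs. 11 GiB of VRAM
```

That is consistent with the OOM hitting on the very first step regardless of `max_split_size_mb` tuning. On an 11 GiB card the practical options are training a LoRA (only the small network weights get optimizer state) or a memory-saving optimizer such as AdamW8bit, rather than full SDXL fine-tuning with plain AdamW.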