Linaqruf / kohya-trainer

Adapted from https://note.com/kohya_ss/n/nbf7ce8d80f29 for easier cloning
Apache License 2.0

kohya fast trainer error at "Start Training" #115

Closed: Zexia1 closed this issue 1 year ago

Zexia1 commented 1 year ago

I managed to use my local Google Drive as the directory for the training dataset. I probably missed something, but I can't figure out what.

Reading package lists...
Building dependency tree...
Reading state information...
aria2 is already the newest version (1.35.0-1build1).
liblz4-tool is already the newest version (1.9.2-2ubuntu0.20.04.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Preparing metadata (setup.py) ... done
Building wheel for library (setup.py) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.0/134.0 MB 8.5 MB/s eta 0:00:00

Download Progress Summary as of Sun Mar 5 14:50:08 2023

[#e5c326 2.5GiB/3.9GiB(63%) CN:16 DL:243MiB ETA:6s]
FILE: /content/pretrained_model/anything-v3-fp32-pruned.safetensors

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
e5c326|OK  |   237MiB/s|/content/pretrained_model/anything-v3-fp32-pruned.safetensors

Status Legend:
(OK):download completed.

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
dc7d07|OK  |       0B/s|/content/vae/anime.vae.pt

Status Legend:
(OK):download completed.

Skipping directory 10_vestia_zeta
2023-03-05 14:50:15.529319: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-05 14:50:16.593459: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-05 14:50:16.593633: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-03-05 14:50:16.593659: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
using existing wd14 tagger model
found 0 images.
loading model and labels
2023-03-05 14:50:23.731966: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-03-05 14:50:47.211466: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 37203968 exceeds 10% of free system memory.
WARNING:tensorflow:No training configuration found in save file, so the model was not compiled. Compile it manually.
0it [00:00, ?it/s]
done!
2023-03-05 14:50:58.485195: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-05 14:50:59.246223: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-05 14:50:59.246355: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-05 14:50:59.246374: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
load images from /content/drive/MyDrive/LoRA/zeta
found 0 images.
loading BLIP caption: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
BLIP loaded
0it [00:00, ?it/s]
done!
--2023-03-05 14:51:31--  https://raw.githubusercontent.com/Stability-AI/stablediffusion/main/configs/stable-diffusion/v2-inference-v.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

The file is already fully retrieved; nothing to do.

File successfully downloaded

+--------------------------+---------------------------------------------------------------+
| Hyperparameter           | Value                                                         |
+--------------------------+---------------------------------------------------------------+
| mode                     | LoRA                                                          |
| use_dreambooth_method    | True                                                          |
| lowram                   | True                                                          |
| v2                       | True                                                          |
| v_parameterization       | True                                                          |
| project_name             | vestia_zeta                                                   |
| modelPath                | /content/pretrained_model/anything-v3-fp32-pruned.safetensors |
| vaePath                  | /content/vae/anime.vae.pt                                     |
| train_data_dir           | /content/drive/MyDrive/LoRA/zeta                              |
| reg_data_dir             | /content/drive/MyDrive/LoRA/reg_data                          |
| output_dir               | /content/drive/MyDrive/training_dir/output                    |
| network_dim              | 128                                                           |
| network_alpha            | 128                                                           |
| network_weights          | False                                                         |
| unet_lr                  | 0.0001                                                        |
| text_encoder_lr          | 5e-05                                                         |
| optimizer_type           | AdamW8bit                                                     |
| optimizer_args           | False                                                         |
| learning_rate            | 2e-06                                                         |
| lr_scheduler             | constant                                                      |
| lr_warmup_steps          | 250                                                           |
| lr_scheduler_args        | 1                                                             |
| keep_tokens              | 1                                                             |
| min_bucket_reso          | 256                                                           |
| max_bucket_reso          | 1024                                                          |
| resolution               | 512                                                           |
| caption_extension        | .txt                                                          |
| noise_offset             | 0                                                             |
| prior_loss_weight        | 1.0                                                           |
| mixed_precision          | fp16                                                          |
| save_precision           | fp16                                                          |
| save_n_epochs_type       | save_n_epoch_ratio                                            |
| save_n_epochs_type_value | 3                                                             |
| save_model_as            | safetensors                                                   |
| train_batch_size         | 4                                                             |
| max_train_type           | max_train_epochs                                              |
| max_train_type_value     | 20                                                            |
| clip_skip                | 2                                                             |
| logging_dir              | /content/training_dir/logs                                    |
| additional_argument      | --shuffle_caption --xformers                                  |
+--------------------------+---------------------------------------------------------------+

2023-03-05 14:51:32.848337: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-05 14:51:33.505193: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-05 14:51:33.505334: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-05 14:51:33.505360: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-05 14:51:36.439065: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-05 14:51:37.065498: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-05 14:51:37.065609: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-05 14:51:37.065630: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
prepare tokenizer
update token length: 225
Use DreamBooth method.
prepare train images.
found directory 10_vestia_zeta contains 152 image files
1520 train images with repeating.
prepare reg images.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
loading image sizes.
100% 152/152 [00:00<00:00, 183.53it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (320, 704), count: 60
bucket 1: resolution (384, 640), count: 430
bucket 2: resolution (448, 576), count: 660
bucket 3: resolution (512, 512), count: 170
bucket 4: resolution (576, 448), count: 80
bucket 5: resolution (640, 384), count: 100
bucket 6: resolution (704, 320), count: 20
mean ar error (without repeats): 0.058514486511307695
prepare accelerator
Using accelerator 0.15.0 or above.
load StableDiffusion checkpoint
Traceback (most recent call last):
  File "/content/kohya-trainer/train_network.py", line 528, in <module>
    train(args)
  File "/content/kohya-trainer/train_network.py", line 97, in train
    text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype)
  File "/content/kohya-trainer/library/train_util.py", line 1861, in load_target_model
    text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, name_or_path)
  File "/content/kohya-trainer/library/model_util.py", line 880, in load_models_from_stable_diffusion_checkpoint
    info = unet.load_state_dict(converted_unet_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
    size mismatch for down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 768]) from checkpoint, the shape in current model is torch.Size([640, 1024]).
    size mismatch for up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 768]) from checkpoint, the shape in current model is torch.Size([320, 1024]).
    size mismatch for mid_block.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
    size mismatch for mid_block.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 768]) from checkpoint, the shape in current model is torch.Size([1280, 1024]).
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/kohya-trainer/train_network.py', '--v2', '--v_parameterization', '--output_name=vestia_zeta', '--pretrained_model_name_or_path=/content/pretrained_model/anything-v3-fp32-pruned.safetensors', '--vae=/content/vae/anime.vae.pt', '--train_data_dir=/content/drive/MyDrive/LoRA/zeta', '--reg_data_dir=/content/drive/MyDrive/LoRA/reg_data', '--output_dir=/content/drive/MyDrive/training_dir/output', '--network_dim=128', '--network_alpha=128', '--network_module=networks.lora', '--unet_lr=0.0001', '--text_encoder_lr=5e-05', '--optimizer_type=AdamW8bit', '--learning_rate=2e-06', '--lr_scheduler=constant', '--lr_warmup_steps=250', '--resolution=512', '--enable_bucket', '--keep_tokens=1', '--min_bucket_reso=256', '--max_bucket_reso=1024', '--caption_extension=.txt', '--cache_latents', '--prior_loss_weight=1.0', '--lowram', '--mixed_precision=fp16', '--save_precision=fp16', '--save_n_epoch_ratio=3', '--save_model_as=safetensors', '--train_batch_size=4', '--max_token_length=225', '--max_train_epochs=20', '--logging_dir=/content/training_dir/logs', '--log_prefix=vestia_zeta', '--shuffle_caption', '--xformers']' returned non-zero exit status 1.

Linaqruf commented 1 year ago

The problem is that v2 and v_parameterization are set to True while the model is Anything v3.2, which is an SD v1.x model. You need to uncheck both and it will work fine.
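
For context: SD v1.x models condition the UNet cross-attention on 768-dimensional CLIP ViT-L/14 text embeddings, while SD v2.x models use 1024-dimensional OpenCLIP embeddings. That is exactly the 768-vs-1024 size mismatch the traceback above reports. If you are unsure which family a checkpoint belongs to, a minimal sketch like the one below (not part of kohya-trainer; it assumes the standard LDM key layout used by .safetensors SD checkpoints) can read one cross-attention weight to find out:

```python
# Minimal sketch: guess whether a Stable Diffusion checkpoint is v1.x or v2.x
# by reading the context dimension of one cross-attention projection weight.
# Hypothetical helper, not part of kohya-trainer; assumes the standard LDM
# checkpoint key layout.
from safetensors import safe_open

def guess_sd_version(path: str) -> str:
    key = ("model.diffusion_model.input_blocks.1.1."
           "transformer_blocks.0.attn2.to_k.weight")
    with safe_open(path, framework="pt", device="cpu") as f:
        context_dim = f.get_tensor(key).shape[1]
    # 768  -> CLIP ViT-L/14 text encoder (SD v1.x)
    # 1024 -> OpenCLIP ViT-H text encoder (SD v2.x)
    return "SD v2.x" if context_dim == 1024 else "SD v1.x"

print(guess_sd_version(
    "/content/pretrained_model/anything-v3-fp32-pruned.safetensors"))
```

For a v1.x checkpoint like this one, that means leaving the v2 and v_parameterization checkboxes unchecked in the notebook, which drops --v2 and --v_parameterization from the accelerate launch command shown in the error.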