bmaltais / kohya_ss

Apache License 2.0

Failing to start Lora training - Exit status 1 #428

Closed deadp closed 1 year ago

deadp commented 1 year ago

Hi, I scrolled through some issues, and none of the solutions there have worked so far.

Here is my output:


Folder 100_test: 30 images found
Folder 100_test: 3000 steps
max_train_steps = 3000
stop_text_encoder_training = 0
lr_warmup_steps = 300
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="/home/user/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV13_v13VAEIncluded.safetensors" --train_data_dir="/home/user/image" --resolution=512,512 --output_dir="/home/user/lora" --logging_dir="" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-5 --unet_lr=0.0001 --network_dim=8 --output_name="test" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="cosine" --lr_warmup_steps="300" --train_batch_size="1" --max_train_steps="3000" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="AdamW" --bucket_reso_steps=64 --bucket_no_upscale 
2023-03-23 17:30:44.421890: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:30:44.559261: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-23 17:30:45.029480: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-23 17:30:45.029553: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-23 17:30:45.029562: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-23 17:30:46.544658: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:30:46.719307: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-23 17:30:47.216441: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-23 17:30:47.216497: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-23 17:30:47.216506: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
prepare tokenizer
Use DreamBooth method.
prepare images.
found directory /home/user/image/100_test contains 30 image files
3000 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "/home/user/image/100_test"
    image_count: 30
    num_repeats: 100
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    is_reg: False
    class_tokens: test
    caption_extension: .caption

[Dataset 0]
loading image sizes.
100%|█████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 6716.26it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 3000
mean ar error (without repeats): 0.0
prepare accelerator
Traceback (most recent call last):
  File "/home/user/kohya_ss/train_network.py", line 699, in <module>
    train(args)
  File "/home/user/kohya_ss/train_network.py", line 119, in train
    accelerator, unwrap_model = train_util.prepare_accelerator(args)
  File "/home/user/kohya_ss/library/train_util.py", line 2498, in prepare_accelerator
    accelerator = Accelerator(
  File "/home/user/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 355, in __init__
    raise ValueError(err.format(mode="fp16", requirement="a GPU"))
ValueError: fp16 mixed precision requires a GPU
Traceback (most recent call last):
  File "/home/user/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/user/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/user/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/kohya_ss/venv/bin/python3', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=/home/user/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV13_v13VAEIncluded.safetensors', '--train_data_dir=/home/user/image', '--resolution=512,512', '--output_dir=/home/user/lora', '--logging_dir=', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=test', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=300', '--train_batch_size=1', '--max_train_steps=3000', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW', '--bucket_reso_steps=64', '--bucket_no_upscale']' returned non-zero exit status 1.

I see one error saying that fp16 mixed precision requires a GPU, so apparently no GPU is found? But I do have a GPU here and tried to select it.

Please let me know if I can provide any additional information.

bmaltais commented 1 year ago

I suggest you try deleting the kohya_ss folder and re-running the full setup to see if it helps. It looks like something did not get installed right.
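
Before or after reinstalling, it may help to check whether the PyTorch inside the venv can see the GPU at all. As far as I can tell, the `fp16 mixed precision requires a GPU` error is raised by accelerate when `torch.cuda.is_available()` returns `False`, which usually points to a CPU-only torch build or a driver problem rather than to the training settings. The following is only a minimal sketch; the function name `cuda_report` and the file name `check_cuda.py` are made up for illustration and are not part of kohya_ss.

```python
import importlib.util


def cuda_report() -> dict:
    """Collect basic facts about the PyTorch install in the current environment.

    Hypothetical helper for illustration; not part of kohya_ss or accelerate.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_with_cuda": None,   # CUDA version torch was built against, None on CPU-only builds
        "cuda_available": False,
        "devices": [],
        "import_error": None,
    }

    # Check for the package first so a missing install is reported cleanly.
    if importlib.util.find_spec("torch") is None:
        return report

    try:
        import torch
    except Exception as exc:  # a broken install can fail in several ways
        report["import_error"] = repr(exc)
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())

    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Run it with the venv's interpreter (the traceback above shows it at `/home/user/kohya_ss/venv/bin/python3`), for example `/home/user/kohya_ss/venv/bin/python3 check_cuda.py`. If `cuda_available` is `False` or `built_with_cuda` is `None`, the venv most likely has a CPU-only torch, and re-running the setup is a reasonable next step.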