bmaltais / kohya_ss


Can't train LoRAs #995

Closed riskay99 closed 1 year ago

riskay99 commented 1 year ago

I've tried reinstalling a few times, disabling the xformers setting, reducing the resolution back to 512,512, and following a few tips from a Linux post I saw (#810, I believe) plus a few other settings. I'm on Ubuntu and I believe I've got everything installed right, but it's possible I messed up somewhere. Here's my error log; hopefully someone can help me pin down where I went wrong or how to fix it.

./gui.sh --listen 127.0.0.1 --server_port 7860 --inbrowser

10:00:33-766950 INFO nVidia toolkit detected
10:00:34-143288 INFO Torch 1.12.1+cu116
10:00:34-156045 INFO Torch backend: nVidia CUDA 11.6 cuDNN 8302
10:00:34-157644 INFO Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24214 Arch (8, 9) Cores 128
10:00:34-158386 INFO Verifying requirements
10:00:34-160147 INFO Installing package: diffusers[torch]==0.10.2
10:00:37-002950 INFO headless: False
10:00:37-005416 INFO Load CSS...
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
10:01:18-606818 INFO Loading config...
10:01:20-012801 INFO Loading config...
10:01:23-732054 INFO Start training Dreambooth...
10:01:23-734574 INFO Valid image folder names found in: /media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/image
10:01:23-737095 INFO Folder 100_avaluaca : steps 7100
10:01:23-739271 INFO max_train_steps = 7100
10:01:23-741063 INFO stop_text_encoder_training = 0
10:01:23-742909 INFO lr_warmup_steps = 0
10:01:23-744889 INFO accelerate launch --num_cpu_threads_per_process=2 "train_db.py" --enable_bucket --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5"
--train_data_dir="/media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/image" --resolution="512,512"
--output_dir="/media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/model" --logging_dir="/media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/log"
--save_model_as=safetensors --output_name="avaluaca" --max_data_loader_n_workers="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="7100" --save_every_n_epochs="1" --mixed_precision="bf16"
--save_precision="bf16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --bucket_no_upscale
2023-06-15 10:01:24.219538: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-06-15 10:01:24.340521: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-06-15 10:01:24.729370: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory 2023-06-15 10:01:24.729414: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory 2023-06-15 10:01:24.729420: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [10:01:25] WARNING The following values were not passed to accelerate launch and had defaults used instead: launch.py:1088 --num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
2023-06-15 10:01:25.972767: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-06-15 10:01:26.093622: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-06-15 10:01:26.479471: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory 2023-06-15 10:01:26.479516: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory 2023-06-15 10:01:26.479522: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. prepare tokenizer prepare images. found directory /media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/image/100_avaluaca contains 71 image files 7100 train images with repeating. 0 reg images. no regularization images / 正則化画像が見つかりませんでした [Dataset 0] batch_size: 1 resolution: (512, 512) enable_bucket: True min_bucket_reso: 256 max_bucket_reso: 1024 bucket_reso_steps: 64 bucket_no_upscale: True

[Subset 0 of Dataset 0] image_dir: "/media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/image/100_avaluaca" image_count: 71 num_repeats: 100 shuffle_caption: False keep_tokens: 0 caption_dropout_rate: 0.0 caption_dropout_every_n_epoches: 0 caption_tag_dropout_rate: 0.0 color_aug: False flip_aug: False face_crop_aug_range: None random_crop: False token_warmup_min: 1, token_warmup_step: 0, is_reg: False class_tokens: avaluaca caption_extension: .txt

[Dataset 0] loading image sizes. 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:00<00:00, 12031.66it/s] make buckets min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) bucket 0: resolution (384, 576), count: 7100 mean ar error (without repeats): 0.014184397163120588 prepare accelerator Using accelerator 0.15.0 or above. loading model for process 0/1 load Diffusers pretrained models: runwayml/stable-diffusion-v1-5 Fetching 15 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 88737.04it/s] You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 . [Dataset 0] caching latents. 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:02<00:00, 25.86it/s] prepare optimizer, data loader etc.

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

/media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/cv2/../../lib64')} warn( /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:105: UserWarning: /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/cv2/../../lib64: did not contain libcudart.so as expected! Searching further paths... warn( /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Seat0')} warn( /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('@/tmp/.ICE-unix/1827,unix/d3'), PosixPath('local/d3')} warn( /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Session0')} warn( /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('0'), PosixPath('1')} warn( CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64... /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')} warn( WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)! CUDA SETUP: Loading binary /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so... /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable. 
warn( use 8-bit AdamW optimizer | {} running training / 学習開始 num train images repeats / 学習画像の数×繰り返し回数: 7100 num reg images / 正則化画像の数: 0 num batches per epoch / 1epochのバッチ数: 7100 num epochs / epoch数: 1 batch size per device / バッチサイズ: 1 total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1 gradient ccumulation steps / 勾配を合計するステップ数 = 1 total optimization steps / 学習ステップ数: 7100 steps: 0%| | 0/7100 [00:00<?, ?it/s] epoch 1/1 ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /media/sinco/keepblank2/software/workin/kohya_ss/train_db.py:482 in │ │ │ │ 479 │ args = parser.parse_args() │ │ 480 │ args = train_util.read_config_from_file(args, parser) │ │ 481 │ │ │ ❱ 482 │ train(args) │ │ 483 │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/train_db.py:346 in train │ │ │ │ 343 │ │ │ │ │ │ params_to_clip = unet.parameters() │ │ 344 │ │ │ │ │ accelerator.clip_gradnorm(params_to_clip, args.max_grad_norm) │ │ 345 │ │ │ │ │ │ ❱ 346 │ │ │ │ optimizer.step() │ │ 347 │ │ │ │ lr_scheduler.step() │ │ 348 │ │ │ │ optimizer.zero_grad(set_to_none=True) │ │ 349 │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/accelerate/op │ │ timizer.py:134 in step │ │ │ │ 131 │ │ │ │ xm.optimizer_step(self.optimizer, optimizer_args=optimizer_args) │ │ 132 │ │ │ elif self.scaler is not None: │ │ 133 │ │ │ │ scale_before = self.scaler.get_scale() │ │ ❱ 134 │ │ │ │ self.scaler.step(self.optimizer, closure) │ │ 135 │ │ │ │ self.scaler.update() │ │ 136 │ │ │ │ scale_after = self.scaler.get_scale() │ │ 137 │ │ │ │ # If we reduced the loss scale, it means the optimizer step was skipped │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/torch/cuda/am │ │ p/grad_scaler.py:338 in step │ │ │ │ 335 │ │ │ │ 336 │ │ assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were rec │ │ 337 │ │ │ │ ❱ 338 │ │ retval = self._maybe_opt_step(optimizer, optimizer_state, args, kwargs) │ │ 339 │ │ │ │ 340 │ │ optimizer_state["stage"] = OptState.STEPPED │ │ 341 │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/torch/cuda/am │ │ p/grad_scaler.py:285 in _maybe_opt_step │ │ │ │ 282 │ def _maybe_opt_step(self, optimizer, optimizer_state, *args, *kwargs): │ │ 283 │ │ retval = None │ │ 284 │ │ if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()): │ │ ❱ 285 │ │ │ retval = optimizer.step(args, kwargs) │ │ 286 │ │ return retval │ │ 287 │ │ │ 288 │ def step(self, optimizer, *args, kwargs): │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/torch/optim/l │ │ r_scheduler.py:65 in wrapper │ │ │ │ 62 │ │ │ │ instance = instance_ref() │ │ 63 │ │ │ │ instance._step_count += 1 │ │ 64 │ │ │ │ wrapped = func.get(instance, cls) │ │ ❱ 65 │ │ │ │ return wrapped(*args, kwargs) │ │ 66 │ │ │ │ │ 67 │ │ │ # Note that the returned function here is no longer a bound method, │ │ 68 │ │ │ # so attributes like __func__ and __self__ no longer exist. 
│ │ │ │ /media/sinco/keepblank2/software/workin/kohyass/venv/lib/python3.10/site-packages/torch/optim/o │ │ ptimizer.py:113 in wrapper │ │ │ │ 110 │ │ │ │ obj, * = args │ │ 111 │ │ │ │ profile_name = "Optimizer.step#{}.step".format(obj.class.name) │ │ 112 │ │ │ │ with torch.autograd.profiler.record_function(profile_name): │ │ ❱ 113 │ │ │ │ │ return func(*args, kwargs) │ │ 114 │ │ │ return wrapper │ │ 115 │ │ │ │ 116 │ │ hooked = getattr(self.class.step, "hooked", None) │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/torch/autogra │ │ d/grad_mode.py:27 in decorate_context │ │ │ │ 24 │ │ @functools.wraps(func) │ │ 25 │ │ def decorate_context(*args, kwargs): │ │ 26 │ │ │ with self.clone(): │ │ ❱ 27 │ │ │ │ return func(*args, kwargs) │ │ 28 │ │ return cast(F, decorate_context) │ │ 29 │ │ │ 30 │ def _wrap_generator(self, func): │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/ │ │ optim/optimizer.py:265 in step │ │ │ │ 262 │ │ │ │ if len(state) == 0: │ │ 263 │ │ │ │ │ self.init_state(group, p, gindex, pindex) │ │ 264 │ │ │ │ │ │ ❱ 265 │ │ │ │ self.update_step(group, p, gindex, pindex) │ │ 266 │ │ │ │ 267 │ │ return loss │ │ 268 │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/torch/autogra │ │ d/grad_mode.py:27 in decorate_context │ │ │ │ 24 │ │ @functools.wraps(func) │ │ 25 │ │ def decorate_context(*args, *kwargs): │ │ 26 │ │ │ with self.clone(): │ │ ❱ 27 │ │ │ │ return func(args, kwargs) │ │ 28 │ │ return cast(F, decorate_context) │ │ 29 │ │ │ 30 │ def _wrap_generator(self, func): │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/ │ │ optim/optimizer.py:506 in update_step │ │ │ │ 503 │ │ │ state["max1"], state["new_max1"] = state["new_max1"], state["max1"] │ │ 504 │ │ │ state["max2"], state["new_max2"] = state["new_max2"], state["max2"] │ │ 505 │ │ elif state["state1"].dtype == torch.uint8 and config["block_wise"]: │ │ ❱ 506 │ │ │ F.optimizer_update_8bit_blockwise( │ │ 507 │ │ │ │ self.optimizer_name, │ │ 508 │ │ │ │ grad, │ │ 509 │ │ │ │ p, │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/ │ │ functional.py:858 in optimizer_update_8bit_blockwise │ │ │ │ 855 ) -> None: │ │ 856 │ │ │ 857 │ if g.dtype == torch.float32 and state1.dtype == torch.uint8: │ │ ❱ 858 │ │ str2optimizer8bit_blockwise[optimizer_name][0]( │ │ 859 │ │ │ get_ptr(p), │ │ 860 │ │ │ get_ptr(g), │ │ 861 │ │ │ get_ptr(state1), │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ NameError: name 'str2optimizer8bit_blockwise' is not defined steps: 0%| | 0/7100 [00:00<?, ?it/s] ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/bin/accelerate:8 in │ │ │ │ 5 from accelerate.commands.accelerate_cli import main │ │ 6 if name == 'main': │ │ 7 │ sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0]) │ │ ❱ 8 │ sys.exit(main()) │ │ 9 │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/accelerate/co │ │ mmands/accelerate_cli.py:45 in main │ │ │ │ 42 │ │ exit(1) │ │ 43 │ │ │ 44 │ # Run │ │ ❱ 45 │ args.func(args) │ │ 46 │ │ 47 │ │ 48 if name == "main": │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/accelerate/co │ │ mmands/launch.py:1104 in launch_command 
│ │ │ │ 1101 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │ │ 1102 │ │ sagemaker_launcher(defaults, args) │ │ 1103 │ else: │ │ ❱ 1104 │ │ simple_launcher(args) │ │ 1105 │ │ 1106 │ │ 1107 def main(): │ │ │ │ /media/sinco/keepblank2/software/workin/kohya_ss/venv/lib/python3.10/site-packages/accelerate/co │ │ mmands/launch.py:567 in simple_launcher │ │ │ │ 564 │ process = subprocess.Popen(cmd, env=current_env) │ │ 565 │ process.wait() │ │ 566 │ if process.returncode != 0: │ │ ❱ 567 │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │ │ 568 │ │ 569 │ │ 570 def multi_gpu_launcher(args): │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ CalledProcessError: Command '['/media/sinco/keepblank2/software/workin/kohya_ss/venv/bin/python', 'train_db.py', '--enable_bucket', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=/media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/image', '--resolution=512,512', '--output_dir=/media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/model', '--logging_dir=/media/sinco/keepblank2/software/workin/stable-diffusion-webui/zzzzz/avaluaca/avaluaca_lora/log', '--save_model_as=safetensors', '--output_name=avaluaca', '--max_data_loader_n_workers=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=7100', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--bucket_no_upscale']' returned non-zero exit status 1.
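
The NameError above (name 'str2optimizer8bit_blockwise' is not defined) seems to follow directly from the bitsandbytes warnings earlier in the log: libcudart.so was not found, bitsandbytes fell back to libbitsandbytes_cpu.so, and str2optimizer8bit_blockwise appears to only be defined when the CUDA binary loads. A quick way to confirm whether the CUDA runtime is visible from the venv (a minimal sketch, assuming the kohya_ss venv lives at ./venv):

# Run from the kohya_ss folder, inside the same venv the GUI uses.
source venv/bin/activate

# Is the CUDA runtime visible to the dynamic loader at all?
ldconfig -p | grep libcudart

# Importing bitsandbytes prints its CUDA SETUP diagnostics; it should report a
# libbitsandbytes_cuda*.so binary rather than libbitsandbytes_cpu.so.
python -c "import bitsandbytes"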

caiyesd commented 1 year ago

I'm hitting the same issue.

22:28:50-254987 INFO Start training LoRA Standard ... 22:28:50-256577 INFO Valid image folder names found in: /share/images 22:28:50-258238 INFO Folder 100_zhouxun: 21 images found 22:28:50-259241 INFO Folder 100_zhouxun: 2100 steps 22:28:50-260217 INFO Total steps: 2100 22:28:50-261194 INFO Train batch size: 2 22:28:50-262145 INFO Gradient accumulation steps: 1 22:28:50-263090 INFO Epoch: 1 22:28:50-263995 INFO Regulatization factor: 1 22:28:50-264961 INFO max_train_steps (2100 / 2 / 1 1 1) = 1050 22:28:50-266098 INFO stop_text_encoder_training = 0 22:28:50-267022 INFO lr_warmup_steps = 0 22:28:50-268037 INFO accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="/share/images" --resolution="512,512" --output_dir="/share/models" --logging_dir="/share/logs" --network_alpha="128" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-05 --unet_lr=0.0001 --network_dim=128 --output_name="Addams" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="1050" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --xformers --bucket_no_upscale 2023-06-16 22:28:51.054437: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-06-16 22:28:51.237574: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-06-16 22:28:51.837600: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-06-16 22:28:51.837694: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-06-16 22:28:51.837713: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [22:28:53] WARNING The following values were not passed to accelerate launch and had defaults used instead: launch.py:1088 --num_processes was set to a value of 1 --num_machines was set to a value of 1 --mixed_precision was set to a value of 'no' --dynamo_backend was set to a value of 'no' To avoid this warning pass in values for each of the problematic parameters or run accelerate config. 
2023-06-16 22:28:53.861223: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-06-16 22:28:54.043173: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-06-16 22:28:54.645775: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-06-16 22:28:54.645858: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 2023-06-16 22:28:54.645878: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /kohya_ss/train_network.py:17 in │ │ │ │ 14 from accelerate.utils import set_seed │ │ 15 from diffusers import DDPMScheduler │ │ 16 │ │ ❱ 17 import library.train_util as train_util │ │ 18 from library.train_util import ( │ │ 19 │ DreamBoothDataset, │ │ 20 ) │ │ │ │ /kohya_ss/library/train_util.py:56 in │ │ │ │ 53 │ KDPM2AncestralDiscreteScheduler, │ │ 54 ) │ │ 55 from huggingface_hub import hf_hub_download │ │ ❱ 56 import albumentations as albu │ │ 57 import numpy as np │ │ 58 from PIL import Image │ │ 59 import cv2 │ │ │ │ /usr/local/lib/python3.10/dist-packages/albumentations/init.py:5 in │ │ │ │ 2 │ │ 3 version = "1.3.0" │ │ 4 │ │ ❱ 5 from .augmentations import │ │ 6 from .core.composition import │ │ 7 from .core.serialization import │ │ 8 from .core.transforms_interface import │ │ │ │ /usr/local/lib/python3.10/dist-packages/albumentations/augmentations/init.py:2 in │ │ │ │ 1 # Common classes │ │ ❱ 2 from .blur.functional import │ │ 3 from .blur.transforms import │ │ 4 from .crops.functional import │ │ 5 from .crops.transforms import │ │ │ │ /usr/local/lib/python3.10/dist-packages/albumentations/augmentations/blur/init.py:1 in │ │ │ │ │ │ ❱ 1 from .functional import │ │ 2 from .transforms import │ │ 3 │ │ │ │ /usr/local/lib/python3.10/dist-packages/albumentations/augmentations/blur/functional.py:5 in │ │ │ │ │ │ 2 from math import ceil │ │ 3 from typing import Sequence, Union │ │ 4 │ │ ❱ 5 import cv2 │ │ 6 import numpy as np │ │ 7 │ │ 8 from albumentations.augmentations.functional import convolve │ │ │ │ /usr/local/lib/python3.10/dist-packages/cv2/init.py:181 in │ │ │ │ 178 │ if DEBUG: print('OpenCV loader: DONE') │ │ 179 │ │ 180 │ │ ❱ 181 bootstrap() │ │ 182 │ │ │ │ /usr/local/lib/python3.10/dist-packages/cv2/init.py:153 in bootstrap │ │ │ │ 150 │ │ │ 151 │ py_module = sys.modules.pop("cv2") │ │ 152 │ │ │ ❱ 153 │ native_module = importlib.import_module("cv2") │ │ 154 │ │ │ 155 │ sys.modules["cv2"] = py_module │ │ 156 │ setattr(py_module, "_native", native_module) │ │ │ │ 
/usr/lib/python3.10/importlib/init.py:126 in import_module │ │ │ │ 123 │ │ │ if character != '.': │ │ 124 │ │ │ │ break │ │ 125 │ │ │ level += 1 │ │ ❱ 126 │ return _bootstrap._gcd_import(name[level:], package, level) │ │ 127 │ │ 128 │ │ 129 _RELOADING = {} │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ ImportError: libGL.so.1: cannot open shared object file: No such file or directory ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /usr/local/bin/accelerate:8 in │ │ │ │ 5 from accelerate.commands.accelerate_cli import main │ │ 6 if name == 'main': │ │ 7 │ sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0]) │ │ ❱ 8 │ sys.exit(main()) │ │ 9 │ │ │ │ /usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py:45 in main │ │ │ │ 42 │ │ exit(1) │ │ 43 │ │ │ 44 │ # Run │ │ ❱ 45 │ args.func(args) │ │ 46 │ │ 47 │ │ 48 if name == "main": │ │ │ │ /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:1104 in launch_command │ │ │ │ 1101 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │ │ 1102 │ │ sagemaker_launcher(defaults, args) │ │ 1103 │ else: │ │ ❱ 1104 │ │ simple_launcher(args) │ │ 1105 │ │ 1106 │ │ 1107 def main(): │ │ │ │ /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:567 in simple_launcher │ │ │ │ 564 │ process = subprocess.Popen(cmd, env=current_env) │ │ 565 │ process.wait() │ │ 566 │ if process.returncode != 0: │ │ ❱ 567 │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │ │ 568 │ │ 569 │ │ 570 def multi_gpu_launcher(args): │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ CalledProcessError: Command '['/usr/bin/python', 'train_network.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=/share/images', '--resolution=512,512', '--output_dir=/share/models', '--logging_dir=/share/logs', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=128', '--output_name=Addams', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=1050', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
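
This traceback actually fails earlier than the original report: importing cv2 (pulled in through albumentations) dies because libGL.so.1 is missing, which is common in Docker images and minimal server installs rather than a kohya_ss problem as such. Two common workarounds, sketched under the assumption of a Debian/Ubuntu base and the stock opencv-python wheel:

# Option 1: install the OpenGL runtime that the cv2 wheel links against
# (the package is called libgl1-mesa-glx on older Ubuntu releases).
sudo apt-get update && sudo apt-get install -y libgl1

# Option 2: switch to the headless OpenCV wheel, which has no libGL dependency.
pip uninstall -y opencv-python
pip install opencv-python-headless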

riskay99 commented 1 year ago

For me the issue was a broken conda install; reinstalling conda seemed to fix it.
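
After reinstalling, one way to confirm the 8-bit path works end to end before kicking off a long run (a minimal sketch, not from the original setup; assumes a CUDA-capable torch and bitsandbytes in the active environment):

python - <<'EOF'
# Hypothetical smoke test: a parameter large enough (>4096 elements) to hit the
# blockwise 8-bit update that was raising the NameError above.
import torch
import bitsandbytes as bnb

p = torch.nn.Parameter(torch.randn(128, 128, device="cuda"))
opt = bnb.optim.AdamW8bit([p], lr=1e-4)
p.sum().backward()
opt.step()
print("AdamW8bit step OK")
EOF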

moxSedai commented 1 year ago

I still have a similar issue, but not with conda.

To create a public link, set share=True in launch().
21:28:48-438984 INFO Start training LoRA Standard ...
21:28:48-439679 INFO Valid image folder names found in: /home/max/Pictures/LoRA/training_testing/LORA/image/
21:28:48-440361 INFO Folder 100_training_testing: 223 images found
21:28:48-440811 INFO Folder 100_training_testing: 22300 steps
21:28:48-441207 INFO Total steps: 22300
21:28:48-441579 INFO Train batch size: 2
21:28:48-441963 INFO Gradient accumulation steps: 1
21:28:48-442340 INFO Epoch: 1
21:28:48-442691 INFO Regulatization factor: 1
21:28:48-443091 INFO max_train_steps (22300 / 2 / 1 1 1) = 11150
21:28:48-443556 INFO stop_text_encoder_training = 0
21:28:48-443932 INFO lr_warmup_steps = 0
21:28:48-444350 INFO accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="/media/big_hhd/applications/stable-diffusion-webui/models/Stable-diffusion/AnythingV5_v5RE.ckpt"
--train_data_dir="/home/max/Pictures/LoRA/training_testing/LORA/image/" --resolution="768,768" --output_dir="/home/max/Pictures/LoRA/training_testing/LORA/model/ "
--logging_dir="/home/max/Pictures/LoRA/training_testing/LORA/log/" --network_alpha="128" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-05 --unet_lr=0.0001 --network_dim=128
--output_name="tests" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="11150" --save_every_n_epochs="1" --mixed_precision="bf16"
--save_precision="bf16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --xformers --bucket_no_upscale
2023-06-23 21:28:49.900589: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. 2023-06-23 21:28:49.976795: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-06-23 21:28:50.326511: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT [21:28:50] WARNING The following values were not passed to accelerate launch and had defaults used instead: launch.py:890 --num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
2023-06-23 21:28:52.016254: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT prepare tokenizer Using DreamBooth method. prepare images. found directory /home/max/Pictures/LoRA/training_testing/LORA/image/100_training_testing contains 223 image files 22300 train images with repeating. 0 reg images. no regularization images / 正則化画像が見つかりませんでした [Dataset 0] batch_size: 2 resolution: (768, 768) enable_bucket: True min_bucket_reso: 256 max_bucket_reso: 1024 bucket_reso_steps: 64 bucket_no_upscale: True

[Subset 0 of Dataset 0] image_dir: "/home/max/Pictures/LoRA/training_testing/LORA/image/100_training_testing" image_count: 223 num_repeats: 100 shuffle_caption: False keep_tokens: 0 caption_dropout_rate: 0.0 caption_dropout_every_n_epoches: 0 caption_tag_dropout_rate: 0.0 color_aug: False flip_aug: False face_crop_aug_range: None random_crop: False token_warmup_min: 1, token_warmup_step: 0, is_reg: False class_tokens: testout caption_extension: .txt

[Dataset 0] loading image sizes. 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 6681.31it/s] make buckets min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) bucket 0: resolution (192, 128), count: 100 bucket 1: resolution (192, 192), count: 300 bucket 2: resolution (256, 192), count: 500 bucket 3: resolution (256, 448), count: 100 bucket 4: resolution (320, 448), count: 100 bucket 5: resolution (320, 512), count: 100 bucket 6: resolution (448, 640), count: 100 bucket 7: resolution (448, 704), count: 100 bucket 8: resolution (448, 1152), count: 100 bucket 9: resolution (512, 640), count: 100 bucket 10: resolution (512, 768), count: 100 bucket 11: resolution (576, 384), count: 200 bucket 12: resolution (576, 448), count: 300 bucket 13: resolution (576, 640), count: 100 bucket 14: resolution (576, 704), count: 100 bucket 15: resolution (576, 832), count: 400 bucket 16: resolution (576, 896), count: 600 bucket 17: resolution (576, 960), count: 200 bucket 18: resolution (640, 448), count: 200 bucket 19: resolution (640, 512), count: 300 bucket 20: resolution (640, 576), count: 100 bucket 21: resolution (640, 768), count: 300 bucket 22: resolution (640, 832), count: 2400 bucket 23: resolution (640, 896), count: 3100 bucket 24: resolution (704, 320), count: 100 bucket 25: resolution (704, 512), count: 300 bucket 26: resolution (704, 704), count: 1100 bucket 27: resolution (704, 768), count: 600 bucket 28: resolution (704, 832), count: 100 bucket 29: resolution (768, 384), count: 200 bucket 30: resolution (768, 512), count: 500 bucket 31: resolution (768, 576), count: 1800 bucket 32: resolution (768, 640), count: 1000 bucket 33: resolution (768, 704), count: 100 bucket 34: resolution (768, 768), count: 1300 bucket 35: resolution (832, 448), count: 100 bucket 36: resolution (832, 576), count: 700 bucket 37: resolution (832, 640), count: 1900 bucket 38: resolution (832, 704), count: 300 bucket 39: resolution (896, 576), count: 500 bucket 40: resolution (896, 640), count: 1000 bucket 41: resolution (960, 576), count: 100 bucket 42: resolution (1024, 576), count: 600 mean ar error (without repeats): 0.02504783041525756 preparing accelerator /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:258: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead. warnings.warn( Using accelerator 0.15.0 or above. loading model for process 0/1 load StableDiffusion checkpoint: /media/big_hhd/applications/stable-diffusion-webui/models/Stable-diffusion/AnythingV5_v5RE.ckpt loading u-net: loading vae: loading text encoder: CrossAttention.forward has been replaced to enable xformers. import network module: networks.lora [Dataset 0] caching latents. 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 223/223 [00:20<00:00, 10.99it/s] create LoRA network. 
base dim (rank): 128, alpha: 128.0 neuron dropout: p=None, rank dropout: p=None, module dropout: p=None create LoRA for Text Encoder: 72 modules. create LoRA for U-Net: 192 modules. enable LoRA for text encoder enable LoRA for U-Net preparing optimizer, data loader etc. ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /media/big_hhd/applications/kohya_ss/train_network.py:873 in │ │ │ │ 870 │ args = parser.parse_args() │ │ 871 │ args = train_util.read_config_from_file(args, parser) │ │ 872 │ │ │ ❱ 873 │ train(args) │ │ 874 │ │ │ │ /media/big_hhd/applications/kohya_ss/train_network.py:262 in train │ │ │ │ 259 │ │ ) │ │ 260 │ │ trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.u │ │ 261 │ │ │ ❱ 262 │ optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable │ │ 263 │ │ │ 264 │ # dataloaderを準備する │ │ 265 │ # DataLoaderのプロセス数:0はメインプロセスになる │ │ │ │ /media/big_hhd/applications/kohya_ss/library/train_util.py:2699 in get_optimizer │ │ │ │ 2696 │ │ │ 2697 │ if optimizer_type == "AdamW8bit".lower(): │ │ 2698 │ │ try: │ │ ❱ 2699 │ │ │ import bitsandbytes as bnb │ │ 2700 │ │ except ImportError: │ │ 2701 │ │ │ raise ImportError("No bitsand bytes / bitsandbytesがインストールされていない │ │ 2702 │ │ print(f"use 8-bit AdamW optimizer | {optimizer_kwargs}") │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/init.py: │ │ 5 in │ │ │ │ 2 # │ │ 3 # This source code is licensed under the MIT license found in the │ │ 4 # LICENSE file in the root directory of this source tree. │ │ ❱ 5 from .optim import adam │ │ 6 from .nn import modules │ │ 7 print('='30 + 'WARNING: DEPRECATED!' + '='30) │ │ 8 print('WARNING! This version of bitsandbytes is deprecated. Please switch to `pip instal │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/optim/init │ │ .py:5 in │ │ │ │ 2 # │ │ 3 # This source code is licensed under the MIT license found in the │ │ 4 # LICENSE file in the root directory of this source tree. │ │ ❱ 5 from .adam import Adam, Adam8bit, Adam32bit │ │ 6 from .adamw import AdamW, AdamW8bit, AdamW32bit │ │ 7 from .sgd import SGD, SGD8bit, SGD32bit │ │ 8 from .lars import LARS, LARS8bit, LARS32bit, PytorchLARS │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/optim/adam.p │ │ y:11 in │ │ │ │ 8 │ │ 9 import torch │ │ 10 import torch.distributed as dist │ │ ❱ 11 from bitsandbytes.optim.optimizer import Optimizer2State │ │ 12 import bitsandbytes.functional as F │ │ 13 │ │ 14 class Adam(Optimizer2State): │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/optim/optimi │ │ zer.py:6 in │ │ │ │ 3 # This source code is licensed under the MIT license found in the │ │ 4 # LICENSE file in the root directory of this source tree. 
│ │ 5 import torch │ │ ❱ 6 import bitsandbytes.functional as F │ │ 7 │ │ 8 from copy import deepcopy │ │ 9 from itertools import chain │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/functional.p │ │ y:13 in │ │ │ │ 10 from torch import Tensor │ │ 11 from typing import Tuple │ │ 12 │ │ ❱ 13 lib = ct.cdll.LoadLibrary(os.path.dirname(file) + '/libbitsandbytes.so') │ │ 14 name2qmap = {} │ │ 15 │ │ 16 ''' C FUNCTIONS FOR OPTIMIZERS ''' │ │ │ │ /usr/lib/python3.10/ctypes/init.py:452 in LoadLibrary │ │ │ │ 449 │ │ return getattr(self, name) │ │ 450 │ │ │ 451 │ def LoadLibrary(self, name): │ │ ❱ 452 │ │ return self._dlltype(name) │ │ 453 │ │ │ 454 │ class_getitem__ = classmethod(_types.GenericAlias) │ │ 455 │ │ │ │ /usr/lib/python3.10/ctypes/init.py:374 in init │ │ │ │ 371 │ │ self._FuncPtr = _FuncPtr │ │ 372 │ │ │ │ 373 │ │ if handle is None: │ │ ❱ 374 │ │ │ self._handle = _dlopen(self._name, mode) │ │ 375 │ │ else: │ │ 376 │ │ │ self._handle = handle │ │ 377 │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ OSError: libcusparse.so.11: cannot open shared object file: No such file or directory ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /media/big_hhd/applications/kohya_ss/venv/bin/accelerate:8 in │ │ │ │ 5 from accelerate.commands.accelerate_cli import main │ │ 6 if name == 'main': │ │ 7 │ sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0]) │ │ ❱ 8 │ sys.exit(main()) │ │ 9 │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accel │ │ erate_cli.py:45 in main │ │ │ │ 42 │ │ exit(1) │ │ 43 │ │ │ 44 │ # Run │ │ ❱ 45 │ args.func(args) │ │ 46 │ │ 47 │ │ 48 if name == "main__": │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launc │ │ h.py:918 in launch_command │ │ │ │ 915 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │ │ 916 │ │ sagemaker_launcher(defaults, args) │ │ 917 │ else: │ │ ❱ 918 │ │ simple_launcher(args) │ │ 919 │ │ 920 │ │ 921 def main(): │ │ │ │ /media/big_hhd/applications/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launc │ │ h.py:580 in simple_launcher │ │ │ │ 577 │ process.wait() │ │ 578 │ if process.returncode != 0: │ │ 579 │ │ if not args.quiet: │ │ ❱ 580 │ │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │ │ 581 │ │ else: │ │ 582 │ │ │ sys.exit(1) │ │ 583 │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ CalledProcessError: Command '['/media/big_hhd/applications/kohya_ss/venv/bin/python', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=/media/big_hhd/applications/stable-diffusion-webui/models/Stable-diffusion/AnythingV5_v5RE.ckpt', '--train_data_dir=/home/max/Pictures/LoRA/training_testing/LORA/image/', '--resolution=768,768', '--output_dir=/home/max/Pictures/LoRA/training_testing/LORA/model/', '--logging_dir=/home/max/Pictures/LoRA/training_testing/LORA/log/', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=128', '--output_name=test', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=11150', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--seed=1234', 
'--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
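
Here the failure is one step earlier again: bitsandbytes' prebuilt libbitsandbytes.so can't dlopen libcusparse.so.11, i.e. the CUDA 11 toolkit libraries aren't visible to the dynamic loader, even though torch itself clearly runs (the latents were cached on the GPU just fine). A quick check and a common workaround (a sketch; the CUDA path below is an example and needs to match wherever your CUDA 11.x toolkit actually lives):

# Is the CUDA 11 sparse library visible to the loader?
ldconfig -p | grep libcusparse

# If the toolkit is installed but not on the library path, export it before
# launching the GUI (example path, adjust to your CUDA 11.x install):
export LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:$LD_LIBRARY_PATH
./gui.sh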

LalleSX commented 1 year ago

Looks like a failure somewhere in the AdamW8bit optimizer, i.e. the bitsandbytes 8-bit path.

I had the same issue and changed the optimizer from AdamW8bit to AdamW (see attached screenshot: pic-selected-230801-1952-52).
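
For anyone who wants to make the same change outside the GUI: it boils down to passing --optimizer_type="AdamW" instead of --optimizer_type="AdamW8bit", which sidesteps bitsandbytes entirely. A trimmed example built from the commands earlier in this thread (paths and the output name are placeholders):

# Only --optimizer_type differs from the failing runs above; plain AdamW comes
# from PyTorch itself, so no bitsandbytes CUDA binary is needed.
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --train_data_dir="/path/to/image" --output_dir="/path/to/model" --logging_dir="/path/to/log" \
  --resolution="512,512" --network_module=networks.lora --network_dim=128 --network_alpha=128 \
  --optimizer_type="AdamW" --learning_rate="0.0001" --lr_scheduler="constant" \
  --train_batch_size="2" --max_train_steps="1050" --save_every_n_epochs="1" \
  --mixed_precision="bf16" --save_precision="bf16" --save_model_as=safetensors \
  --output_name="example" --caption_extension=".txt" --cache_latents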

@riskay99 @caiyesd @maxerature