bmaltais / kohya_ss

Apache License 2.0

Exit Status 1 Training has Ended #2474

Open samuelkurt opened 1 month ago

samuelkurt commented 1 month ago

I'm trying to train my own model on Windows (since kohya_ss wouldn't launch on Linux). It ended up launching on Windows, but every time I try to start training it gets stuck at "Command executed" before throwing this long error message:

Log from CMD:

```
Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().
21:14:56-092646 INFO     Destination training directory is missing... can't perform the required task...
21:15:01-645455 INFO     Destination training directory is missing... can't perform the required task...
21:15:24-777659 INFO     Copy C:/Users/Admin/Desktop/test to C:/Users/Admin/Desktop/training\img/40_qwertzu robot...
21:15:25-637630 INFO     Regularization images directory is missing... not copying regularisation images...
21:15:25-640630 INFO     Done creating kohya_ss training folder structure at C:/Users/Admin/Desktop/training...
21:15:58-310507 INFO     Start training Dreambooth...
21:15:58-311506 INFO     Validating lr scheduler arguments...
21:15:58-316506 INFO     Validating optimizer arguments...
21:15:58-317506 INFO     Validating model file or folder path C:/Users/Admin/Documents/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors existence...
21:15:58-319506 INFO     ...valid
21:15:58-321506 INFO     Validating output_dir path C:/Users/Admin/Desktop/output existence...
21:15:58-324506 INFO     ...valid
21:15:58-325506 INFO     Validating train_data_dir path C:/Users/Admin/Desktop/test existence...
21:15:58-327505 INFO     ...valid
21:15:58-328506 INFO     reg_data_dir not specified, skipping validation
21:15:58-329505 INFO     logging_dir not specified, skipping validation
21:15:58-331505 INFO     log_tracker_config not specified, skipping validation
21:15:58-332506 INFO     resume not specified, skipping validation
21:15:58-334506 INFO     vae not specified, skipping validation
21:15:58-336506 INFO     dataset_config not specified, skipping validation
21:15:58-348506 INFO     Regulatization factor: 1
21:15:58-350506 INFO     Total steps: 0
21:15:58-353506 INFO     Train batch size: 1
21:15:58-355506 INFO     Gradient accumulation steps: 1
21:15:58-357506 INFO     Epoch: 1
21:15:58-359505 INFO     Max train steps: 1600
21:15:58-360505 INFO     lr_warmup_steps = 160
21:15:58-396503 INFO     Saving training config to C:/Users/Admin/Desktop/output\last_20240509-211558.json...
21:15:58-400503 INFO     Executing command: "C:\Users\Admin\Documents\kohya_ss\venv\Scripts\accelerate.EXE" launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 "C:/Users/Admin/Documents/kohya_ss/sd-scripts/train_db.py" --config_file "./outputs/config_dreambooth-20240509-211558.toml" with shell=True
21:15:58-409503 INFO     Command executed.
Traceback (most recent call last):
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1390, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\transformers\models\clip\image_processing_clip.py", line 21, in <module>
    from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\transformers\image_processing_utils.py", line 28, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\transformers\image_transforms.py", line 47, in <module>
    import tensorflow as tf
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\tensorflow\__init__.py", line 42, in <module>
    from tensorflow.python import tf2 as _tf2
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\tensorflow\python\tf2.py", line 21, in <module>
    from tensorflow.python.platform import _pywrap_tf2
ImportError: DLL load failed while importing _pywrap_tf2: A dynamic link library (DLL) initialization routine failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 710, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 20, in <module>
    from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer, CLIPVisionModelWithProjection
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1381, in __getattr__
    value = getattr(module, name)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1380, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1392, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback):
DLL load failed while importing _pywrap_tf2: A dynamic link library (DLL) initialization routine failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Admin\Documents\kohya_ss\sd-scripts\train_db.py", line 23, in <module>
    import library.train_util as train_util
  File "C:\Users\Admin\Documents\kohya_ss\sd-scripts\library\train_util.py", line 46, in <module>
    from diffusers import (
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in __getattr__
    value = getattr(module, name)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in __getattr__
    value = getattr(module, name)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 700, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 712, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback):
DLL load failed while importing _pywrap_tf2: A dynamic link library (DLL) initialization routine failed.
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Admin\Documents\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Admin\Documents\kohya_ss\venv\Scripts\python.exe', 'C:/Users/Admin/Documents/kohya_ss/sd-scripts/train_db.py', '--config_file', './outputs/config_dreambooth-20240509-211558.toml']' returned non-zero exit status 1.
21:16:17-920832 INFO     Training has ended.
```

I don't know what I'm doing wrong that makes this happen every time I try to train my own model.

bmaltais commented 1 month ago

There appears to be an issue with tensorflow on your system... Have you tried deleting the venv folder and running setup.bat again? Sometimes things can go bad and the venv gets corrupted.

samuelkurt commented 1 month ago

Did that, still the same issue.

samuelkurt commented 1 month ago

What are all the requirements I need to install? Maybe I've just forgotten one.

bmaltais commented 1 month ago

They are listed in the README on the main page... Not many, really... Some have reported that they uninstalled every Python, CUDA, and NVIDIA driver package from their computer, then re-installed everything from scratch following the instructions, and things worked fine after that...

Windows is a complex beast, and one bad install or piece of software can lead to a lot of strange things.

samuelkurt commented 1 month ago

Did that, still got the same issue

b-fission commented 1 month ago

It's related to lack of AVX. (again)

I can reproduce the same error by doing bcdedit /set xsavedisable 1 and rebooting Windows.

@samuelkurt What is your CPU model? And are you running kohya_ss on a virtual machine?
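
A quick way to check whether the CPU reports AVX from inside the venv is a short Python snippet; this is just a sketch and assumes the third-party py-cpuinfo package (`pip install py-cpuinfo`), which is not part of kohya_ss:

```python
# Quick AVX check -- assumes the optional py-cpuinfo package is installed.
import cpuinfo

info = cpuinfo.get_cpu_info()
flags = info.get("flags", [])
print("CPU:", info.get("brand_raw", "unknown"))
print("AVX supported: ", "avx" in flags)
print("AVX2 supported:", "avx2" in flags)
```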

samuelkurt commented 1 month ago

It's kind of embarrassing, but it's an Intel Core i7-980X @ 3.33 GHz, and no, I'm not running it in a VM.

b-fission commented 1 month ago

Okay, that CPU is from 2010.

So, the official builds of tensorflow are compiled with the assumption that AVX is available. Since your CPU is too old and doesn't support AVX, you need to run a build of tensorflow 2.16 compiled without AVX.
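
One way to isolate this from the GUI (just a sketch, run inside the kohya_ss venv) is a bare import of tensorflow; if the installed wheel requires AVX and the CPU lacks it, the import alone fails with the same _pywrap_tf2 DLL error seen in the log:

```python
# Minimal smoke test for the installed tensorflow wheel, independent of kohya_ss.
try:
    import tensorflow as tf
    print("tensorflow", tf.__version__, "imported OK")
except ImportError as e:
    print("tensorflow failed to import:", e)
```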

b-fission commented 1 month ago

I went ahead and built tensorflow and bitsandbytes without AVX for Windows. No guarantees that it'll work, so consider it experimental. Files and instructions are here.

dike1080 commented 1 month ago

I have the same problem as you

samuelkurt commented 1 month ago

It booted up and went as far as trying to train, but got stuck at 0 steps for 2 minutes before throwing the same error.

b-fission commented 1 month ago

Oops, I forgot to mention editing the requirements_windows.txt file. The GUI will reinstall the original tensorflow and bitsandbytes packages because the version numbers are different.

I've updated the instructions to fix that.

Simply edit the requirements file as shown, then install the whl files like before.

samuelkurt commented 1 month ago

Still got the same issue

b-fission commented 1 month ago

Can you paste the error log? I'm not sure what else could be going on.

samuelkurt commented 1 month ago

Here you go:

```
11:06:03-155705 INFO     Copy C:/Users/Admin/Documents/test to C:/Users/Admin/Documents/training\img/40_qwertzu robot...
11:06:05-182708 INFO     Regularization images directory is missing... not copying regularisation images...
11:06:05-186709 INFO     Done creating kohya_ss training folder structure at C:/Users/Admin/Documents/training...
11:06:05-640713 INFO     Start training Dreambooth...
11:06:05-643710 INFO     Validating lr scheduler arguments...
11:06:05-646708 INFO     Validating optimizer arguments...
11:06:05-648707 INFO     Validating C:/Users/Admin/Documents/training\log existence and writability... SUCCESS
11:06:05-651707 INFO     Validating C:/Users/Admin/Documents/training\model existence and writability... SUCCESS
11:06:05-652708 INFO     Validating runwayml/stable-diffusion-v1-5 existence... SKIPPING: huggingface.co model
11:06:05-655712 INFO     Validating C:/Users/Admin/Documents/training\img existence... SUCCESS
11:06:05-657706 INFO     Folder 40_qwertzu robot: 40 repeats found
11:06:05-660712 INFO     Folder 40_qwertzu robot: 30 images found
11:06:05-663713 INFO     Folder 40_qwertzu robot: 30 * 40 = 1200 steps
11:06:05-665713 INFO     Regulatization factor: 1
11:06:05-666707 INFO     Total steps: 1200
11:06:05-668712 INFO     Train batch size: 1
11:06:05-671713 INFO     Gradient accumulation steps: 1
11:06:05-673711 INFO     Epoch: 1
11:06:05-674712 INFO     Max train steps: 1600
11:06:05-676707 INFO     lr_warmup_steps = 160
11:06:05-716712 INFO     Saving training config to C:/Users/Admin/Documents/training\model\last_20240602-110605.json...
11:06:05-719712 INFO     Executing command: C:\Users\Admin\Documents\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 C:/Users/Admin/Documents/kohya_ss/sd-scripts/train_db.py --config_file C:/Users/Admin/Documents/training\model/config_dreambooth-20240602-110605.toml
11:06:07-510709 INFO     Command executed.
2024-06-02 11:07:33 INFO     Loading settings from C:/Users/Admin/Documents/training\model/config_dreambooth-20240602-110605.toml...  train_util.py:3744
                    INFO     C:/Users/Admin/Documents/training\model/config_dreambooth-20240602-110605  train_util.py:3763
2024-06-02 11:07:33 INFO     prepare tokenizer  train_util.py:4227
2024-06-02 11:07:35 INFO     update token length: 75  train_util.py:4244
                    INFO     prepare images.  train_util.py:1572
                    INFO     found directory C:\Users\Admin\Documents\training\img\40_qwertzu robot contains 30 image files  train_util.py:1519
                    WARNING  No caption file found for 30 images. Training will continue without captions for these images. If class token exists, it will be used. / 30枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。  train_util.py:1550
                    WARNING  C:\Users\Admin\Documents\training\img\40_qwertzu robot\test1.png  train_util.py:1557
                    WARNING  C:\Users\Admin\Documents\training\img\40_qwertzu robot\test10.png  train_util.py:1557
                    WARNING  C:\Users\Admin\Documents\training\img\40_qwertzu robot\test11.png  train_util.py:1557
                    WARNING  C:\Users\Admin\Documents\training\img\40_qwertzu robot\test12.png  train_util.py:1557
                    WARNING  C:\Users\Admin\Documents\training\img\40_qwertzu robot\test13.png  train_util.py:1557
                    WARNING  C:\Users\Admin\Documents\training\img\40_qwertzu robot\test14.png... and 25 more  train_util.py:1555
                    INFO     1200 train images with repeating.  train_util.py:1613
                    INFO     0 reg images.  train_util.py:1616
                    WARNING  no regularization images / 正則化画像が見つかりませんでした  train_util.py:1621
                    INFO     [Dataset 0]  config_util.py:565
                               batch_size: 1
                               resolution: (512, 512)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 2048
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                           [Subset 0 of Dataset 0]
                             image_dir: "C:\Users\Admin\Documents\training\img\40_qwertzu robot"
                             image_count: 30
                             num_repeats: 40
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             secondary_separator: None
                             enable_wildcard: False
                             caption_dropout_rate: 0.0
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             is_reg: False
                             class_tokens: qwertzu robot
                             caption_extension: .txt

                INFO     [Dataset 0]                                                              config_util.py:571
                INFO     loading image sizes.                                                      train_util.py:853

100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 61.22it/s]
2024-06-02 11:07:36 INFO     make buckets  train_util.py:859
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます  train_util.py:876
                    INFO     number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)  train_util.py:905
                    INFO     bucket 0: resolution (512, 512), count: 1200  train_util.py:910
                    INFO     mean ar error (without repeats): 0.0  train_util.py:915
                    INFO     prepare accelerator  train_db.py:106
accelerator device: cuda
                    INFO     loading model for process 0/1  train_util.py:4385
                    INFO     load Diffusers pretrained models: runwayml/stable-diffusion-v1-5  train_util.py:4347
model_index.json: 100%|████████████████████████████████████████████████████████████████| 541/541 [00:00<00:00, 537kB/s]
C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\huggingface_hub\file_download.py:149: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Admin\.cache\huggingface\hub\models--runwayml--stable-diffusion-v1-5. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
(…)ature_extractor/preprocessor_config.json: 100%|████████████████████████████████████████████| 342/342 [00:00<?, ?B/s]
safety_checker/config.json: 100%|█████████████████████████████████████████████████████████| 4.72k/4.72k [00:00<?, ?B/s]
vae/config.json: 100%|████████████████████████████████████████████████████████████████████████| 547/547 [00:00<?, ?B/s]
unet/config.json: 100%|███████████████████████████████████████████████████████████████████████| 743/743 [00:00<?, ?B/s]
scheduler/scheduler_config.json: 100%|████████████████████████████████████████████████████████| 308/308 [00:00<?, ?B/s]
text_encoder/config.json: 100%|███████████████████████████████████████████████████████████████| 617/617 [00:00<?, ?B/s]
diffusion_pytorch_model.safetensors: 100%|██████████████████████████████████████████| 335M/335M [01:00<00:00, 5.54MB/s]
model.safetensors: 100%|████████████████████████████████████████████████████████████| 492M/492M [01:04<00:00, 7.60MB/s]
diffusion_pytorch_model.safetensors: 100%|████████████████████████████████████████| 3.44G/3.44G [02:51<00:00, 20.1MB/s]
Fetching 10 files: 100%|███████████████████████████████████████████████████████████████| 10/10 [02:54<00:00, 17.43s/it]
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 5/5 [00:27<00:00, 5.45s/it]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
2024-06-02 11:10:58 INFO     UNet2DConditionModel: 64, 8, 768, False, False  original_unet.py:1387
2024-06-02 11:11:11 INFO     U-Net converted to original U-Net  train_util.py:4372
                    INFO     Enable xformers for U-Net  train_util.py:2660
2024-06-02 11:11:22 INFO     [Dataset 0]  train_util.py:2079
                    INFO     caching latents.  train_util.py:974
                    INFO     checking cache validity...  train_util.py:984
100%|██████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<?, ?it/s]
                    INFO     caching latents...  train_util.py:1021
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:57<00:00, 1.91s/it]
prepare optimizer, data loader etc.
2024-06-02 11:12:20 WARNING  Could not find the bitsandbytes CUDA binary at WindowsPath('C:/Users/Admin/Documents/kohya_ss/venv/lib/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.dll')  cextension.py:94
                    WARNING  The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.  cextension.py:101
                    INFO     use 8-bit AdamW optimizer | {}  train_util.py:3889
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 1200
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 1200
  num epochs / epoch数: 2
  batch size per device / バッチサイズ: 1
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1600
steps:   0%|                                                                                   | 0/1600 [00:00<?, ?it/s]
epoch 1/2
Traceback (most recent call last):
  File "C:\Users\Admin\Documents\kohya_ss\sd-scripts\train_db.py", line 529, in <module>
    train(args)
  File "C:\Users\Admin\Documents\kohya_ss\sd-scripts\train_db.py", line 386, in train
    optimizer.step()
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 416, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 315, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 185, in patched_step
    return method(*args, **kwargs)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\torch\optim\lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\bitsandbytes\optim\optimizer.py", line 287, in step
    self.update_step(group, p, gindex, pindex)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\bitsandbytes\optim\optimizer.py", line 542, in update_step
    F.optimizer_update_8bit_blockwise(
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\bitsandbytes\functional.py", line 1773, in optimizer_update_8bit_blockwise
    optim_func = str2optimizer8bit_blockwise[optimizer_name][0]
NameError: name 'str2optimizer8bit_blockwise' is not defined
steps:   0%|                                                                                   | 0/1600 [04:13<?, ?it/s]
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Admin\Documents\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\Admin\Documents\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Admin\Documents\kohya_ss\venv\Scripts\python.exe', 'C:/Users/Admin/Documents/kohya_ss/sd-scripts/train_db.py', '--config_file', 'C:/Users/Admin/Documents/training\model/config_dreambooth-20240602-110605.toml']' returned non-zero exit status 1.
11:16:54-771721 INFO     Training has ended.
```

bmaltais commented 1 month ago

Try installing the CUDA optional drivers using setup.bat

b-fission commented 1 month ago

That's really strange. bitsandbytes wants to use a codepath without cublasLt, even though pytorch already includes the DLL for that.

@samuelkurt I've uploaded a new build of bitsandbytes on this page which is 26 MB in size. Can you download and install that whl?
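
For anyone wanting to verify the new wheel before a full training run, a rough sketch (not an official diagnostic) is to exercise the same 8-bit optimizer path that raised the NameError, inside the kohya_ss venv on a machine with a CUDA-capable torch:

```python
# Rough check that bitsandbytes can reach the GPU and that its 8-bit blockwise
# optimizer path works -- a sketch, not an official diagnostic.
import torch
import bitsandbytes as bnb

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible to PyTorch")

# The tensor must be reasonably large, otherwise bitsandbytes silently falls
# back to its 32-bit path (the default min_8bit_size is 4096 elements).
p = torch.nn.Parameter(torch.randn(256, 256, device="cuda"))
opt = bnb.optim.AdamW8bit([p], lr=1e-3)
p.grad = torch.randn_like(p)
opt.step()  # the failing run raised NameError: str2optimizer8bit_blockwise here
print("8-bit AdamW step OK")
```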

samuelkurt commented 1 month ago

It seems to be working now, but it takes very long (133.05 s/it).

b-fission commented 1 month ago

That's... progress. But I have to wonder if it's fully using the GPU, perhaps because of a power management setting in Windows or the NVIDIA Control Panel. My laptop's GTX 1060 wasn't anywhere near that slow.

b-fission commented 1 month ago

@samuelkurt What if you changed these two settings in nvidia control panel?

CUDA Sysmem Fallback Policy = Prefer No Sysmem Fallback
Power management mode = Prefer maximum performance

(screenshot: NVIDIA Control Panel CUDA settings)

b-fission commented 1 month ago

Oh, I see now. You're doing Dreambooth training, but that requires a GPU with at least 10 GB of VRAM for SD1.5 models to get sane training speeds. Without enough VRAM, NVIDIA's sysmem fallback feature will let it keep running, but it will make it much slower.

I'd recommend training a LoRA with your current setup instead of Dreambooth.
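
The "at least 10 GB" figure refers to the card's dedicated VRAM, not the shared-memory total Windows reports; a small sketch in plain PyTorch (run in the venv) shows the dedicated amount:

```python
# Report the GPU's dedicated VRAM, as opposed to Windows' shared-memory total.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GB dedicated VRAM")
else:
    print("No CUDA device visible to PyTorch")
```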

samuelkurt commented 1 month ago

My graphics card has 5 GB of VRAM, but with shared memory it has a total of 13 GB. (And yes, it fills it up; I don't know if that will slow it down.)

b-fission commented 1 month ago

Using shared memory will very likely slow it down, especially for VRAM-heavy workloads like Dreambooth.

If you can get the training config to run entirely within your GPU's dedicated VRAM capacity (5 GB), the training speed should be a lot better than 100 s/it. Try training a LoRA and turn on Gradient Checkpointing.
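
In the GUI, Gradient Checkpointing is just a checkbox; for context, a sketch of what that option amounts to for the SD1.5 U-Net using the diffusers API (not kohya_ss code, and it downloads the UNet weights on first run):

```python
# Gradient checkpointing trades extra compute for a large reduction in
# activation memory during backprop -- illustrated here on the SD1.5 U-Net.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.enable_gradient_checkpointing()
print("gradient checkpointing enabled:", unet.is_gradient_checkpointing)
```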

samuelkurt commented 1 month ago

Update: LoRA training is way faster (2 s/it) and only takes up about 4 GB of VRAM. I'll let you know if it works. Edit: The training was successful and it generates pictures, but at very low quality. (Still, it's big progress.)

samuelkurt commented 4 weeks ago

It can train a LoRA no problem, but every generation is a disappointment, and the size of the LoRA is locked at 10 MB no matter how many pictures I use.

b-fission commented 4 weeks ago

The LoRA file size is determined by the Network Rank option.

If the trained LoRA still performs poorly, adjust the Network Alpha and learning rates as well.
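
Some back-of-the-envelope arithmetic on why the file size tracks Network Rank rather than the number of training images: for each adapted weight matrix, a LoRA stores two small matrices whose size depends only on the layer dimensions and the rank. The numbers below are illustrative for a single hypothetical 768x768 projection, not kohya_ss internals:

```python
# For a weight W (out_dim x in_dim), a LoRA adds A (rank x in_dim) and
# B (out_dim x rank); parameter count depends on rank, not on dataset size.
def lora_params(in_dim: int, out_dim: int, rank: int) -> int:
    return rank * in_dim + out_dim * rank

for rank in (8, 32):
    n = lora_params(768, 768, rank)
    print(f"rank {rank:2d}: {n:,} params (~{n * 2 / 1024:.0f} KB at fp16)")
```

A real LoRA repeats this over many attention and MLP modules, which is why doubling the rank roughly doubles the saved file size.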