bmaltais / kohya_ss

Apache License 2.0

self.size(-1) must be divisible by 4? 'charmap' codec can't decode byte 0x8d? Unable to load weights from checkpoint? #125

Closed. DivinoAG closed this issue 1 year ago

DivinoAG commented 1 year ago

Hello, I'm getting a number of different errors when attempting to run a LoRA training session, and I can't pinpoint the cause. I hope someone here has some insight. I'm running this on a mobile 3060.

Below is my error log.

prepare dataset
prepare accelerator
Using accelerator 0.15.0 or above.
load Diffusers pretrained models
text_encoder\model.safetensors not found
Fetching 19 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\modeling_utils.py", line 99, in load_state_dict
    return safetensors.torch.load_file(checkpoint_file, device="cpu")
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\safetensors\torch.py", line 100, in load_file
    result[k] = f.get_tensor(k)
RuntimeError: self.size(-1) must be divisible by 4 to view Byte as Float (different element sizes), but got 1627821

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\modeling_utils.py", line 103, in load_state_dict
    if f.read().startswith("version"):
  File "C:\Users\andre\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 83705: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Kohya\kohya_ss\train_network.py", line 542, in <module>
    train(args)
  File "C:\Kohya\kohya_ss\train_network.py", line 152, in train
    text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype)
  File "C:\Kohya\kohya_ss\library\train_util.py", line 1527, in load_target_model
    pipe = StableDiffusionPipeline.from_pretrained(args.pretrained_model_name_or_path, tokenizer=None, safety_checker=None)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\pipeline_utils.py", line 709, in from_pretrained
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\modeling_utils.py", line 489, in from_pretrained
    state_dict = load_state_dict(model_file)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\modeling_utils.py", line 115, in load_state_dict
    raise OSError(
OSError: Unable to load weights from checkpoint file for 'C:\Users\andre/.cache\huggingface\diffusers\models--runwayml--stable-diffusion-v1-5\snapshots\39593d5650112b4cc580433f6b0435385882d819\unet\diffusion_pytorch_model.safetensors' at 'C:\Users\andre/.cache\huggingface\diffusers\models--runwayml--stable-diffusion-v1-5\snapshots\39593d5650112b4cc580433f6b0435385882d819\unet\diffusion_pytorch_model.safetensors'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Traceback (most recent call last):
  File "C:\Users\andre\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\andre\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Kohya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Kohya\\kohya_ss\\venv\\Scripts\\python.exe', 'train_network.py', '--bucket_reso_steps=1', '--bucket_no_upscale', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=C:/Temp/omgcsply lora/img', '--resolution=512,512', '--output_dir=C:/Temp/omgcsply lora/model', '--logging_dir=C:/Temp/omgcsply lora/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=1e-3', '--network_dim=8', '--output_name=omgcsply', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=6800', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--mem_eff_attn', '--gradient_checkpointing', '--bucket_no_upscale']' returned non-zero exit status 1.
bmaltais commented 1 year ago

I suspect this is an issue with the model you are using as the base for the training. If you use one of the quick-setting models, like SD1.5, does it work?
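One way to test that suspicion is to check whether the cached file is complete. The "must be divisible by 4 to view Byte as Float" message may point to a truncated or corrupted download. Below is a minimal sketch, assuming only the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes). The function name `check_safetensors` is hypothetical and not part of kohya_ss, diffusers, or safetensors.

```python
import json
import struct
from pathlib import Path


def check_safetensors(path):
    """Return (ok, message) for a .safetensors file, using only the file layout.

    Hypothetical helper: it compares the data size declared in the JSON header
    with the number of bytes actually present after the header.
    """
    path = Path(path)
    file_size = path.stat().st_size
    if file_size < 8:
        return False, "file is shorter than the 8-byte length prefix"
    with path.open("rb") as f:
        # The first 8 bytes hold the header length as an unsigned little-endian integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        if header_len > file_size - 8:
            return False, "header length exceeds file size"
        try:
            header = json.loads(f.read(header_len).decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            return False, f"header is not valid JSON: {exc}"
    # Largest end offset declared by any tensor, relative to the data section.
    declared = max(
        (entry["data_offsets"][1]
         for name, entry in header.items()
         if name != "__metadata__"),
        default=0,
    )
    actual = file_size - 8 - header_len
    if actual != declared:
        return False, f"data section is {actual} bytes, header declares {declared}"
    return True, "header and data section sizes agree"
```

Pointing it at the `unet\diffusion_pytorch_model.safetensors` path from the log above would show whether the download is incomplete. If the sizes disagree, deleting that snapshot from the Hugging Face cache and letting it download again is probably the simplest thing to try.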

DivinoAG commented 1 year ago

> I suspect this is an issue with the model you are using as the base for the training. If you use one of the quick-setting models, like SD1.5, does it work?

That is exactly the one I was using. I also tried my local SD 1.5 .ckpt, the same file I use with Easy Diffusion and the A1111 WebUI. That run got a little further, but then it failed with a different error: CUDA detection failed. Below is the relevant log:

CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda116.dll
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
Traceback (most recent call last):
  File "C:\Kohya\kohya_ss\train_db.py", line 346, in <module>
    train(args)
  File "C:\Kohya\kohya_ss\train_db.py", line 122, in train
    import bitsandbytes as bnb
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\autograd\_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\cextension.py", line 43, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\cextension.py", line 39, in get_instance
    cls._instance.initialize()
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\cextension.py", line 27, in initialize
    raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!
Traceback (most recent call last):
  File "C:\Users\andre\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\andre\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Kohya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "C:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Kohya\\kohya_ss\\venv\\Scripts\\python.exe', 'train_db.py', '--pretrained_model_name_or_path=C:\\sd-shared-files\\models\\sd-v1-5-pruned-emaonly.ckpt', '--train_data_dir=C:/Temp/omgcsply lora/img', '--resolution=512,512', '--output_dir=C:/Temp/omgcsply lora/model', '--logging_dir=C:/Temp/omgcsply lora/log', '--save_model_as=safetensors', '--max_data_loader_n_workers=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=6800', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--mem_eff_attn', '--gradient_checkpointing', '--xformers', '--use_8bit_adam', '--bucket_no_upscale']' returned non-zero exit status 1.
bmaltais commented 1 year ago

Sounds like a bad setup if it can't find the CUDA drivers. Make sure you don't already have local pip modules installed outside the venv.
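A rough way to check this, sketched with the standard library only: report whether the interpreter is running inside a venv, and list any copies of the relevant packages in the per-user site directory. The helper names and the watch list (`torch`, `bitsandbytes`, `xformers`) are assumptions chosen for illustration, not something kohya_ss provides.

```python
import site
import sys
from importlib import metadata


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def user_site_packages(names=("torch", "bitsandbytes", "xformers")):
    """List (name, version) for watched packages found in the per-user site directory.

    The default watch list is an assumption; pass other names as needed.
    """
    user_site = site.getusersitepackages()
    watched = {n.lower() for n in names}
    found = []
    # Only scan the user site directory, not the active environment.
    for dist in metadata.distributions(path=[user_site]):
        name = dist.metadata.get("Name")
        if name and name.lower() in watched:
            found.append((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    print("running inside a venv:", in_virtualenv())
    print("user site enabled:", site.ENABLE_USER_SITE)
    print("watched packages in user site:", user_site_packages())
```

Run it with the venv interpreter (`C:\Kohya\kohya_ss\venv\Scripts\python.exe` in the log above). If it reports packages in the user site while the user site is enabled, those copies could be shadowing the ones installed in the venv.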

DivinoAG commented 1 year ago

> Sounds like a bad setup if it can't find the CUDA drivers. Make sure you don't already have local pip modules installed outside the venv.

Could you be so kind as to describe in a bit more detail what you are talking about? I'm not sure I understand exactly what you mean. I installed Kohya following the official instructions and did nothing beyond that. I have SD UIs installed, such as WebUI and Easy Diffusion, but I didn't do any tinkering beyond what their respective installations require, and they can clearly use CUDA for their render processes. So I don't know what "local pip modules" could be affecting this, or how I would go about troubleshooting it.

bmaltais commented 1 year ago

It is hard to tell. I only create the GUI that allows you to use the kohya Python code. This error should really be brought to kohya in his main repo, as he is the one writing the code. I do my best to help, but those types of errors are outside my expertise, I am afraid.

DivinoAG commented 1 year ago

I guess I didn't realize they were separate things. I'll take a look at their repo then. Thanks.