AndreR opened 1 month ago
(Not OP but) I tried to get around this breakage by changing all problematic instances of `to` to `to_empty`, which lets it run.
However, even before these changes I already get a ton of warnings like:
/apps/bdi-venv-310-0.1.0-h109.240228c~jammy/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for decoder.norm_out.bias: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
/apps/bdi-venv-310-0.1.0-h109.240228c~jammy/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for decoder.conv_out.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
/apps/bdi-venv-310-0.1.0-h109.240228c~jammy/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for decoder.conv_out.bias: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
So perhaps this isn't the right solution.
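For anyone following along, here is a minimal sketch (plain PyTorch, not kohya's actual loading code) of what the `to` → `to_empty` change and the `assign=True` hint in those warnings amount to; the `nn.Linear` layer is just a stand-in for the real model:

```python
import torch
import torch.nn as nn

# A module built on the "meta" device has parameter shapes but no data, which is
# why a plain .to("cuda") / .to("cpu") fails ("Cannot copy out of meta tensor").
with torch.device("meta"):
    layer = nn.Linear(16, 16)

checkpoint = nn.Linear(16, 16).state_dict()  # stand-in for the real checkpoint

# Option A: allocate real but uninitialized storage first, then copy the checkpoint in.
layer.to_empty(device="cpu")
layer.load_state_dict(checkpoint)

# Option B: load directly onto the meta module with assign=True (torch >= 2.1),
# which replaces the meta parameters with the checkpoint tensors instead of
# copying into them -- copying into meta parameters is the no-op the warnings flag.
with torch.device("meta"):
    layer2 = nn.Linear(16, 16)
layer2.load_state_dict(checkpoint, assign=True)
```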
@carolynsoo So you have the same problem? Any solution so far?
I also had this issue and for me it went away after using a different VAE, as far as I remember. I am now using this one for Flux1-Dev: [Flux1DevVAE_stock.safetensors](https://civitai.com/models/735031?modelVersionId=821978)
Here is my setup: https://github.com/bmaltais/kohya_ss/issues/2701#issuecomment-2352433098 (Note: I haven't managed to successfully train a LoRA yet, but at least I am past this issue.)
@maxanier Yeah, now I'm getting another error message:
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-09-19 19:08:59 INFO Building CLIP flux_utils.py:74
INFO Loading state dict from P:/AI/Flux Encoders/clip_l.safetensors flux_utils.py:167
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:170
INFO Loading state dict from P:/AI/Flux Encoders/t5xxl_fp16.safetensors flux_utils.py:215
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:218
INFO Building Flux model dev flux_utils.py:45
INFO Loading state dict from P:/AI/Flux Encoders/flux1-dev.safetensors flux_utils.py:52
2024-09-19 19:09:00 INFO Loaded Flux: <All keys matched successfully> flux_utils.py:55
INFO enable block swap: double_blocks_to_swap=0, single_blocks_to_swap=0 flux_train.py:272
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
INFO use 8-bit AdamW optimizer | {} train_util.py:4383
override steps. steps for 2 epochs is / 指定エポックまでのステップ数: 320
Traceback (most recent call last):
File "P:\AI\kohya_ss\sd-scripts\flux_train.py", line 908, in <module>
train(args)
File "P:\AI\kohya_ss\sd-scripts\flux_train.py", line 387, in train
lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
File "P:\AI\kohya_ss\sd-scripts\library\train_util.py", line 4724, in get_scheduler_fix
return schedule_func(
TypeError: get_cosine_schedule_with_warmup() got an unexpected keyword argument 'num_decay_steps'
Traceback (most recent call last):
File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "P:\AI\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['P:\\AI\\kohya_ss\\venv\\Scripts\\python.exe', 'P:/AI/kohya_ss/sd-scripts/flux_train.py', '--config_file', 'P:/AI/LoRA/config_dreambooth-20240919-190842.toml']' returned non-zero exit status 1.
@carolynsoo So you have the same problem? Any solution so far?
I got around it as I described, but my LoRA weights are all zeros or small values around 2e-6 (which isn't a total surprise given all those warnings I mentioned earlier), so I gave up on using kohya for Flux for now since I didn't want to get into the torch weeds.
@maxanier Yeah, now I'm getting another error message: […] TypeError: get_cosine_schedule_with_warmup() got an unexpected keyword argument 'num_decay_steps' […]
This is an unrelated error, and imo is due to versioning issues between the different repos/packages. Regardless, you can usually dodge it by altering your config file to remove the conflicting args and/or modifying the scripts to tolerate unexpected kwargs.
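As a concrete illustration of the second option, here is a rough sketch (my own hypothetical helper, not kohya's code) that drops keyword arguments the installed `get_cosine_schedule_with_warmup` does not accept; the simpler route is usually just removing the extra scheduler args (such as `num_decay_steps`) from the config:

```python
import inspect
from transformers.optimization import get_cosine_schedule_with_warmup

def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, silently dropping keyword arguments its signature doesn't accept."""
    accepted = set(inspect.signature(func).parameters)
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return func(*args, **filtered)

# Hypothetical use at the failing call site in get_scheduler_fix:
# lr_scheduler = call_with_supported_kwargs(
#     get_cosine_schedule_with_warmup,
#     optimizer,
#     num_warmup_steps=0,
#     num_training_steps=320,
#     num_decay_steps=64,  # ignored if the installed version doesn't support it
# )
```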
@carolynsoo I finally got it to work with the 16-bit model, but it's super slow since I have to use split mode. Do you know of any reliable alternative software that works with Flux?
Has anyone developed a better understanding of this issue? I was experimenting with training Flux UNets in Dreambooth today. If I set the model to train Flux Dev.1, it worked with no issues (well, I still have result issues, but no errors). I also tried it on a random Flux UNet from Civitai, and it worked fine. However, when I try it on a custom UNet of mine (Dev merged with some LoRAs), I get this error. My custom UNet works just like Flux Dev everywhere else, so I can't figure out why it causes this error. Any ideas? Also, if you have a solid config JSON for a full Flux checkpoint finetune (not behind a paywall), I would love a copy of it. Not a LoRA script - I have that working fine.
Has anyone developed a better understanding of this issue? […] However, when I try it on a custom UNet of mine (Dev merged with some LoRAs), I get this error. […] Any ideas?
Merged Flux models that bundle components like the UNet + VAE + text encoders will often have the key names adjusted to identify which weights belong to which component.
When I do a training run with that kind of merged model, there's a very long error like this:
_IncompatibleKeys(missing_keys=['img_in.weight', 'img_in.bias',
'time_in.in_layer.weight', 'time_in.in_layer.bias', 'time_in.out_layer.weight',
'time_in.out_layer.bias', 'vector_in.in_layer.weight', 'vector_in.in_layer.bias',
'vector_in.out_layer.weight', 'vector_in.out_layer.bias',
'guidance_in.in_layer.weight', 'guidance_in.in_layer.bias',
'guidance_in.out_layer.weight', 'guidance_in.out_layer.bias', 'txt_in.weight',
'txt_in.bias', 'double_blocks.0.img_mod.lin.weight',
'double_blocks.0.img_mod.lin.bias', 'double_blocks.0.img_attn.qkv.weight',
'double_blocks.0.img_attn.qkv.bias', 'double_blocks.0.img_attn.norm.query_norm.scale',
'double_blocks.0.img_attn.norm.key_norm.scale', 'double_blocks.0.img_attn.proj.weight',
.......
],
unexpected_keys=['model.diffusion_model.double_blocks.0.img_attn.norm.key_norm.scale',
'model.diffusion_model.double_blocks.0.img_attn.norm.query_norm.scale',
'model.diffusion_model.double_blocks.0.img_attn.proj.bias',
'model.diffusion_model.double_blocks.0.img_attn.proj.weight',
'model.diffusion_model.double_blocks.0.img_attn.qkv.bias',
.......
]
It says "missing_keys" on elements like double_blocks.0.img_attn.......
however they are present with the prefix "model.diffusion_model." which shows a key name of model.diffusion_model.double_blocks.0.img_attn......
for example. So I think the issue here is the training script currently skips loading those Unet weights if they're identified with those prefixed key names, hence the "no data" error.
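If that diagnosis is right, one workaround is to rewrite the checkpoint keys before training. Here is a minimal sketch using the safetensors API (file names are placeholders, and I haven't verified that the trainer accepts the result):

```python
from safetensors.torch import load_file, save_file

PREFIX = "model.diffusion_model."

# Load the merged checkpoint and keep only the UNet weights, with the
# "model.diffusion_model." prefix stripped so the keys match what the
# training script expects for a bare Flux UNet.
state_dict = load_file("merged_flux_checkpoint.safetensors")   # placeholder path
unet_only = {
    key[len(PREFIX):]: tensor
    for key, tensor in state_dict.items()
    if key.startswith(PREFIX)
}
save_file(unet_only, "flux_unet_only.safetensors")             # placeholder path
```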
I want to train a custom model for Flux and downloaded all the necessary files. However, as soon as I start the training, I get the following error:
Any idea how I can fix this? I found some hints that it could have to do with an invalid VAE, but that's the only one I can find for the Dev version. By the way, I'm running a 4090.