bmaltais / kohya_ss

"Cannot copy out of meta tensor; no data!" right at the beginning of the training #2820

Open AndreR opened 1 month ago

AndreR commented 1 month ago

I want to train a custom model for Flux and downloaded all the necessary files. However, as soon as I start the training, I get the following error:

import network module: networks.lora_flux
Traceback (most recent call last):
  File "P:\AI\kohya_ss\sd-scripts\flux_train_network.py", line 519, in <module>
    trainer.train(args)
  File "P:\AI\kohya_ss\sd-scripts\train_network.py", line 383, in train
    vae.to(accelerator.device, dtype=vae_dtype)
  File "P:\AI\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1174, in to
    return self._apply(convert)
  File "P:\AI\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 780, in _apply
    module._apply(fn)
  File "P:\AI\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 780, in _apply
    module._apply(fn)
  File "P:\AI\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 780, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "P:\AI\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 805, in _apply
    param_applied = fn(param)
  File "P:\AI\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1167, in convert
    raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "P:\AI\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['P:\\AI\\kohya_ss\\venv\\Scripts\\python.exe', 'P:/AI/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', 'P:/AI/LoRA/config_lora-20240914-065124.toml']' returned non-zero exit status 1.
06:51:37-215508 INFO     Training has ended.

Any idea how I can fix this? I found some hints that it could have to do with an invalid VAE, but that's the only one I can find for the Dev version. By the way, I'm running a 4090.

carolynsoo commented 1 month ago

(Not OP, but) I tried to get around this breakage by changing all problematic instances of `to` to `to_empty`, which lets it run.

However, even before these changes I already get a ton of warnings like:

/apps/bdi-venv-310-0.1.0-h109.240228c~jammy/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for decoder.norm_out.bias: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
/apps/bdi-venv-310-0.1.0-h109.240228c~jammy/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for decoder.conv_out.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
/apps/bdi-venv-310-0.1.0-h109.240228c~jammy/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for decoder.conv_out.bias: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module 

So perhaps this isn't the right solution.
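
For reference, here is a minimal, self-contained sketch (plain PyTorch, not kohya code) of what the error and those warnings point at: a module built on the "meta" device has parameters with no data, so `.to()` cannot copy them, `.to_empty()` only allocates uninitialized storage, and the checkpoint only really lands if it is assigned rather than copied (`assign=True` needs PyTorch 2.1+).

```python
import torch
import torch.nn as nn

# Build a module on the "meta" device: parameters exist but carry no data.
with torch.device("meta"):
    model = nn.Linear(4, 4)

ckpt = nn.Linear(4, 4).state_dict()  # stand-in for a real checkpoint

# 1) A plain load_state_dict copies in place; copying into a meta parameter
#    is a no-op, which is exactly what the UserWarnings above are about.
model.load_state_dict(ckpt)

# 2) Moving the still-meta module then fails, because there is nothing to copy:
# model.to("cpu")  # NotImplementedError: Cannot copy out of meta tensor; no data!

# 3) Swapping .to() for .to_empty() makes it run, but it only allocates
#    *uninitialized* storage -- the checkpoint weights never arrive, which
#    would also explain a LoRA that trains to ~0.
model.to_empty(device="cpu")

# 4) What the warning hints at: assign the checkpoint tensors instead of
#    copying them, which materializes real weights from a meta-built module.
with torch.device("meta"):
    model = nn.Linear(4, 4)
model.load_state_dict(ckpt, assign=True)
print(model.weight.device)  # cpu, with real data
```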

AndreR commented 1 month ago

@carolynsoo So you have the same problem? Any solution so far?

maxanier commented 1 month ago

I also had this issue, and as far as I remember it went away after switching to a different VAE. I am now using this one for Flux1-Dev: [Flux1DevVAE_stock.safetensors](https://civitai.com/models/735031?modelVersionId=821978)

Here is my setup: https://github.com/bmaltais/kohya_ss/issues/2701#issuecomment-2352433098 (Note: I haven't managed to successfully train a LoRA yet, but at least I am past this issue.)

AndreR commented 1 month ago

@maxanier Yeah, now I'm getting another error message:

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-09-19 19:08:59 INFO     Building CLIP                                                              flux_utils.py:74
                    INFO     Loading state dict from P:/AI/Flux Encoders/clip_l.safetensors            flux_utils.py:167
                    INFO     Loaded CLIP: <All keys matched successfully>                              flux_utils.py:170
                    INFO     Loading state dict from P:/AI/Flux Encoders/t5xxl_fp16.safetensors        flux_utils.py:215
                    INFO     Loaded T5xxl: <All keys matched successfully>                             flux_utils.py:218
                    INFO     Building Flux model dev                                                    flux_utils.py:45
                    INFO     Loading state dict from P:/AI/Flux Encoders/flux1-dev.safetensors          flux_utils.py:52
2024-09-19 19:09:00 INFO     Loaded Flux: <All keys matched successfully>                               flux_utils.py:55
                    INFO     enable block swap: double_blocks_to_swap=0, single_blocks_to_swap=0       flux_train.py:272
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
                    INFO     use 8-bit AdamW optimizer | {}                                           train_util.py:4383
override steps. steps for 2 epochs is / 指定エポックまでのステップ数: 320
Traceback (most recent call last):
  File "P:\AI\kohya_ss\sd-scripts\flux_train.py", line 908, in <module>
    train(args)
  File "P:\AI\kohya_ss\sd-scripts\flux_train.py", line 387, in train
    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
  File "P:\AI\kohya_ss\sd-scripts\library\train_util.py", line 4724, in get_scheduler_fix
    return schedule_func(
TypeError: get_cosine_schedule_with_warmup() got an unexpected keyword argument 'num_decay_steps'
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "P:\AI\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['P:\\AI\\kohya_ss\\venv\\Scripts\\python.exe', 'P:/AI/kohya_ss/sd-scripts/flux_train.py', '--config_file', 'P:/AI/LoRA/config_dreambooth-20240919-190842.toml']' returned non-zero exit status 1.

carolynsoo commented 1 month ago

@carolynsoo So you have the same problem? Any solution so far?

I got around it as I described, but my LoRA is all 0s or some small values (~2e-6), which isn't a total surprise given all those warnings I mentioned earlier. So I gave up on using kohya for Flux for now, since I didn't want to get into the torch weeds.
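
If anyone wants to check whether a saved LoRA actually learned anything, here's a quick sketch (the path is a placeholder, and it assumes the usual kohya `lora_down`/`lora_up` key naming):

```python
from safetensors.torch import load_file

sd = load_file("my_lora.safetensors")  # placeholder path
for name, t in sd.items():
    if "lora_down" in name or "lora_up" in name:  # skip alpha scalars
        print(f"{name}: max |w| = {t.abs().max().item():.3e}")
# If every printed value is ~0 (or ~2e-6), the adapter effectively learned nothing.
```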

carolynsoo commented 1 month ago

@maxanier Yeah, now I'm getting another error message:

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-09-19 19:08:59 INFO     Building CLIP                                                              flux_utils.py:74
                    INFO     Loading state dict from P:/AI/Flux Encoders/clip_l.safetensors            flux_utils.py:167
                    INFO     Loaded CLIP: <All keys matched successfully>                              flux_utils.py:170
                    INFO     Loading state dict from P:/AI/Flux Encoders/t5xxl_fp16.safetensors        flux_utils.py:215
                    INFO     Loaded T5xxl: <All keys matched successfully>                             flux_utils.py:218
                    INFO     Building Flux model dev                                                    flux_utils.py:45
                    INFO     Loading state dict from P:/AI/Flux Encoders/flux1-dev.safetensors          flux_utils.py:52
2024-09-19 19:09:00 INFO     Loaded Flux: <All keys matched successfully>                               flux_utils.py:55
                    INFO     enable block swap: double_blocks_to_swap=0, single_blocks_to_swap=0       flux_train.py:272
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
                    INFO     use 8-bit AdamW optimizer | {}                                           train_util.py:4383
override steps. steps for 2 epochs is / 指定エポックまでのステップ数: 320
Traceback (most recent call last):
  File "P:\AI\kohya_ss\sd-scripts\flux_train.py", line 908, in <module>
    train(args)
  File "P:\AI\kohya_ss\sd-scripts\flux_train.py", line 387, in train
    lr_scheduler = train_util.get_scheduler_fix(args, optimizer, accelerator.num_processes)
  File "P:\AI\kohya_ss\sd-scripts\library\train_util.py", line 4724, in get_scheduler_fix
    return schedule_func(
TypeError: get_cosine_schedule_with_warmup() got an unexpected keyword argument 'num_decay_steps'
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "P:\AI\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "P:\AI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['P:\\AI\\kohya_ss\\venv\\Scripts\\python.exe', 'P:/AI/kohya_ss/sd-scripts/flux_train.py', '--config_file', 'P:/AI/LoRA/config_dreambooth-20240919-190842.toml']' returned non-zero exit status 1.

This is an unrelated error; imo it's due to version mismatches between the different repos/packages. Regardless, you can usually dodge it by editing your config file to remove the conflicting args and/or modifying the scripts to tolerate unexpected kwargs (rough sketch below).
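
For the "tolerate unexpected kwargs" route, one generic way is to filter keyword arguments against the target's signature before calling it. This is a hypothetical helper, not something that exists in sd-scripts, and the scheduler arguments in the usage comment are only illustrative:

```python
import inspect

def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, silently dropping keyword arguments it does not accept.

    Useful when a newer config emits options (e.g. num_decay_steps) that an
    older scheduler function such as get_cosine_schedule_with_warmup rejects.
    """
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)  # func already accepts arbitrary **kwargs
    supported = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **supported)

# Illustrative usage (values made up):
# lr_scheduler = call_with_supported_kwargs(
#     get_cosine_schedule_with_warmup, optimizer,
#     num_warmup_steps=0, num_training_steps=320, num_decay_steps=160,
# )
```

The cleaner fix is still to keep the GUI and sd-scripts on matching versions, or to drop whichever scheduler option in the TOML ends up passing `num_decay_steps`.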

AndreR commented 1 month ago

@carolynsoo I finally got it to work with the 16-bit model, but it's super slow since I have to use split mode. Do you know of any reliable alternative software that works with Flux?

pAInCREAT0R commented 1 month ago

Has anyone developed a better understanding of this issue? I was experimenting with training Flux UNets in Dreambooth today. If I set the model to train Flux Dev.1, it worked with no issues (well, still having result issues, but no errors). I also tried it on a random Flux UNet from Civitai, and it worked fine. However, when I try it on a custom UNet of mine (Dev merged with some LoRAs), I get this error. My custom UNet works just like Flux Dev everywhere else, so I can't figure out why it causes this error. Any ideas? Also, if you have a solid config JSON for a full Flux checkpoint finetune (not behind a paywall), I would love a copy of it. Not a LoRA script - I have that working fine.

b-fission commented 1 month ago

Has anyone developed a better understanding of this issue? I was experimenting with training Flux UNets in Dreambooth today. If I set the model to train Flux Dev.1, it worked with no issues (well, still having result issues, but no errors). I also tried it on a random Flux UNet from Civitai, and it worked fine. However, when I try it on a custom UNet of mine (Dev merged with some LoRAs), I get this error. My custom UNet works just like Flux Dev everywhere else, so I can't figure out why it causes this error. Any ideas?

Merged Flux models that bundle components like the UNet + VAE + text encoders will often have their key names adjusted to identify which weights belong to which component.

When I do a training run with that kind of merged model, there's a very long error like this:

_IncompatibleKeys(missing_keys=['img_in.weight', 'img_in.bias',
    'time_in.in_layer.weight', 'time_in.in_layer.bias', 'time_in.out_layer.weight',                         
    'time_in.out_layer.bias', 'vector_in.in_layer.weight', 'vector_in.in_layer.bias', 
    'vector_in.out_layer.weight', 'vector_in.out_layer.bias',
    'guidance_in.in_layer.weight', 'guidance_in.in_layer.bias',
    'guidance_in.out_layer.weight', 'guidance_in.out_layer.bias', 'txt_in.weight',
    'txt_in.bias', 'double_blocks.0.img_mod.lin.weight',
    'double_blocks.0.img_mod.lin.bias', 'double_blocks.0.img_attn.qkv.weight',
    'double_blocks.0.img_attn.qkv.bias', 'double_blocks.0.img_attn.norm.query_norm.scale',
    'double_blocks.0.img_attn.norm.key_norm.scale', 'double_blocks.0.img_attn.proj.weight',
   ....... 
],
unexpected_keys=['model.diffusion_model.double_blocks.0.img_attn.norm.key_norm.scale',                  
    'model.diffusion_model.double_blocks.0.img_attn.norm.query_norm.scale',                                 
    'model.diffusion_model.double_blocks.0.img_attn.proj.bias',                                             
    'model.diffusion_model.double_blocks.0.img_attn.proj.weight',                                           
    'model.diffusion_model.double_blocks.0.img_attn.qkv.bias',                         
   ....... 
]

It says "missing_keys" on elements like double_blocks.0.img_attn....... however they are present with the prefix "model.diffusion_model." which shows a key name of model.diffusion_model.double_blocks.0.img_attn...... for example. So I think the issue here is the training script currently skips loading those Unet weights if they're identified with those prefixed key names, hence the "no data" error.