d8ahazard / sd_dreambooth_extension

[Bug]: Exception training model: 'Cannot copy out of meta tensor; no data!'. #1453

Closed joneschunghk closed 6 months ago

joneschunghk commented 8 months ago

What happened?

An error occurs while saving the interval model.

Steps to reproduce the problem

  1. I downgraded diffusers to 0.25.0 because LoRA doesn't support diffusers >=0.26.0.
  2. I upgraded torch to 2.2.0+cu118 and xformers to 0.0.24+cu118 because xformers 0.0.20 is outdated.
  3. I trained a checkpoint with LoRA enabled.
  4. An error occurred while saving the interval model.
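
Since the steps above depend on specific package versions, it may help to confirm what is actually installed in the webui venv before training. The following is only a minimal sketch using the standard library; `installed_versions` is a hypothetical helper name and is not part of the extension.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


# Packages mentioned in this report; run this with the venv's Python interpreter.
print(installed_versions(["torch", "diffusers", "xformers", "accelerate"]))
```

Run it with the interpreter from the venv (not the system Python) so the result reflects what the extension actually imports.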

Commit and libraries

Initializing Dreambooth
Dreambooth revision: 71c3465b6c866050b147c58e2caf41984df2cf45
Checking xformers...
Checking bitsandbytes...
Checking bitsandbytes (Windows)
Virtual environment path: D:\AI\Stable Diffusion\stable-diffusion-webui\venv
Checking for D:\AI\Stable Diffusion\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda111.dll
Found windows BNB DLL D:\AI\Stable Diffusion\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda111.dll
Checking Dreambooth requirements...
Installed version of accelerate: 0.21.0
[Dreambooth] accelerate v0.21.0 is already installed.
Installed version of dadaptation: 3.2
[Dreambooth] dadaptation v3.2 is already installed.
Installed version of diffusers: 0.25.0
[Dreambooth] diffusers v0.25.0 is already installed.
Installed version of discord-webhook: 1.3.0
[Dreambooth] discord-webhook v1.3.0 is already installed.
Installed version of fastapi: 0.94.0
[Dreambooth] fastapi is already installed.
Installed version of gitpython: 3.1.32
[Dreambooth] gitpython v3.1.40 is not installed.
Successfully installed gitpython-3.1.41

Installed version of pytorch_optimizer: 2.12.0
[Dreambooth] pytorch_optimizer v2.12.0 is already installed.
Installed version of Pillow: 9.5.0
[Dreambooth] Pillow is already installed.
Installed version of tqdm: 4.66.1
[Dreambooth] tqdm is already installed.
Installed version of tomesd: 0.1.3
[Dreambooth] tomesd v0.1.2 is already installed.
Installed version of tensorboard: 2.13.0
[Dreambooth] tensorboard v2.13.0 is already installed.
[+] torch version 2.2.0+cu118 installed.
[+] torchvision version 0.17.0+cu118 installed.
[+] accelerate version 0.21.0 installed.
[+] diffusers version 0.25.0 installed.
[+] bitsandbytes version 0.41.2.post2 installed.
[+] xformers version 0.0.24+cu118 installed.

Command Line Arguments

--xformers --medvram-sdxl --no-half-vae --autolaunch

Console logs

Total images / batch: 40, total examples: 40███████████████████████████████████████████| 40/40 [00:24<00:00,  1.99it/s]
                  Initializing bucket counter!
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:00<00:00, 51.46it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:24<00:00,  3.50s/it]
Saving Lora Weights...:   0%|                                                                    | 0/1 [00:00<?, ?it/s]Model name: Turbo_v1.050%|████████████████████████████████                                | 2/4 [02:46<02:52, 86.21s/it]
Saving D:\AI\Stable Diffusion\stable-diffusion-webui\models\dreambooth\Turbo_v1.0\logging\loss_plot_18.png
Saving D:\AI\Stable Diffusion\stable-diffusion-webui\models\dreambooth\Turbo_v1.0\logging\ram_plot_18.png
Cleanup log parse.
Steps:  10%|███▋                                 | 400/4000 [31:06<3:26:03,  3.43s/it, loss=0.309, lr=0.0001, vram=6.7]Traceback (most recent call last):                                                                | 0/4 [00:00<?, ?it/s]
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\ui_functions.py", line 735, in start_training
    result = main(class_gen_method=class_gen_method)
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1976, in main
    return inner_loop()
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 126, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1933, in inner_loop
    check_save(True)
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1084, in check_save
    save_weights(
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1142, in save_weights
    vae=vae.to(accelerator.device),
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1152, in to
    return self._apply(convert)
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 802, in _apply
    module._apply(fn)
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 802, in _apply
    module._apply(fn)
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 825, in _apply
    param_applied = fn(param)
  File "D:\AI\Stable Diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
Steps:  10%|███▋                                 | 400/4000 [31:07<4:40:08,  4.67s/it, loss=0.309, lr=0.0001, vram=6.7]
Duration: 00:32:11
Saving weights/samples...:   0%|                                                                 | 0/4 [00:01<?, ?it/s]
Duration: 00:32:18
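
For context on the traceback: the failure is raised inside `vae.to(accelerator.device)`, and PyTorch raises this `NotImplementedError` when a parameter lives on the `meta` device, which holds only shape and dtype and no actual data. Why the VAE has meta parameters at save time is not established here. The sketch below only shows one way to detect that condition before calling `.to()`; `find_meta_parameters` and `demo` are hypothetical names, not functions from the extension.

```python
def find_meta_parameters(module):
    """Return the names of parameters that are on the meta device (no data)."""
    # Works with any object exposing named_parameters(), e.g. a torch.nn.Module.
    return [
        name
        for name, param in module.named_parameters()
        if getattr(param, "is_meta", False)
    ]


def demo():
    """Reproduce the error on a tiny stand-in module (requires PyTorch)."""
    # Imported lazily so the helper above can be used without importing torch here.
    from torch import nn

    # Stand-in for the VAE: a layer whose weights were created without storage.
    layer = nn.Linear(4, 4, device="meta")
    print("meta parameters:", find_meta_parameters(layer))
    try:
        layer.to("cpu")  # same kind of call as vae.to(accelerator.device)
    except NotImplementedError as exc:
        print("raised:", exc)
```

Calling `demo()` in an environment with PyTorch installed should trigger the same exception type as in the log above. A check like `find_meta_parameters(vae)` just before the save step would tell whether the VAE weights were actually loaded at that point.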

Additional information

No response

joneschunghk commented 8 months ago

After some testing:

  1. Training epochs=20, Save model epochs=10, Save preview epochs=5: the preview was generated successfully in epoch 5, and the error occurred in epoch 10.
  2. Training epochs=20, Save model epochs=10, Save preview epochs=0: the preview and model were generated successfully in epoch 10 without errors.

So I guess the error occurs when the model and the preview are saved in the same epoch. I'm now testing a larger number of training epochs without previews and waiting for the results.
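
To make that guess concrete, here is a small sketch that lists the epochs where both saves would fall together for a given configuration. It assumes each save fires every N epochs and that an interval of 0 means no separate schedule; the extension's real scheduling logic may differ, and `coinciding_epochs` is a hypothetical helper.

```python
def coinciding_epochs(total_epochs, save_model_every, save_preview_every):
    """Return the epochs at which both a model save and a preview save are due."""
    # Assumption: an interval of 0 (or less) disables that schedule.
    if save_model_every <= 0 or save_preview_every <= 0:
        return []
    return [
        epoch
        for epoch in range(1, total_epochs + 1)
        if epoch % save_model_every == 0 and epoch % save_preview_every == 0
    ]


# The two configurations tested above.
print(coinciding_epochs(20, 10, 5))  # [10, 20]
print(coinciding_epochs(20, 10, 0))  # []
```

Under these assumptions the first configuration has both saves landing on epoch 10, which is where the error appeared, while the second has no overlap.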

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 14 days with no activity. Remove stale label or comment or this will be closed in 30 days