Error when running a training session.

iqddd commented 1 year ago

Trying to resume network training based on v2.1 768px. Almost immediately I get an error.

Resuming Training...
Training the UNet...
'########:'########:::::'###::::'####:'##::: ##:'####:'##::: ##::'######:::
... ##..:: ##.... ##:::'## ##:::. ##:: ###:: ##:. ##:: ###:: ##:'##... ##::
::: ##:::: ##:::: ##::'##:. ##::: ##:: ####: ##:: ##:: ####: ##: ##:::..:::
::: ##:::: ########::'##:::. ##:: ##:: ## ## ##:: ##:: ## ## ##: ##::'####:
::: ##:::: ##.. ##::: #########:: ##:: ##. ####:: ##:: ##. ####: ##::: ##::
::: ##:::: ##::. ##:: ##.... ##:: ##:: ##:. ###:: ##:: ##:. ###: ##::: ##::
::: ##:::: ##:::. ##: ##:::: ##:'####: ##::. ##:'####: ##::. ##:. ######:::
:::..:::::..:::::..::..:::::..::....::..::::..::....::..::::..:::......::::

2023-02-19 10:19:49.512117: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-19 10:19:53.707580: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-19 10:19:53.708294: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-19 10:19:53.708348: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
  0% 0/3000 [00:00<?, ?it/s] JrCr   JrCr  Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 789, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 676, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_condition.py", line 339, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 637, in forward
    hidden_states = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 630, in custom_forward
    return module(*inputs, return_dict=return_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 213, in forward
    hidden_states = self.proj_in(hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 4D
  0% 0/3000 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Update: there are no errors when training on v1.5.

TheLastBen commented 1 year ago

did you install any dependencies during the session?

iqddd commented 1 year ago

Only those in the "Dependencies" cell. Followed the usual procedure. Sequential startup:

Mounting GDrive.
Dependencies cell.
Create/Load a Session cell.
Start DreamBooth cell.

TheLastBen commented 1 year ago

what model? default or a custom one?

iqddd commented 1 year ago

Based on default SD2.1 768px.

TheLastBen commented 1 year ago

try restarting the session, and use the latest colab

iqddd commented 1 year ago

What do you mean by "use the latest colab"? https://colab.research.google.com/github/TheLastBen/fast-stable-diffusion/blob/main/fast-DreamBooth.ipynb I think the Colab at this link is always the latest, isn't it?

TheLastBen commented 1 year ago

in the latest colab, the tensorflow msg doesn't show

iqddd commented 1 year ago

How do I switch to the latest Colab?

TheLastBen commented 1 year ago

the link above is correct

ygdeyan commented 1 year ago

I also ran into an error using the latest Colab (the link above) today. Not seeing the tensorflow msg so I guess it's another issue?


  0% 0/1000 [00:00<?, ?it/s] tshirt tshirt Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 789, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 676, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_condition.py", line 339, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 637, in forward
    hidden_states = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 630, in custom_forward
    return module(*inputs, return_dict=return_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 213, in forward
    hidden_states = self.proj_in(hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 4D
  0% 0/1000 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_only_unet', '--save_starting_step=325', '--save_n_steps=325', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/lapitadress', '--pretrained_model_name_or_path=/content/stable-diffusion-custom', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/lapitadress/instance_images', '--output_dir=/content/models/lapitadress', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/lapitadress/captions', '--instance_prompt=', '--seed=247655', '--resolution=768', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=2e-06', '--lr_scheduler=linear', '--lr_warmup_steps=0', '--max_train_steps=1000']' returned non-zero exit status 1.
Something went wrong

iqddd commented 1 year ago

Seems like resuming training for models based on SD2.1-768px is broken. Resuming training for SD2.1-512px and SD1.5 works fine.
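For anyone hitting the same traceback: `F.linear` transposes the layer's weight, so "t() expects a tensor with <= 2 dimensions, but self is 4D" suggests that `proj_in` is a Linear layer holding a 4D (convolution-shaped) weight. That reading is an assumption based on the traceback only; the thread does not confirm the root cause. Below is a minimal sketch of a check that lists such weights from a name-to-shape mapping. The function name and the example shapes are hypothetical, chosen for illustration.

```python
from typing import Dict, List, Tuple


def find_conv_shaped_projections(shapes: Dict[str, Tuple[int, ...]]) -> List[str]:
    """Return the names of proj_in/proj_out weights that have 4 dimensions.

    A Linear layer needs a 2D weight; a 4D weight under one of these names
    would trigger the t() error seen in the traceback (assumed cause).
    """
    suspects = []
    for name, shape in shapes.items():
        is_projection = name.endswith(("proj_in.weight", "proj_out.weight"))
        if is_projection and len(shape) == 4:
            suspects.append(name)
    return sorted(suspects)


# Hypothetical shapes for illustration only, not read from a real checkpoint.
example_shapes = {
    "down_blocks.0.attentions.0.proj_in.weight": (320, 320, 1, 1),
    "down_blocks.0.attentions.0.proj_out.weight": (320, 320),
    "conv_in.weight": (320, 4, 3, 3),  # a real convolution, ignored by the check
}

print(find_conv_shaped_projections(example_shapes))
```

With a loaded model, the mapping could be built as `{k: tuple(v.shape) for k, v in unet.state_dict().items()}`, where `unet` is whatever UNet module the training script has loaded.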

TheLastBen commented 1 year ago

resuming the training, or resuming the session and training after disconnecting?

iqddd commented 1 year ago

Resuming the training (run "Start DreamBooth" cell with "Resume training" checkbox selected)

TheLastBen commented 1 year ago

I'll check it out

TheLastBen commented 1 year ago

fixed

PuliDalun commented 1 year ago

Training

Training the UNet...
Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 789, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 436, in main
    accelerator = Accelerator(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 286, in __init__
    raise ValueError(err.format(mode="fp16", requirement="a GPU"))
ValueError: fp16 mixed precision requires a GPU
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_only_unet', '--save_starting_step=500', '--save_n_steps=0', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330/instance_images', '--output_dir=/content/models/PuliDADA02241330', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330/captions', '--instance_prompt=', '--seed=959221', '--resolution=512', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=5e-06', '--lr_scheduler=linear', '--lr_warmup_steps=0', '--max_train_steps=1500']' returned non-zero exit status 1.
Something went wrong

TheLastBen commented 1 year ago

make sure you set your session to GPU
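The `ValueError: fp16 mixed precision requires a GPU` above is raised because the run was launched with `--mixed_precision=fp16` on a runtime without a GPU. A minimal sketch of a pre-flight check is shown here; the helper names are hypothetical and not part of the notebook, and `torch.cuda.is_available()` is only consulted when PyTorch is installed.

```python
def gpu_available() -> bool:
    """Report whether PyTorch can see a CUDA device (False if torch is missing)."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_mixed_precision(has_gpu: bool) -> str:
    """Choose a value for --mixed_precision: fp16 needs a GPU, otherwise "no"."""
    return "fp16" if has_gpu else "no"


# Build the flag before launching, so a CPU-only runtime is caught early.
flag = f"--mixed_precision={pick_mixed_precision(gpu_available())}"
print(flag)
```

If this prints `--mixed_precision=no` in Colab, the runtime type is probably not set to GPU. Switching the runtime, as suggested above, is the practical fix; falling back to full precision on CPU is shown only to illustrate the check.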

TheLastBen / fast-stable-diffusion

Error when running a training session. #1607