TheLastBen / fast-stable-diffusion

fast-stable-diffusion + DreamBooth
MIT License

Dreambooth issue - unable to launch training sequence without failure #472

Open archimedesinstitute opened 1 year ago

archimedesinstitute commented 1 year ago

Hello!

I have been using your dreambooth colab for a couple weeks, and it has always worked incredibly well.

I launched a new training set last night. Although I followed the steps exactly as I always have, and used instance images I had used before on another version of the finetune I have been working on, I consistently get the following error right after the notebook tries to pull the instance images from the folder they were uploaded to:

--max_train_steps=4000']' returned non-zero exit status 1. Something went wrong

Please let me know if there is something I have overlooked. I confirmed the images were uploaded correctly; I can see them in the Instances folder.

archimedesinstitute commented 1 year ago

I've gone ahead and reinstalled everything, so we'll see how that goes!

TheLastBen commented 1 year ago

copy the whole error log

archimedesinstitute commented 1 year ago

Training the unet...
Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 785, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 478, in main
    text_encoder = CLIPTextModel.from_pretrained(args.output_dir, subfolder="text_encoder")
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py", line 1977, in from_pretrained
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/clip/configuration_clip.py", line 133, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py", line 558, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py", line 625, in _get_config_dict
    _commit_hash=commit_hash,
  File "/usr/local/lib/python3.7/dist-packages/transformers/utils/hub.py", line 381, in cached_file
    f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout "
OSError: /content/models/Fortuna_Lofi does not appear to have a file named text_encoder/config.json. Checkout 'https://huggingface.co//content/models/Fortuna_Lofi/None' for available files.
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_only_unet', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/Fortuna_Lofi', '--save_starting_step=500', '--save_n_steps=0', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/Fortuna_Lofi/instance_images', '--output_dir=/content/models/Fortuna_Lofi', '--instance_prompt=', '--seed=422995', '--resolution=1024', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=2e-6', '--lr_scheduler=polynomial', '--lr_warmup_steps=0', '--max_train_steps=4600']' returned non-zero exit status 1.
Something went wrong

TheLastBen commented 1 year ago

"/content/models/Fortuna_Lofi does not appear to have a file named text_encoder/config.json"

make sure there is a folder named "Fortuna_Lofi" in "/content/models/"

archimedesinstitute commented 1 year ago

In the sd folder rather than the fast-dreambooth, I assume? fast-dreambooth only has a Sessions folder, which leads to the individual named session folders.

archimedesinstitute commented 1 year ago

or I should say fast-dreambooth>sessions>[folder name] is how it currently installs automatically.

archimedesinstitute commented 1 year ago

[Image: Screenshot 2022-11-12 151056]

This is how it's automatically setting up.

TheLastBen commented 1 year ago

you can't resume training, there is no model in the output folder
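One way to check this before launching is to test whether the output folder contains the file the traceback names. This is only a minimal sketch: the helper name `has_text_encoder` is hypothetical and not part of the notebook, and the path is the one from the error message.

```python
from pathlib import Path


def has_text_encoder(output_dir: str) -> bool:
    """Return True if output_dir contains text_encoder/config.json.

    Hypothetical helper for illustration; the notebook does not define it.
    """
    return (Path(output_dir) / "text_encoder" / "config.json").is_file()


# Path taken from the OSError in the log above.
print(has_text_encoder("/content/models/Fortuna_Lofi"))
```

If this prints False, `CLIPTextModel.from_pretrained(args.output_dir, subfolder="text_encoder")` has nothing to load from that folder, which matches the OSError.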

archimedesinstitute commented 1 year ago

I am not trying to resume training, I set up a new model.

archimedesinstitute commented 1 year ago

I just ran a backup version of the fast-dreambooth notebook that I saved before yesterday, and it runs fine. So it seems that, if there has been a recent update, it may no longer be setting things up correctly automatically.

archimedesinstitute commented 1 year ago

This still seems to be an issue. Are there any other fixes I can apply?

TheLastBen commented 1 year ago

I think it's fixed now

archimedesinstitute commented 1 year ago

Unfortunately, when I tried it on a new model, I got the same result. Let me know if it's something I'm doing:

  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 787, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 656, in main
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_2d_condition.py", line 314, in forward
    upsample_size=upsample_size,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_blocks.py", line 1150, in forward
    hidden_states = resnet(hidden_states, temb)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/resnet.py", line 359, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/normalization.py", line 273, in forward
    input, self.num_groups, self.weight, self.bias, self.eps)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2516, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 14.76 GiB total capacity; 13.36 GiB already allocated; 11.75 MiB free; 13.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Progress:| | 0% 1/4000 [00:15<17:10:12, 15.46s/it, loss=0.00427, lr=2e-6]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_text_encoder', '--save_starting_step=500', '--stop_text_encoder_training=1000', '--save_n_steps=500', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/mint_lofi_style', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/mint_lofi_style/instance_images', '--output_dir=/content/models/mint_lofi_style', '--instance_prompt=', '--seed=996478', '--resolution=1024', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=2e-6', '--lr_scheduler=polynomial', '--lr_warmup_steps=0', '--max_train_steps=4000']' returned non-zero exit status 1.
Something went wrong
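For reading the numbers in that out-of-memory message, here is a rough sketch that pulls out the allocated and reserved figures so they can be compared. It assumes the exact message wording shown in the log; the function name `parse_oom` and the regular expressions are illustrative only and are not a PyTorch API.

```python
import re

# Message text copied from the RuntimeError in the log.
OOM_MESSAGE = (
    "Tried to allocate 30.00 MiB (GPU 0; 14.76 GiB total capacity; "
    "13.36 GiB already allocated; 11.75 MiB free; "
    "13.41 GiB reserved in total by PyTorch)"
)


def parse_oom(message: str) -> dict:
    """Extract allocated and reserved memory (in GiB) from the message."""
    allocated = re.search(r"([\d.]+) GiB already allocated", message)
    reserved = re.search(r"([\d.]+) GiB reserved", message)
    if allocated is None or reserved is None:
        raise ValueError("unrecognized OOM message format")
    return {
        "allocated_gib": float(allocated.group(1)),
        "reserved_gib": float(reserved.group(1)),
    }


stats = parse_oom(OOM_MESSAGE)
gap_gib = stats["reserved_gib"] - stats["allocated_gib"]
print(f"reserved - allocated = {gap_gib:.2f} GiB")
```

The message only suggests `max_split_size_mb` when reserved memory is much larger than allocated memory. Here the two are nearly equal, so fragmentation is unlikely to be the cause; the run simply needs more memory than the GPU has.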

TheLastBen commented 1 year ago

1024 resolution is too much; use 704 or 768 (and check the "reduce memory usage" box)
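A back-of-the-envelope sketch of why 1024 is so much heavier than 768 or 704. It assumes the Stable Diffusion v1.x VAE, which downsamples each image side by 8, and it treats plain self-attention memory as growing with the square of the number of latent positions. Real usage depends on the attention implementation and other settings, so read these as relative figures only; `latent_positions` is a name made up for this example.

```python
VAE_DOWNSAMPLE = 8  # SD v1.x latents are 1/8 of the image side (assumption stated above)


def latent_positions(resolution: int) -> int:
    """Number of spatial positions in the latent for a square image."""
    side = resolution // VAE_DOWNSAMPLE
    return side * side


base = latent_positions(768)
for res in (704, 768, 1024):
    n = latent_positions(res)
    linear = n / base        # activations that scale with latent area
    quadratic = linear ** 2  # naive self-attention maps
    print(f"{res}: {n} positions, x{linear:.2f} area, x{quadratic:.2f} attention")
```

Going from 768 to 1024 gives roughly 1.78 times the latent area and, for naive attention, roughly 3.16 times the attention memory, which is consistent with the run failing on a GPU with 14.76 GiB.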

archimedesinstitute commented 1 year ago

Ok cool, thanks!