ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

Dreambooth runs, but doesn't train model or generate class images #208

Open AIRenaissance opened 1 year ago

AIRenaissance commented 1 year ago

Describe the bug

Diffusers runs, but it does not train a model based on my settings, and I don't know why.

It also does not generate any class images, so it seems it doesn't even train the existing model.

I also tried different models; none of them worked.

What could it be? What should I change or try?

Reproduction

This is my train.sh file:

export MODEL_NAME="dreamlike-art/dreamlike-diffusion-1.0"
export INSTANCE_DIR="training"
export CLASS_DIR="classes"
export OUTPUT_DIR="output"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="photo of yface1 person" \
  --class_prompt="photo of a person" \
  --resolution=512 \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800
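
One thing worth checking in a command like this: in the upstream train_dreambooth.py example, class images are generated only when --with_prior_preservation is passed, and that flag does not appear above. Assuming the script version in use behaves the same way, a small sketch like the following can report which expected flags are absent from a launch command. The helper name missing_flags and the list of required flags are illustrative assumptions, not part of the training script.

```python
import shlex
from typing import List


def missing_flags(command: str, required: List[str]) -> List[str]:
    """Return the flags from `required` that do not appear in `command`.

    `missing_flags` is a hypothetical helper for illustration only.
    """
    # Join shell line continuations so the command parses as one line.
    single_line = command.replace("\\\n", " ")
    tokens = shlex.split(single_line)
    # Keep only the flag name, dropping any "=value" suffix.
    present = {tok.split("=", 1)[0] for tok in tokens if tok.startswith("--")}
    return [flag for flag in required if flag not in present]


if __name__ == "__main__":
    cmd = (
        'accelerate launch train_dreambooth.py '
        '--class_prompt="photo of a person" --num_class_images=200'
    )
    # Assumed flag list: prior preservation plus the class-image count.
    print(missing_flags(cmd, ["--with_prior_preservation", "--num_class_images"]))
```

If the flag is reported as missing, adding it to the launch command would be the first thing to try before looking at library versions.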

Logs

(diffusers) babooz@DESKTOP-6IT4DVD:~/github/diffusers/examples/dreambooth$ ./my_training_2.sh
/home/babooz/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py:231: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
Downloading (…)tokenizer/vocab.json: 100%|█████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.18MB/s]
Downloading (…)tokenizer/merges.txt: 100%|████████████████████████████████████████████| 525k/525k [00:00<00:00, 896kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████████| 472/472 [00:00<00:00, 105kB/s]
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████| 806/806 [00:00<00:00, 681kB/s]
Downloading (…)_encoder/config.json: 100%|██████████████████████████████████████████████| 592/592 [00:00<00:00, 122kB/s]
Downloading (…)"pytorch_model.bin";: 100%|███████████████████████████████████████████| 492M/492M [04:04<00:00, 2.02MB/s]
Downloading (…)_pytorch_model.bin";: 100%|███████████████████████████████████████████| 335M/335M [02:45<00:00, 2.02MB/s]
Downloading (…)main/vae/config.json: 100%|██████████████████████████████████████████████| 522/522 [00:00<00:00, 182kB/s]
Downloading (…)_pytorch_model.bin";: 100%|█████████████████████████████████████████| 3.44G/3.44G [36:44<00:00, 1.56MB/s]
Downloading (…)ain/unet/config.json: 100%|██████████████████████████████████████████████| 743/743 [00:00<00:00, 336kB/s]
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/babooz/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/babooz/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Downloading (…)cheduler_config.json: 100%|█████████████████████████████████████████████| 313/313 [00:00<00:00, 8.45kB/s]
Caching latents: 100%|██████████████████████████████████████████████████████████████████| 11/11 [04:14<00:00, 23.11s/it]
02/16/2023 14:25:39 - INFO - __main__ - ***** Running training *****
02/16/2023 14:25:39 - INFO - __main__ -   Num examples = 11
02/16/2023 14:25:39 - INFO - __main__ -   Num batches each epoch = 11
02/16/2023 14:25:39 - INFO - __main__ -   Num Epochs = 73
02/16/2023 14:25:39 - INFO - __main__ -   Instantaneous batch size per device = 1
02/16/2023 14:25:39 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
02/16/2023 14:25:39 - INFO - __main__ -   Gradient Accumulation steps = 1
02/16/2023 14:25:39 - INFO - __main__ -   Total optimization steps = 800
Downloading (…)ain/model_index.json: 100%|██████████████████████████████████████████████| 543/543 [00:00<00:00, 103kB/s]
Downloading (…)nfig-checkpoint.json: 100%|█████████████████████████████████████████████| 209/209 [00:00<00:00, 12.0kB/s]
Downloading (…)rocessor_config.json: 100%|█████████████████████████████████████████████| 342/342 [00:00<00:00, 46.6kB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████| 472/472 [00:00<00:00, 66.8kB/s]
Downloading (…)_encoder/config.json: 100%|█████████████████████████████████████████████| 592/592 [00:00<00:00, 72.3kB/s]
Downloading (…)_checker/config.json: 100%|██████████████████████████████████████████| 4.56k/4.56k [00:00<00:00, 268kB/s]
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████| 806/806 [00:00<00:00, 183kB/s]
Downloading (…)tokenizer/merges.txt: 100%|████████████████████████████████████████████| 525k/525k [00:01<00:00, 362kB/s]
Downloading (…)tokenizer/vocab.json: 100%|██████████████████████████████████████████| 1.06M/1.06M [00:01<00:00, 658kB/s]
Downloading (…)"pytorch_model.bin";: 100%|████████████████████████████████████████████| 492M/492M [08:29<00:00, 967kB/s]
Downloading (…)"pytorch_model.bin";: 100%|█████████████████████████████████████████| 1.22G/1.22G [16:19<00:00, 1.24MB/s]
Fetching 16 files: 100%|████████████████████████████████████████████████████████████████| 16/16 [16:20<00:00, 61.26s/it]
/home/babooz/anaconda3/envs/diffusers/lib/python3.9/site-packages/transformers/models/clip/feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
[*] Weights saved at output/800
Steps: 100%|█████████████████████████████████████████████████████| 800/800 [19:10<00:00,  1.44s/it, loss=0.149, lr=5e-6]
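
The figures in this log are consistent with one another. Assuming the script derives the epoch count as the ceiling of total optimization steps over update steps per epoch (which is how the upstream example computes it), 800 steps over 11 batches per epoch gives 73 epochs. A minimal sketch of that arithmetic; the function name num_epochs is hypothetical:

```python
import math


def num_epochs(max_train_steps: int, batches_per_epoch: int, grad_accum_steps: int = 1) -> int:
    """Epochs needed to reach `max_train_steps` optimizer updates (illustrative helper)."""
    # Optimizer updates per epoch: batches grouped by gradient accumulation.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return math.ceil(max_train_steps / updates_per_epoch)


if __name__ == "__main__":
    # Values taken from the log: 800 steps, 11 batches per epoch, accumulation of 1.
    print(num_epochs(800, 11, 1))  # 73
```

In other words, the run did step through all 800 updates on the 11 cached examples, so the problem is more likely in what was trained than in whether the loop ran.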

System Info

AIRenaissance commented 1 year ago

I ran "pip install bitsandbytes-cuda117" to match my GPU's CUDA version, but the model still did not train.

rajbala commented 1 year ago

I am experiencing the same thing on both this fork and the huggingface/diffusers repo.

kasukanra commented 1 year ago

I also have this issue. The training runs flawlessly...but nothing actually happens. I set up this repository on a new machine last week in WSL2 Ubuntu 20.04, so it may have been a recent change.

rajbala commented 1 year ago

@kudou-reira Do you know the commit of the last working version?

kasukanra commented 1 year ago

main 47f456e [origin/main] Update to script for ckpt conversion of 2.0 models (#169)

is the commit that works on a different machine. I did a rollback on my newer machine to that commit and am currently running a training. We'll see if it works.

flixmk commented 1 year ago

I also had this issue. It works perfectly fine on Colab but not on my machine. Colab used CUDA 11.6, while I am using CUDA 11.7 as well. It appears that your PyTorch version in combination with your CUDA version does not work with xformers.

An explanation is given here: https://github.com/facebookresearch/xformers/issues/631#issuecomment-1414421325

What worked for me with the same PyTorch and CUDA versions:

  1. pip uninstall xformers
  2. pip install xformers==0.0.17.dev447
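
To compare a working setup (such as Colab) with a failing one, it can help to print the installed package versions side by side. A minimal sketch using only the standard library; the function name installed_versions and the package list are assumptions for illustration. On typical pip installs the PyTorch version string carries the CUDA build as a suffix, as in the 1.13.1+cu117 quoted in this thread.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages relevant to this thread.
    names = ["torch", "xformers", "bitsandbytes", "diffusers", "accelerate"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both environments and diffing the output shows at a glance whether the PyTorch, CUDA build, and xformers versions differ.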

rajbala commented 1 year ago

In my case I am not using xformers so there's something else going on.

kasukanra commented 1 year ago

I reinstalled a newer xformers, but my trained model is still giving incorrect results. Before, the fine-tuning ran but had no effect on the model. Now the fine-tuned subject bleeds into the entire model pretty badly, even though I have a sufficient number of regularization images.

rajbala commented 1 year ago

@flixmk I have CUDA 11.7 installed on the host and am using PyTorch 1.13.1+cu117. Does that match what you are using?

I also installed xformers and am using it in my script, but it does not make a difference.