ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

Dreambooth training: KeyError #83

Open · ironninja33 opened this issue 2 years ago

ironninja33 commented 2 years ago

Describe the bug

Running the instructions for 8GB VRAM, I get the following error:

Traceback (most recent call last):
  File "train_dreambooth.py", line 765, in <module>
    main()
  File "train_dreambooth.py", line 597, in main
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 619, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 805, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 124, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 327, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1150, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1401, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 530, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 542, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: Parameter containing:
tensor([[[[-0.0277,  0.0744,  0.0869],
          [-0.0260, -0.1979,  0.1300],
          [-0.0211, -0.0179,  0.0277]],

        ...

         [[-0.0401, -0.0420, -0.0073],
          [ 0.0336,  0.0244,  0.0278],
          [ 0.0489, -0.0019,  0.0122]]]], device='cuda:0', requires_grad=True)
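For context on where the KeyError comes from: DeepSpeed's ZeRO stage 1/2 optimizer builds a parameter-to-name lookup from the single module handed to deepspeed.initialize and then resolves every optimizer parameter against it, so any optimizer parameter that belongs to a different module has no entry. The sketch below is illustrative only and assumes that reading; check_optimizer_params_covered is a hypothetical helper, not part of train_dreambooth.py or DeepSpeed.

import torch

def check_optimizer_params_covered(module: torch.nn.Module,
                                   optimizer: torch.optim.Optimizer) -> None:
    # Mirror DeepSpeed's mapping step: every optimizer parameter must belong
    # to the wrapped module, otherwise the dict lookup raises KeyError, which
    # is the failure shown in the traceback above.
    param_names = {param: name for name, param in module.named_parameters()}
    for group in optimizer.param_groups:
        for param in group["params"]:
            _ = param_names[param]

Under that assumption, the error would only appear when extra parameters (for example from the text encoder) are added to the optimizer while DeepSpeed wraps the UNet alone, which matches the observation later in this thread.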

Reproduction

Here is how I am launching train_dreambooth.py:

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --output_dir=$OUTPUT_DIR \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --resolution=512 \
  --train_batch_size=1 \
  --train_text_encoder \
  --mixed_precision="fp16" \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --sample_batch_size=1 \
  --max_train_steps=800 \
  --save_interval=400 \
  --class_prompt="a photo of a person" \
  --instance_prompt="a photo of sks person"

Logs

No response

System Info

I am running this inside the latest version of nvidia-docker, 22.09.

ironninja33 commented 2 years ago

I am also getting the same error with CUDA 11.6 and Python 3.9 on bare metal.

ironninja33 commented 2 years ago

I narrowed it down: when I pass the --train_text_encoder argument, I get the error above; when I remove that argument, training works.
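A possible direction to experiment with, not a confirmed fix: give DeepSpeed a single module that owns every trainable parameter, for example by wrapping the UNet and text encoder together before calling accelerator.prepare. CombinedModel and prepare_with_deepspeed below are hypothetical names, and the sketch only shows the shape of the change, assuming the KeyError really is caused by text-encoder parameters sitting outside the module DeepSpeed wraps.

import itertools
import torch

class CombinedModel(torch.nn.Module):
    # Hypothetical wrapper so the DeepSpeed engine owns both parameter sets.
    def __init__(self, unet: torch.nn.Module, text_encoder: torch.nn.Module):
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder

def prepare_with_deepspeed(accelerator, unet, text_encoder, optimizer_class,
                           learning_rate, train_dataloader, lr_scheduler):
    # Build one optimizer over both models, then prepare the single wrapper
    # instead of preparing unet and text_encoder separately.
    combined = CombinedModel(unet, text_encoder)
    optimizer = optimizer_class(
        itertools.chain(unet.parameters(), text_encoder.parameters()),
        lr=learning_rate,
    )
    return accelerator.prepare(combined, optimizer, train_dataloader, lr_scheduler)

The training loop would still need to call combined.unet and combined.text_encoder explicitly, so treat this as a sketch to experiment with rather than a drop-in patch.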

ShivamShrirao commented 2 years ago

Sorry, DeepSpeed support can have some problems. I won't be able to fix it right now.