Open ironninja33 opened 2 years ago
Also getting the same error with CUDA 11.6 and Python 3.9 on bare metal.
I narrowed it down: when I pass the argument --train_text_encoder, I get the error above. Otherwise, when I remove that argument, I can train models.
Sorry, DeepSpeed support can have some problems. I won't be able to fix this right now.
Describe the bug
Running the instructions for 8GB VRAM, I get the following error:
Traceback (most recent call last):
File "train_dreambooth.py", line 765, in <module>
main()
File "train_dreambooth.py", line 597, in main
unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 619, in prepare
result = self._prepare_deepspeed(*args)
File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 805, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 124, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 327, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1150, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1401, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 530, in __init__
self._param_slice_mappings = self._create_param_mapping()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 542, in _create_param_mapping
lp_name = self.param_names[lp]
KeyError: Parameter containing:
tensor([[[[-0.0277, 0.0744, 0.0869],
[-0.0260, -0.1979, 0.1300],
[-0.0211, -0.0179, 0.0277]],
...
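One possible reading of this traceback, consistent with the --train_text_encoder observation in the comments, is that the name table used in _create_param_mapping only covers one model while the optimizer holds parameters from both the UNet and the text encoder. That is an assumption, not something confirmed in this thread. The sketch below reproduces only the lookup pattern in plain Python; FakeParam, FakeModel, build_param_names and create_param_mapping are hypothetical stand-ins, not torch or DeepSpeed APIs.

```python
# Hypothetical stand-ins: these mimic only the dictionary lookup seen in the
# traceback, not the real torch or DeepSpeed classes.
class FakeParam:
    """Stand-in for a parameter tensor; hashed by object identity."""

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        return f"FakeParam({self.label!r})"


class FakeModel:
    """Stand-in for a module exposing named_parameters() and parameters()."""

    def __init__(self, prefix, names):
        self._params = {name: FakeParam(f"{prefix}.{name}") for name in names}

    def named_parameters(self):
        return list(self._params.items())

    def parameters(self):
        return list(self._params.values())


def build_param_names(model):
    # Assumption: the name table is built from a single model only.
    return {param: name for name, param in model.named_parameters()}


def create_param_mapping(param_names, optimizer_params):
    # Same shape as the failing line: lp_name = self.param_names[lp]
    return [param_names[lp] for lp in optimizer_params]


unet = FakeModel("unet", ["conv_in.weight"])
text_encoder = FakeModel("text_encoder", ["embeddings.weight"])

# The optimizer covers both models, as it would with --train_text_encoder.
optimizer_params = unet.parameters() + text_encoder.parameters()

# Only one model's parameters are registered by name.
param_names = build_param_names(text_encoder)

try:
    create_param_mapping(param_names, optimizer_params)
    failed = False
    missing = None
except KeyError as err:
    failed = True
    missing = err.args[0]  # the parameter that was never registered
    print("KeyError for:", missing)
```

If this reading is right, training a single model keeps every optimizer parameter in the table, which would match the report that removing --train_text_encoder avoids the error.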
Reproduction
Here is how I am launching train_dreambooth.py:
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --output_dir=$OUTPUT_DIR \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --resolution=512 \
  --train_batch_size=1 \
  --train_text_encoder \
  --mixed_precision="fp16" \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --sample_batch_size=1 \
  --max_train_steps=800 \
  --save_interval=400 \
  --class_prompt="a photo of a person" \
  --instance_prompt="a photo of sks person"
Logs
No response
System Info
diffusers version: 0.7.0.dev0
I am running this inside the latest version of nvidia-docker, 22.09.