Bug Report: Fine-tuning & DreamBooth crash when using multiple GPUs

2023-04-17 16:36:29.428 | INFO     | __main__:build_data:283 - len(train_dataset): 40
100%|████████████████████████████████████████████████████████████████████████| 40/40 [00:08<00:00,  4.65it/s]
2023-04-17 16:36:41.156 | INFO     | __main__:train:355 - ***** Running training *****
2023-04-17 16:36:41.157 | INFO     | __main__:train:356 -   Num batches each epoch = 2
2023-04-17 16:36:41.157 | INFO     | __main__:train:357 -   Num Steps = 20000
2023-04-17 16:36:41.157 | INFO     | __main__:train:358 -   Instantaneous batch size per device = 1
2023-04-17 16:36:41.157 | INFO     | __main__:train:359 -   Total train batch size (w. parallel, distributed & accumulation) = 5
2023-04-17 16:36:41.157 | INFO     | __main__:train:360 -   Gradient Accumulation steps = 1
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/runpy.py:196 in _run_module_as_main               │
│                                                                                                  │
│   193 │   main_globals = sys.modules["__main__"].__dict__                                        │
│   194 │   if alter_argv:                                                                         │
│   195 │   │   sys.argv[0] = mod_spec.origin                                                      │
│ ❱ 196 │   return _run_code(code, main_globals, None,                                             │
│   197 │   │   │   │   │    "__main__", mod_spec)                                                 │
│   198                                                                                            │
│   199 def run_module(mod_name, init_globals=None,                                                │
│                                                                                                  │
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/runpy.py:86 in _run_code                          │
│                                                                                                  │
│    83 │   │   │   │   │      __loader__ = loader,                                                │
│    84 │   │   │   │   │      __package__ = pkg_name,                                             │
│    85 │   │   │   │   │      __spec__ = mod_spec)                                                │
│ ❱  86 │   exec(code, run_globals)                                                                │
│    87 │   return run_globals                                                                     │
│    88                                                                                            │
│    89 def _run_module_code(code, init_globals=None,                                              │
│                                                                                                  │
│ /home/yabin/HCP-Diffusion/hcpdiff/train_ac.py:532 in <module>                                    │
│                                                                                                  │
│   529 │                                                                                          │
│   530 │   conf = load_config_with_cli(args.cfg, args_list=sys.argv[3:]) # skip --cfg             │
│   531 │   trainer=Trainer(conf)                                                                  │
│ ❱ 532 │   trainer.train()                                                                        │
│   533                                                                                            │
│                                                                                                  │
│ /home/yabin/HCP-Diffusion/hcpdiff/train_ac.py:370 in train                                       │
│                                                                                                  │
│   367 │   │                                                                                      │
│   368 │   │   loss_sum=0                                                                         │
│   369 │   │   for image, att_mask, prompt_ids in cycle_data(self.train_loader, arb=self.arb_is   │
│ ❱ 370 │   │   │   loss=self.train_one_step(image, att_mask, prompt_ids)                          │
│   371 │   │   │   loss_sum+=loss                                                                 │
│   372 │   │   │                                                                                  │
│   373 │   │   │   self.global_step += 1                                                          │
│                                                                                                  │
│ /home/yabin/HCP-Diffusion/hcpdiff/train_ac.py:463 in train_one_step                              │
│                                                                                                  │
│   460 │   │   │   else:                                                                          │
│   461 │   │   │   │   loss = self.get_loss(model_pred, target, att_mask)                         │
│   462 │   │   │                                                                                  │
│ ❱ 463 │   │   │   self.accelerator.backward(loss)                                                │
│   464 │   │   │                                                                                  │
│   465 │   │   │   if hasattr(self, 'optimizer'):                                                 │
│   466 │   │   │   │   if self.accelerator.sync_gradients: # fine-tuning                          │
│                                                                                                  │
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/site-packages/accelerate/accelerator.py:1681 in   │
│ backward                                                                                         │
│                                                                                                  │
│   1678 │   │   elif self.distributed_type == DistributedType.MEGATRON_LM:                        │
│   1679 │   │   │   return                                                                        │
│   1680 │   │   elif self.scaler is not None:                                                     │
│ ❱ 1681 │   │   │   self.scaler.scale(loss).backward(**kwargs)                                    │
│   1682 │   │   else:                                                                             │
│   1683 │   │   │   loss.backward(**kwargs)                                                       │
│   1684                                                                                           │
│                                                                                                  │
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/site-packages/torch/_tensor.py:487 in backward    │
│                                                                                                  │
│    484 │   │   │   │   create_graph=create_graph,                                                │
│    485 │   │   │   │   inputs=inputs,                                                            │
│    486 │   │   │   )                                                                             │
│ ❱  487 │   │   torch.autograd.backward(                                                          │
│    488 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs                     │
│    489 │   │   )                                                                                 │
│    490                                                                                           │
│                                                                                                  │
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in   │
│ backward                                                                                         │
│                                                                                                  │
│   197 │   # The reason we repeat same the comment below is that                                  │
│   198 │   # some Python versions print out the first line of a multi-line function               │
│   199 │   # calls in the traceback and some print out the last line                              │
│ ❱ 200 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   201 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   202 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   203                                                                                            │
│                                                                                                  │
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/site-packages/torch/autograd/function.py:274 in   │
│ apply                                                                                            │
│                                                                                                  │
│   271 │   │   │   │   │   │   │      "Function is not allowed. You should only implement one "   │
│   272 │   │   │   │   │   │   │      "of them.")                                                 │
│   273 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn                    │
│ ❱ 274 │   │   return user_fn(self, *args)                                                        │
│   275 │                                                                                          │
│   276 │   def apply_jvp(self, *args):                                                            │
│   277 │   │   # _forward_cls is defined by derived class                                         │
│                                                                                                  │
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/site-packages/torch/utils/checkpoint.py:157 in    │
│ backward                                                                                         │
│                                                                                                  │
│   154 │   │   │   raise RuntimeError(                                                            │
│   155 │   │   │   │   "none of output has requires_grad=True,"                                   │
│   156 │   │   │   │   " this checkpoint() is not necessary")                                     │
│ ❱ 157 │   │   torch.autograd.backward(outputs_with_grad, args_with_grad)                         │
│   158 │   │   grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None                  │
│   159 │   │   │   │   │     for inp in detached_inputs)                                          │
│   160                                                                                            │
│                                                                                                  │
│ /home/yabin/miniconda3/envs/hcp/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in   │
│ backward                                                                                         │
│                                                                                                  │
│   197 │   # The reason we repeat same the comment below is that                                  │
│   198 │   # some Python versions print out the first line of a multi-line function               │
│   199 │   # calls in the traceback and some print out the last line                              │
│ ❱ 200 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   201 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   202 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   203                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following
reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are
not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a
workaround if this module graph does not change during training loop.2) Reused parameters in multiple
reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of
your model, it would result in the same set of parameters been used by different reentrant backward passes
multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in
default. You can try to use _set_static_graph() as a workaround if your module graph does not change over
iterations.
Parameter at index 598 has been marked as ready twice. This means that multiple autograd engine  hooks have
fired for this particular parameter during this iteration. You can set the environment variable
TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
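As the message suggests, a natural next step is to rerun with TORCH_DISTRIBUTED_DEBUG set, so that DDP prints the name of the parameter at index 598. The following is only a minimal sketch: it assumes the variable is set from Python before the process group and DDP wrapper are created (exporting it in the shell before launching should have the same effect), and the helper name `enable_ddp_debug` is hypothetical, not part of HCP-Diffusion or PyTorch.

```python
import os

# Values accepted by TORCH_DISTRIBUTED_DEBUG.
VALID_LEVELS = ("OFF", "INFO", "DETAIL")


def enable_ddp_debug(level: str = "DETAIL") -> str:
    """Set TORCH_DISTRIBUTED_DEBUG for this process and its children.

    Hypothetical helper for illustration. It is intended to be called
    before torch.distributed is initialized and before the model is
    wrapped in DDP.
    """
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = level
    return level


# DETAIL is the more verbose of the two levels named in the error message.
enable_ddp_debug("DETAIL")
```

With the variable set, the rerun should report which parameter was marked ready twice, which would help narrow down whether gradient checkpointing is the trigger.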

Config files I use:

_base_: [cfgs/train/train_base.yaml, cfgs/train/tuning_base.yaml]

unet:
  - lr: 1e-6
    layers:
      - ''

text_encoder:
  - lr: 1e-6
    layers:
      - ''

lora_unet: null
lora_text_encoder: null

tokenizer_pt:
  train: null

train:
  gradient_accumulation_steps: 1
  save_step: 100

  scheduler:
    name: 'constant_with_warmup'
    num_warmup_steps: 50
    num_training_steps: 600

model:
  pretrained_model_name_or_path: 'runwayml/stable-diffusion-v1-5'
  tokenizer_repeats: 1
  ema_unet: 0
  ema_text_encoder: 0
  enable_xformers: False

data:
  batch_size: 1
  prompt_template: 'prompt_tuning_template/object.txt'
  caption_file: null
  cache_latents: True
  tag_transforms:
    transforms:
      - _target_: hcpdiff.utils.caption_tools.TagShuffle
      - _target_: hcpdiff.utils.caption_tools.TagDropout
        p: 0.1
      - _target_: hcpdiff.utils.caption_tools.TemplateFill
        word_names:
          pt1: sks
          class: dog
  bucket:
    _target_: hcpdiff.data.bucket.RatioBucket.from_files
    img_root: '/home/yabin/datasets/custom/enma_ai/'
    target_area: {_target_: "builtins.eval", _args_: ['512*512']}
    num_bucket: 1

data_class: null

_base_: [cfgs/train/train_base.yaml, cfgs/train/tuning_base.yaml]

unet:
  - lr: 1e-6
    layers:
      - '' # fine-tuning all layers in unet

# fine-tuning text-encoder
text_encoder:
  - lr: 1e-6
    layers:
      - ''

tokenizer_pt:
  train: null

train:
  gradient_accumulation_steps: 1
  save_step: 100

  scheduler:
    name: 'constant_with_warmup'
    num_warmup_steps: 500
    num_training_steps: 20000

model:
  pretrained_model_name_or_path: 'stabilityai/stable-diffusion-2-1'
#  pretrained_model_name_or_path: '/home/yabin/HCP-Diffusion/converted_models/realismengine'
  tokenizer_repeats: 1
  ema_unet: 0
  ema_text_encoder: 0
  enable_xformers: False

data:
  batch_size: 1
  prompt_template: 'prompt_tuning_template/object.txt'
  caption_file: null
  cache_latents: True
  tag_transforms:
    transforms:
      - _target_: hcpdiff.utils.caption_tools.TagShuffle
      - _target_: hcpdiff.utils.caption_tools.TagDropout
        p: 0.1
      - _target_: hcpdiff.utils.caption_tools.TemplateFill
        word_names: {}
  bucket:
    _target_: hcpdiff.data.bucket.RatioBucket.from_files # aspect ratio bucket
    img_root: '/home/yabin/datasets/custom/enma_ai/'
    target_area: {_target_: "builtins.eval", _args_: ['1024*1024']}
    num_bucket: 1

data_class: null
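For readers unfamiliar with the `_target_` / `_args_` entries in the configs above: they resemble the Hydra-style instantiation convention, where `_target_` names a callable by its dotted path and `_args_` lists its positional arguments. The sketch below is a simplified stand-in written for illustration, not HCP-Diffusion's actual loader; the function names `locate` and `instantiate` here are local to this example. It shows how `target_area` would evaluate under that assumption.

```python
import importlib
from typing import Any


def locate(path: str) -> Any:
    """Resolve a dotted path such as 'builtins.eval' to a Python object."""
    parts = path.split(".")
    # Try the longest module prefix first, then fall back to shorter ones,
    # so that 'pkg.module.Class.method' also resolves.
    for i in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ModuleNotFoundError:
            continue
        for name in parts[i:]:
            obj = getattr(obj, name)
        return obj
    raise ImportError(f"cannot locate {path!r}")


def instantiate(node: dict) -> Any:
    """Call node['_target_'] with node['_args_'] and the remaining keys as kwargs."""
    kwargs = {k: v for k, v in node.items() if k not in ("_target_", "_args_")}
    return locate(node["_target_"])(*node.get("_args_", []), **kwargs)


# The two target_area entries from the configs above.
area_512 = instantiate({"_target_": "builtins.eval", "_args_": ["512*512"]})
area_1024 = instantiate({"_target_": "builtins.eval", "_args_": ["1024*1024"]})
print(area_512, area_1024)  # 262144 1048576
```

Under this reading, the first config trains at a target area of 262144 pixels and the second at 1048576 pixels, so the second run is considerably more memory-hungry per sample.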

Single-card training and multi-GPU training with LoRA both work fine.
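For anyone reproducing this, the number of processes in the failing run can be inferred from the log header. This assumes the "Total train batch size" line follows the common convention of per-device batch size × number of processes × gradient accumulation steps; that formula is not shown in the report, and the helper name below is hypothetical.

```python
def infer_num_processes(total_batch: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Infer the process count, assuming
    total = per_device_batch * num_processes * grad_accum_steps."""
    per_process = per_device_batch * grad_accum_steps
    if per_process <= 0 or total_batch % per_process:
        raise ValueError("total batch size is not a multiple of the per-process batch")
    return total_batch // per_process


# Values taken from the log: total = 5, per device = 1, accumulation steps = 1.
num_processes = infer_num_processes(total_batch=5, per_device_batch=1, grad_accum_steps=1)
print(num_processes)  # 5
```

If the assumption holds, the crash was observed with 5 processes.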

IrisRainbowNeko / HCP-Diffusion, issue #7