Free guidance: IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 9216, 320] at index 0

pearbender commented 3 months ago

Line 180 here fails in stage 1 training.

https://github.com/fudan-generative-vision/champ/blob/02a9a24a9183727dcbb8eb432b46b3a19302bcb8/models/mutual_self_attention.py#L166-L186

To prevent this error, I had to set

do_classifier_free_guidance = False

However, I do not know how this will affect the result.

Here is my terminal log.

(env) C:\Users\user\code\champ>accelerate launch train_s1.py --config configs/train/stage1.yaml
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
05/13/2024 12:50:07 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'force_upcast', 'scaling_factor'} was not found in config. Values will be initialized to default values.
{'mid_block_only_cross_attention', 'addition_time_embed_dim', 'cross_attention_norm', 'class_embeddings_concat', 'reverse_transformer_layers_per_block', 'encoder_hid_dim', 'class_embed_type', 'num_attention_heads', 'encoder_hid_dim_type', 
'projection_class_embeddings_input_dim', 'addition_embed_type_num_heads', 'addition_embed_type', 'dropout', 'resnet_time_scale_shift', 'time_cond_proj_dim', 'time_embedding_act_fn', 'resnet_out_scale_factor', 'dual_cross_attention', 'only_cross_attention', 'resnet_skip_time_act', 'conv_out_kernel', 'transformer_layers_per_block', 'use_linear_projection', 'num_class_embeds', 'upcast_attention', 'conv_in_kernel', 'timestep_post_act', 'time_embedding_type', 'attention_type', 'mid_block_type', 'time_embedding_dim'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: 
 ['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
05/13/2024 12:50:17 - INFO - models.unet_3d - loaded temporal unet's pretrained weights from pretrained_models\stable-diffusion-v1-5\unet ...
{'motion_module_mid_block', 'use_linear_projection', 'num_class_embeds', 'upcast_attention', 'use_inflated_groupnorm', 'unet_use_cross_frame_attention', 'class_embed_type', 'motion_module_type', 'dual_cross_attention', 'only_cross_attention', 'motion_module_decoder_only', 'motion_module_kwargs', 'resnet_time_scale_shift', 'motion_module_resolutions'} was not found in config. Values will be initialized to default values.
05/13/2024 12:50:20 - INFO - models.unet_3d - Loaded 0.0M-parameter motion module
05/13/2024 12:50:25 - INFO - __main__ - Start training ...
05/13/2024 12:50:25 - INFO - __main__ - Num Samples: 1
05/13/2024 12:50:25 - INFO - __main__ - Train Batchsize: 1
05/13/2024 12:50:25 - INFO - __main__ - Num Epochs: 100000
05/13/2024 12:50:25 - INFO - __main__ - Total Steps: 100000
Steps:   0%|                                                                                                                                                                                           | 1/100000 [00:33<940:12:12, 33.85s/it]05/13/2024 12:51:00 - INFO - __main__ - Running validation ...
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [18:59<00:00, 57.00s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.56s/it]
Steps:   0%|                                                                                                                                                         | 1/100000 [20:21<940:12:12, 33.85s/it, lr=1e-5, stage=1, step_loss=1.48]Traceback (most recent call last):
  File "C:\Users\user\code\champ\train_s1.py", line 675, in <module>
    main(config)
  File "C:\Users\user\code\champ\train_s1.py", line 495, in main
    model_pred = model(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\utils\operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\utils\operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\Users\user\code\champ\models\champ_model.py", line 63, in forward
    model_pred = self.denoising_unet(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\unet_3d.py", line 493, in forward
    sample, res_samples = downsample_block(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\unet_3d_blocks.py", line 442, in forward
    hidden_states = attn(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\transformer_3d.py", line 141, in forward
    hidden_states = block(
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\user\code\champ\env\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\user\code\champ\models\mutual_self_attention.py", line 181, in hacked_basic_transformer_inner_forward
    norm_hidden_states[_uc_mask],
IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 9216, 320] at index 0
Steps:   0%|                                                                                                                                                     | 1/100000 [20:36<34353:38:21, 1236.74s/it, lr=1e-5, stage=1, step_loss=1.48]
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\user\code\champ\env\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\commands\launch.py", line 979, in launch_command
    simple_launcher(args)
  File "C:\Users\user\code\champ\env\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\user\\code\\champ\\env\\Scripts\\python.exe', 'train_s1.py', '--config', 'configs/train/stage1.yaml']' returned non-zero exit status 1.
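
One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: if the linked code rebuilds the unconditional mask by splitting the batch into two equal halves (an assumption about mutual_self_attention.py, not verified here), then a batch of 1 produces a mask of length 0. That would match "mask [0]" being applied to a tensor whose first dimension is 1. The helper name build_half_batch_mask is hypothetical, and plain Python lists stand in for tensors.

```python
def build_half_batch_mask(batch_size: int) -> list[bool]:
    """Hypothetical helper: mark the first half of the batch as unconditional.

    Assumes the batch is laid out as [unconditional..., conditional...],
    which is how a CFG inference batch is commonly stacked.
    """
    half = batch_size // 2  # integer division: 1 // 2 == 0
    return [True] * half + [False] * half


for batch_size in (1, 2, 4):
    mask = build_half_batch_mask(batch_size)
    # The mask can only index the batch if both lengths agree.
    status = "ok" if len(mask) == batch_size else "shape mismatch"
    print(f"batch={batch_size} mask_len={len(mask)} -> {status}")
```

Under this assumption, train_bs: 1 is the only case shown that mismatches, since training does not double the batch the way CFG inference does.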

Here is my stage1.yaml.

exp_name: 'stage1'
output_dir: './exp_output'
seed: 42
resume_from_checkpoint: ''

checkpointing_steps: 2000
save_model_epoch_interval: 20

data:
  train_bs: 1
  video_folder: './training_data' # Your data root folder
  guids: 
    - 'depth'
    - 'normal'
    - 'semantic_map'
    - 'dwpose'
  image_size: 768
  bbox_crop: false
  bbox_resize_ratio: [0.9, 1.5]
  aug_type: "Resize"
  data_parts:
    - "all"
  sample_margin: 30

validation:
  validation_steps: 1000
  ref_images:
    - ./reference_imgs/images/ref-01.png
  guidance_folders:
    - ./training_data/1feec204f03a1a779085107b375df72a
  guidance_indexes: [0, 30, 60, 90, 120]            

solver:
  gradient_accumulation_steps: 1
  mixed_precision: 'fp16'
  enable_xformers_memory_efficient_attention: True 
  gradient_checkpointing: False 
  max_train_steps: 100000  # 50000
  max_grad_norm: 1.0
  # lr
  learning_rate: 1.0e-5
  scale_lr: False 
  lr_warmup_steps: 1
  lr_scheduler: 'constant'

  # optimizer
  use_8bit_adam: False 
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_weight_decay:  1.0e-2
  adam_epsilon: 1.0e-8

noise_scheduler_kwargs:
  num_train_timesteps: 1000
  beta_start:          0.00085
  beta_end:            0.012
  beta_schedule:       "scaled_linear"
  steps_offset:        1
  clip_sample:         false

guidance_encoder_kwargs:
  guidance_embedding_channels: 320
  guidance_input_channels: 3
  block_out_channels: [16, 32, 96, 256]

base_model_path: 'pretrained_models/stable-diffusion-v1-5'
vae_model_path: 'pretrained_models/sd-vae-ft-mse'
image_encoder_path: 'pretrained_models/image_encoder'

weight_dtype: 'fp16'  # [fp16, fp32]
uncond_ratio: 0.1
noise_offset: 0.05
snr_gamma: 5.0
enable_zero_snr: True

Leoooo333 commented 3 months ago

Hi @pearbender, you don't need to set do_classifier_free_guidance to True during training, even if you want to train with CFG.

During training, classifier-free guidance works by randomly choosing between conditional and unconditional input, with the unconditional fraction given by uncond_ratio (0.1 in your config). Set the ratio to 0 if you want to disable CFG training.
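
As a rough illustration of the idea above (a minimal sketch, not the repository's training loop), the conditioning for each training step can be swapped for a null input with probability uncond_ratio. The function name maybe_drop_condition and the use of an all-zero list as the "unconditional" input are assumptions made for this example.

```python
import random


def maybe_drop_condition(cond_embedding, uncond_ratio, rng):
    """Return a zeroed (unconditional) embedding with probability uncond_ratio.

    Hypothetical helper: a list of floats stands in for the real
    conditioning tensor, and zeros stand in for the null condition.
    """
    if rng.random() < uncond_ratio:  # rng.random() is in [0.0, 1.0)
        return [0.0] * len(cond_embedding)
    return list(cond_embedding)


rng = random.Random(42)
cond = [0.3, -1.2, 0.8]
steps = 10_000
dropped = sum(
    maybe_drop_condition(cond, 0.1, rng) == [0.0, 0.0, 0.0] for _ in range(steps)
)
print(f"unconditional steps: {dropped / steps:.3f}")  # roughly uncond_ratio
```

With uncond_ratio set to 0 the comparison is never true, so every step stays conditional, which is the "disable CFG training" case.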

At inference time, set do_classifier_free_guidance=True to enable CFG. You may also find cfg_scale helpful.
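
For context, the standard classifier-free guidance combination at inference can be sketched as follows. This shows the usual formula, assumed here to be how cfg_scale is applied; the function name apply_cfg is hypothetical and lists of floats stand in for the two noise predictions.

```python
def apply_cfg(noise_uncond, noise_cond, cfg_scale):
    """Standard CFG mix: uncond + scale * (cond - uncond), element by element."""
    return [u + cfg_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


noise_uncond = [0.0, 1.0]
noise_cond = [1.0, 1.0]

# Scale 1.0 reproduces the conditional prediction; larger values push
# further away from the unconditional prediction.
for cfg_scale in (0.0, 1.0, 2.0):
    print(cfg_scale, apply_cfg(noise_uncond, noise_cond, cfg_scale))
```

This is why inference needs both an unconditional and a conditional forward pass, while a training step only needs one of them.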

pearbender commented 3 months ago

@Leoooo333 Currently, during stage 1 training, do_classifier_free_guidance is True by default, which causes the error I posted. If it is OK to set it to False during stage 1 training, then the code should be changed, right?

Beijia11 commented 2 months ago

Hi, I have also hit this error in stage 1 training. Is it OK to set do_classifier_free_guidance to False during stage 1 training? @pearbender

fudan-generative-vision / champ

Free guidance: IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 9216, 320] at index 0 #117