Given groups=1, weight of size [1152, 12, 2, 2], expected input[8, 16, 64, 64] to have 12 channels, but got 16 channels instead

chintha commented 3 months ago

I tried to run a training session but hit the error below inside the training_losses function:

Exception has occurred: RuntimeError
Given groups=1, weight of size [1152, 12, 2, 2], expected input[8, 16, 64, 64] to have 12 channels, but got 16 channels instead
  File "/home/jeewantha-chintha/EasyAnimate/EasyAnimate/easyanimate/models/transformer3d.py", line 540, in forward
    hidden_states = self.pos_embed(hidden_states)
  File "/home/jeewantha-chintha/EasyAnimate/EasyAnimate/easyanimate/utils/respace.py", line 131, in __call__
    return self.model(x, timestep=new_ts, **kwargs)
  File "/home/jeewantha-chintha/EasyAnimate/EasyAnimate/easyanimate/utils/gaussian_diffusion.py", line 751, in training_losses
    model_output = model(x_t, timestep=t, **model_kwargs)[0]
  File "/home/jeewantha-chintha/EasyAnimate/EasyAnimate/easyanimate/utils/respace.py", line 94, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/home/jeewantha-chintha/EasyAnimate/EasyAnimate/scripts/train.py", line 1609, in main
    loss_term = train_diffusion.training_losses(
  File "/home/jeewantha-chintha/EasyAnimate/EasyAnimate/scripts/train.py", line 1757, in <module>
    main()
RuntimeError: Given groups=1, weight of size [1152, 12, 2, 2], expected input[8, 16, 64, 64] to have 12 channels, but got 16 channels instead
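
As a reading aid, the shapes in this message can be decoded mechanically. The following is a minimal sketch, assuming the standard PyTorch Conv2d layout in which the weight is [out_channels, in_channels / groups, kernel_h, kernel_w] and the input is [batch, channels, height, width]. The helper name decode_conv_mismatch is hypothetical and is not part of EasyAnimate.

```python
import re


def decode_conv_mismatch(message):
    """Pull the conv weight and input shapes out of a channel-mismatch message.

    Hypothetical helper for illustration only; it assumes the usual
    PyTorch Conv2d layouts described in the text above.
    """
    groups = int(re.search(r"groups=(\d+)", message).group(1))
    weight = [int(v) for v in
              re.search(r"weight of size \[([\d, ]+)\]", message).group(1).split(",")]
    inp = [int(v) for v in
           re.search(r"input\[([\d, ]+)\]", message).group(1).split(",")]
    return {
        "out_channels": weight[0],
        # The weight stores in_channels / groups, so multiply back by groups.
        "expected_in_channels": weight[1] * groups,
        "kernel_size": tuple(weight[2:]),
        "batch": inp[0],
        "actual_in_channels": inp[1],
        "spatial": tuple(inp[2:]),
    }


msg = ("Given groups=1, weight of size [1152, 12, 2, 2], "
       "expected input[8, 16, 64, 64] to have 12 channels, "
       "but got 16 channels instead")
info = decode_conv_mismatch(msg)
print(info)
```

Read this way, the convolution reached through self.pos_embed has weights for 12 input channels and a 2x2 kernel, while the tensor passed to it has 16 channels at 64x64 resolution.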

yunkchen commented 3 months ago

How did you modify the training config? Could you share it?

chintha commented 3 months ago

transformer_additional_kwargs:
  patch_3d: false
  fake_3d: false
  basic_block_type: "global_motionmodule"
  time_position_encoding_before_transformer: false
  motion_module_type: "Vanilla"
  enable_uvit: true

  motion_module_kwargs_even:
    num_attention_heads: 16
    num_transformer_block: 1
    attention_block_types: [ "Temporal_Self", "Temporal_Self" ]
    temporal_position_encoding: true
    temporal_position_encoding_max_len: 4096
    temporal_attention_dim_div: 1
    block_size: 1
    remove_time_embedding_in_photo: false
  motion_module_kwargs_odd:
    num_attention_heads: 16
    num_transformer_block: 1
    attention_block_types: [ "Temporal_Self", "Global_Self" ]
    temporal_position_encoding: true
    temporal_position_encoding_max_len: 4096
    temporal_attention_dim_div: 1
    block_size: 1
    remove_time_embedding_in_photo: false

noise_scheduler_kwargs:
  beta_start: 0.0001
  beta_end: 0.02
  beta_schedule: "linear"
  steps_offset: 1

vae_kwargs:
  enable_magvit: true

enable_multi_text_encoder: false

chintha commented 3 months ago

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Train Model",
      "type": "python",
      "request": "launch",
      "pythonPath": "/home/jeewantha-chintha/EasyAnimate/venv/bin/python", // For older VSCode versions
      "program": "${workspaceFolder}/scripts/train.py",
      "args": [
        "--pretrained_model_name_or_path=models/Diffusion_Transformer/EasyAnimateV3-XL-2-InP-512x512",
        "--train_data_dir=/home/jeewantha-chintha/EasyAnimate/EasyAnimate/datasets/internal_datasets/videos/",
        "--train_data_meta=/home/jeewantha-chintha/EasyAnimate/EasyAnimate/datasets/internal_datasets/json_of_internal_datasets.json",
        "--config_path=config/easyanimate_video_slicevae_motion_module_v3.yaml",
        "--image_sample_size=512",
        "--video_sample_size=512",
        "--video_sample_stride=1",
        "--video_sample_n_frames=36",
        "--train_batch_size=1",
        "--video_repeat=1",
        "--gradient_accumulation_steps=1",
        "--dataloader_num_workers=4",
        "--num_train_epochs=10",
        "--checkpointing_steps=500",
        "--learning_rate=2e-05",
        "--lr_scheduler=constant_with_warmup",
        "--lr_warmup_steps=100",
        "--seed=42",
        "--output_dir=output_dir",
        "--enable_xformers_memory_efficient_attention",
        "--gradient_checkpointing",
        "--mixed_precision=bf16",
        "--adam_weight_decay=0.03",
        "--adam_epsilon=1e-10",
        "--max_grad_norm=1",
        "--vae_mini_batch=1",
        "--random_frame_crop",
        "--enable_bucket",
        "--train_mode=normal",
        "--trainable_modules=transformer_blocks,proj_out,pos_embed,long_connect_fc"
      ],
      "console": "integratedTerminal",
      "env": {
        "NCCL_IB_DISABLE": "1",
        "NCCL_P2P_DISABLE": "1",
        "NCCL_DEBUG": "INFO"
      },
      "cwd": "${workspaceFolder}"
    }
  ]
}

yunkchen commented 2 months ago

Try rolling 'video_sample_n_frames' back to 144.

aigc-apps / EasyAnimate, issue #82