THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0
9.1k stars, 857 forks

Subject is spatially distorted in generated videos after full-parameter finetune #116

Closed: CacacaLalala closed this issue 3 months ago

CacacaLalala commented 3 months ago

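Hello! This is a great open-source repo. I recently tried both LoRA and full-parameter fine-tuning, using the same 50 videos and 500 iterations in both cases, with all other settings left unchanged. After full-parameter fine-tuning, the subject of the generated video is badly distorted, whereas LoRA fine-tuning shows no such obvious distortion. Below are the results for the same prompt, "spider making a web".

After full-parameter fine-tuning: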

https://github.com/user-attachments/assets/19f4f8bb-973c-4b42-9a0c-422c1af29af0

After LoRA fine-tuning:

https://github.com/user-attachments/assets/4ab01f8b-8e05-4fbc-b038-a2796c69adfa

What could be causing this? Is the learning rate too high during fine-tuning? Looking forward to your reply, thank you.

zRzRzRzRzRzRzR commented 3 months ago

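How much data did you use for fine-tuning? We recommend using 100 similar videos. Also, did you use the default configuration, and can you share how the loss decreased?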

CacacaLalala commented 3 months ago

Thank you for your reply! I want to continue training from your model weights on other data, so I first randomly sampled 50 videos from the dataset. I used the default configuration; the training_config is as follows:

    args:
      checkpoint_activations: true
      model_parallel_size: 1
      experiment_name: finetune-openvid-framesmin180-max500-origin-dataset
      mode: finetune
      load: CogVideoX-2b-sat/transformer
      no_load_rng: true
      train_iters: 10000
      eval_iters: 1
      eval_interval: 10000
      eval_batch_size: 1
      save: output
      save_interval: 100
      log_interval: 20
      train_data:

https://github.com/user-attachments/assets/90ec5432-c226-4933-8c04-89a58df31e43

4000 iterations:

https://github.com/user-attachments/assets/73957b79-f1aa-4077-8e9d-9cab11e2da53

Looking forward to your reply.

tengjiayan20 commented 3 months ago

Yes, for lora, lr 1e-4~1e-3 is OK. But for full-parameter fine-tune, lr 1e-5 is OK. We will update config files and fine-tune instructions soon.
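As a rough illustration of the advice above, the sketch below clamps the optimizer learning rate in a config dictionary to the range suggested for each fine-tuning mode. The helper name `with_learning_rate`, the `SUGGESTED_LR` table, and the nested key layout `deepspeed.optimizer.params.lr` are assumptions made for this example (the layout is inferred from the configuration posted in this thread); this is not code from the repository.

```python
from copy import deepcopy

# Learning-rate guidance taken from the maintainer's comment:
# LoRA tolerates 1e-4 to 1e-3, full-parameter fine-tuning should use 1e-5.
SUGGESTED_LR = {
    "lora": (1e-4, 1e-3),
    "full": (1e-5, 1e-5),
}


def with_learning_rate(config: dict, mode: str) -> dict:
    """Return a copy of config whose lr is clamped to the suggested range."""
    low, high = SUGGESTED_LR[mode]
    patched = deepcopy(config)  # leave the caller's config untouched
    params = patched["deepspeed"]["optimizer"]["params"]
    params["lr"] = min(max(params["lr"], low), high)
    return patched


# Hypothetical excerpt of the posted config (lr: 0.0002 with FusedEmaAdam).
posted = {
    "deepspeed": {
        "optimizer": {
            "type": "sat.ops.FusedEmaAdam",
            "params": {"lr": 0.0002},
        }
    }
}

full_cfg = with_learning_rate(posted, "full")
lora_cfg = with_learning_rate(posted, "lora")
print(full_cfg["deepspeed"]["optimizer"]["params"]["lr"])
print(lora_cfg["deepspeed"]["optimizer"]["params"]["lr"])
```

With these inputs, the posted value of 0.0002 already sits inside the LoRA range and is left alone, while the full-parameter copy is lowered to 1e-5.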

CacacaLalala commented 3 months ago

Are there other factors besides the learning rate? I am currently using a learning rate of 1e-5, but as training progresses I still observe a gradual decline in spatial quality. Looking forward to your reply!

tengjiayan20 commented 3 months ago

Is the prompt you use, "spider making a web", too different from your SFT training data? And what is the total batch size? In theory, with a small dataset of 50 videos, too much training will make the model overfit the data, so that it produces nearly identical videos.

CacacaLalala commented 3 months ago

The total batch size is 24*2, and I am now using a dataset of about 1M videos (100w) by changing the dataset part of the config. I will test again after more training iterations. Thanks a lot!
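To make the overfitting concern concrete, the sketch below estimates how many times training passes over the dataset. It reads "24*2" as 48 samples per optimizer step and "100w" as roughly one million videos, and it assumes the earlier 50-video run used the same batch size, which the thread does not state. The function name `passes_over_dataset` is made up for this example.

```python
def passes_over_dataset(iterations: int, total_batch_size: int, num_videos: int) -> float:
    """Approximate number of epochs: samples consumed divided by dataset size."""
    return iterations * total_batch_size / num_videos


TOTAL_BATCH_SIZE = 24 * 2  # assumed to mean 48 samples per optimizer step

# 50-video subset at the two checkpoints compared in this thread.
small_500 = passes_over_dataset(500, TOTAL_BATCH_SIZE, 50)
small_4000 = passes_over_dataset(4000, TOTAL_BATCH_SIZE, 50)

# Roughly one million videos ("100w") at the same iteration count.
large_4000 = passes_over_dataset(4000, TOTAL_BATCH_SIZE, 1_000_000)

print(small_500, small_4000, large_4000)
```

Under these assumptions the 50-video run sees every clip hundreds of times by iteration 500, while the large dataset has not completed a single pass after 4000 iterations, which is consistent with the overfitting explanation given above.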

GFENGG commented 3 months ago

> Thank you for your reply! I want to continue training from your model weights on other data, so I first randomly sampled 50 videos from the dataset. I used the default configuration; the training_config is as follows:
>
>     args:
>       checkpoint_activations: true
>       model_parallel_size: 1
>       experiment_name: finetune-openvid-framesmin180-max500-origin-dataset
>       mode: finetune
>       load: CogVideoX-2b-sat/transformer
>       no_load_rng: true
>       train_iters: 10000
>       eval_iters: 1
>       eval_interval: 10000
>       eval_batch_size: 1
>       save: output
>       save_interval: 100
>       log_interval: 20
>       train_data:
>       - dataset/mini_dataset/cogvideo/videos
>       valid_data:
>       - dataset/mini_dataset/cogvideo/videos
>       split: 1,0,0
>       num_workers: 8
>       force_train: true
>       only_log_video_latents: true
>     data:
>       target: data_video.SFTDataset
>       params:
>         video_size:
>         - 480
>         - 720
>         fps: 8
>         max_num_frames: 49
>         skip_frms_num: 3.0
>     deepspeed:
>       train_micro_batch_size_per_gpu: 1
>       gradient_accumulation_steps: 1
>       steps_per_print: 50
>       gradient_clipping: 0.1
>       zero_optimization:
>         stage: 2
>         cpu_offload: false
>         contiguous_gradients: false
>         overlap_comm: true
>         reduce_scatter: true
>         reduce_bucket_size: 1000000000
>         allgather_bucket_size: 1000000000
>         load_from_fp32_weights: false
>       zero_allow_untested_optimizer: true
>       bf16:
>         enabled: false
>       fp16:
>         enabled: true
>         loss_scale: 0
>         loss_scale_window: 400
>         hysteresis: 2
>         min_loss_scale: 1
>       optimizer:
>         type: sat.ops.FusedEmaAdam
>         params:
>           lr: 0.0002
>           betas:
>           - 0.9
>           - 0.95
>           eps: 1.0e-08
>           weight_decay: 0.0001
>       activation_checkpointing:
>         partition_activations: false
>         contiguous_memory_optimization: false
>       wall_clock_breakdown: false
>     model:
>       scale_factor: 1.15258426
>       disable_first_stage_autocast: true
>       log_keys:
>       - txt
>       denoiser_config:
>         target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
>         params:
>           num_idx: 1000
>           quantize_c_noise: false
>           weighting_config:
>             target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
>           scaling_config:
>             target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
>           discretization_config:
>             target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
>             params:
>               shift_scale: 3.0
>       network_config:
>         target: dit_video_concat.DiffusionTransformer
>         params:
>           time_embed_dim: 512
>           elementwise_affine: true
>           num_frames: 49
>           time_compressed_rate: 4
>           latent_width: 90
>           latent_height: 60
>           num_layers: 30
>           patch_size: 2
>           in_channels: 16
>           out_channels: 16
>           hidden_size: 1920
>           adm_in_channels: 256
>           num_attention_heads: 30
>           transformer_args:
>             checkpoint_activations: true
>             vocab_size: 1
>             max_sequence_length: 64
>             layernorm_order: pre
>             skip_init: false
>             model_parallel_size: 1
>             is_decoder: false
>           modules:
>             pos_embed_config:
>               target: dit_video_concat.Basic3DPositionEmbeddingMixin
>               params:
>                 text_length: 226
>                 height_interpolation: 1.875
>                 width_interpolation: 1.875
>             patch_embed_config:
>               target: dit_video_concat.ImagePatchEmbeddingMixin
>               params:
>                 text_hidden_size: 4096
>             adaln_layer_config:
>               target: dit_video_concat.AdaLNMixin
>               params:
>                 qk_ln: true
>             final_layer_config:
>               target: dit_video_concat.FinalLayerMixin
>       conditioner_config:
>         target: sgm.modules.GeneralConditioner
>         params:
>           emb_models:
>           - is_trainable: false
>             input_key: txt
>             ucg_rate: 0.1
>             target: sgm.modules.encoders.modules.FrozenT5Embedder
>             params:
>               model_dir: ckpts/cogvideo/t5-v1_1-xxl
>               max_length: 226
>       first_stage_config:
>         target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
>         params:
>           cp_size: 1
>           ckpt_path: CogVideoX-2b-sat/vae/3d-vae.pt
>           ignore_keys:
>           - loss
>           loss_config:
>             target: torch.nn.Identity
>           regularizer_config:
>             target: vae_modules.regularizers.DiagonalGaussianRegularizer
>           encoder_config:
>             target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
>             params:
>               double_z: true
>               z_channels: 16
>               resolution: 256
>               in_channels: 3
>               out_ch: 3
>               ch: 128
>               ch_mult:
>               - 1
>               - 2
>               - 2
>               - 4
>               attn_resolutions: []
>               num_res_blocks: 3
>               dropout: 0.0
>               gather_norm: true
>           decoder_config:
>             target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
>             params:
>               double_z: true
>               z_channels: 16
>               resolution: 256
>               in_channels: 3
>               out_ch: 3
>               ch: 128
>               ch_mult:
>               - 1
>               - 2
>               - 2
>               - 4
>               attn_resolutions: []
>               num_res_blocks: 3
>               dropout: 0.0
>               gather_norm: false
>       loss_fn_config:
>         target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
>         params:
>           offset_noise_level: 0
>           sigma_sampler_config:
>             target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
>             params:
>               uniform_sampling: true
>               num_idx: 1000
>               discretization_config:
>                 target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
>                 params:
>                   shift_scale: 3.0
>       sampler_config:
>         target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
>         params:
>           num_steps: 50
>           verbose: true
>           discretization_config:
>             target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
>             params:
>               shift_scale: 3.0
>           guider_config:
>             target: sgm.modules.diffusionmodules.guiders.DynamicCFG
>             params:
>               scale: 6
>               exp: 5
>               num_steps: 50
>
> Sorry, I have not modified the repo much yet, so I have not recorded the loss. Did you observe this kind of spatial distortion when doing full-parameter fine-tuning? After I lowered the learning rate the problem improved, but the distortion still becomes more severe as training progresses.
>
> 500 iterations: 000000.mp4
>
> 4000 iterations: 000000.mp4
>
> Looking forward to your reply.

The result at 4000 steps also looks fairly normal. What exactly do you mean by the distortion problem here?
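The training_config quoted in this thread fixes both the pixel size and the latent size of each clip, so a quick arithmetic check can confirm that the values agree before training. The sketch below copies the numbers from that config; the frame-compression formula (first frame kept, remaining frames compressed in time) and the token layout (text tokens concatenated with one token per latent patch) are assumptions about how the model consumes these values, not statements taken from the thread.

```python
# Values copied from the training_config quoted in this thread.
video_height, video_width = 480, 720
latent_height, latent_width = 60, 90
num_frames, time_compressed_rate = 49, 4
patch_size, text_length = 2, 226

# Spatial compression implied by the pixel and latent sizes.
spatial_factor_h = video_height // latent_height
spatial_factor_w = video_width // latent_width

# Assumption: the first frame is kept and the rest are compressed in time.
latent_frames = (num_frames - 1) // time_compressed_rate + 1

# Assumption: each patch_size x patch_size latent patch becomes one token,
# and the text tokens are concatenated with the video tokens.
video_tokens = latent_frames * (latent_height // patch_size) * (latent_width // patch_size)
sequence_length = text_length + video_tokens

print(spatial_factor_h, spatial_factor_w, latent_frames, sequence_length)
```

If a dataset or resolution change breaks one of these relationships (for example a video_size that is not a multiple of the spatial factor), that would be worth ruling out before attributing distortion to the learning rate alone.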

CacacaLalala commented 3 months ago

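The distortion I mentioned at the beginning was that the spatial structure looked somewhat implausible. I have now trained for a few more days and just tested again; the results look normal now. Thank you.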

GFENGG commented 3 months ago

I am also trying to fine-tune. So was the implausible spatial structure resolved by lowering the learning rate and training for longer?

CacacaLalala commented 2 months ago

So far, that seems to be the case.

a-r-r-o-w commented 2 months ago

Hey everyone! I have a few questions on finetuning that I would love if you could answer:

Thanks to everyone in advance! I might bother you with some more questions