Subject of generated videos is spatially distorted after full-parameter finetune

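Hello! This is a great open-source repo. I recently tried both LoRA and full-parameter fine-tuning, using the same 50 videos and 500 fine-tuning iterations in both cases, with all other settings unchanged. After full-parameter fine-tuning, the subject in the generated videos is badly distorted, while LoRA fine-tuning shows no such obvious distortion. Below are the results for the same prompt, "spider making a web".

After full-parameter fine-tuning: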

https://github.com/user-attachments/assets/19f4f8bb-973c-4b42-9a0c-422c1af29af0

After LoRA fine-tuning:

https://github.com/user-attachments/assets/4ab01f8b-8e05-4fbc-b038-a2796c69adfa

I am not sure what causes this. Is the learning rate used for fine-tuning too high? Looking forward to your reply, thanks.

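How much data did you use for fine-tuning? We recommend using about 100 similar videos. Also, did you use the default configuration, and can you share how the loss decreased?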

Thanks for your reply! I want to continue training from your model weights on other data, so as a first step I randomly sampled 50 videos from my dataset. I used the default configuration; the training_config is as follows:

    args:
      checkpoint_activations: true
      model_parallel_size: 1
      experiment_name: finetune-openvid-framesmin180-max500-origin-dataset
      mode: finetune
      load: CogVideoX-2b-sat/transformer
      no_load_rng: true
      train_iters: 10000
      eval_iters: 1
      eval_interval: 10000
      eval_batch_size: 1
      save: output
      save_interval: 100
      log_interval: 20
      train_data:
      - dataset/mini_dataset/cogvideo/videos
      valid_data:
      - dataset/mini_dataset/cogvideo/videos
      split: 1,0,0
      num_workers: 8
      force_train: true
      only_log_video_latents: true
    data:
      target: data_video.SFTDataset
      params:
        video_size:
        - 480
        - 720
        fps: 8
        max_num_frames: 49
        skip_frms_num: 3.0
    deepspeed:
      train_micro_batch_size_per_gpu: 1
      gradient_accumulation_steps: 1
      steps_per_print: 50
      gradient_clipping: 0.1
      zero_optimization:
        stage: 2
        cpu_offload: false
        contiguous_gradients: false
        overlap_comm: true
        reduce_scatter: true
        reduce_bucket_size: 1000000000
        allgather_bucket_size: 1000000000
        load_from_fp32_weights: false
      zero_allow_untested_optimizer: true
      bf16:
        enabled: false
      fp16:
        enabled: true
      loss_scale: 0
      loss_scale_window: 400
      hysteresis: 2
      min_loss_scale: 1
      optimizer:
        type: sat.ops.FusedEmaAdam
        params:
          lr: 0.0002
          betas:
          - 0.9
          - 0.95
          eps: 1.0e-08
          weight_decay: 0.0001
      activation_checkpointing:
        partition_activations: false
        contiguous_memory_optimization: false
      wall_clock_breakdown: false
    model:
      scale_factor: 1.15258426
      disable_first_stage_autocast: true
      log_keys:
      - txt
      denoiser_config:
        target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
        params:
          num_idx: 1000
          quantize_c_noise: false
          weighting_config:
            target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
          scaling_config:
            target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0
      network_config:
        target: dit_video_concat.DiffusionTransformer
        params:
          time_embed_dim: 512
          elementwise_affine: true
          num_frames: 49
          time_compressed_rate: 4
          latent_width: 90
          latent_height: 60
          num_layers: 30
          patch_size: 2
          in_channels: 16
          out_channels: 16
          hidden_size: 1920
          adm_in_channels: 256
          num_attention_heads: 30
          transformer_args:
            checkpoint_activations: true
            vocab_size: 1
            max_sequence_length: 64
            layernorm_order: pre
            skip_init: false
            model_parallel_size: 1
            is_decoder: false
          modules:
            pos_embed_config:
              target: dit_video_concat.Basic3DPositionEmbeddingMixin
              params:
                text_length: 226
                height_interpolation: 1.875
                width_interpolation: 1.875
            patch_embed_config:
              target: dit_video_concat.ImagePatchEmbeddingMixin
              params:
                text_hidden_size: 4096
            adaln_layer_config:
              target: dit_video_concat.AdaLNMixin
              params:
                qk_ln: true
            final_layer_config:
              target: dit_video_concat.FinalLayerMixin
      conditioner_config:
        target: sgm.modules.GeneralConditioner
        params:
          emb_models:
          - is_trainable: false
            input_key: txt
            ucg_rate: 0.1
            target: sgm.modules.encoders.modules.FrozenT5Embedder
            params:
              model_dir: ckpts/cogvideo/t5-v1_1-xxl
              max_length: 226
      first_stage_config:
        target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
        params:
          cp_size: 1
          ckpt_path: CogVideoX-2b-sat/vae/3d-vae.pt
          ignore_keys:
          - loss
          loss_config:
            target: torch.nn.Identity
          regularizer_config:
            target: vae_modules.regularizers.DiagonalGaussianRegularizer
          encoder_config:
            target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
            params:
              double_z: true
              z_channels: 16
              resolution: 256
              in_channels: 3
              out_ch: 3
              ch: 128
              ch_mult:
              - 1
              - 2
              - 2
              - 4
              attn_resolutions: []
              num_res_blocks: 3
              dropout: 0.0
              gather_norm: true
          decoder_config:
            target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
            params:
              double_z: true
              z_channels: 16
              resolution: 256
              in_channels: 3
              out_ch: 3
              ch: 128
              ch_mult:
              - 1
              - 2
              - 2
              - 4
              attn_resolutions: []
              num_res_blocks: 3
              dropout: 0.0
              gather_norm: false
      loss_fn_config:
        target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
        params:
          offset_noise_level: 0
          sigma_sampler_config:
            target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
            params:
              uniform_sampling: true
              num_idx: 1000
              discretization_config:
                target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
                params:
                  shift_scale: 3.0
      sampler_config:
        target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
        params:
          num_steps: 50
          verbose: true
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0
          guider_config:
            target: sgm.modules.diffusionmodules.guiders.DynamicCFG
            params:
              scale: 6
              exp: 5
              num_steps: 50

Sorry, I haven't modified the repo much yet, so I haven't recorded the loss. Have you observed this kind of spatial distortion when doing full-parameter fine-tuning? After I lowered the learning rate the problem improved somewhat, but the distortion still gets worse and worse as training goes on.

After 500 iterations:

https://github.com/user-attachments/assets/90ec5432-c226-4933-8c04-89a58df31e43

After 4000 iterations:

https://github.com/user-attachments/assets/73957b79-f1aa-4077-8e9d-9cab11e2da53

Looking forward to your reply!
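As a rough sanity check on the posted config, the transformer sequence length can be estimated from the values it contains (`video_size` 480x720, `latent_height` 60, `latent_width` 90, `num_frames` 49, `time_compressed_rate` 4, `patch_size` 2, `text_length` 226). The sketch below is only an estimate: the function name is made up for illustration, and the frame-compression formula (first frame kept, remaining frames grouped by `time_compressed_rate`) is an assumption, not something stated in this thread.

```python
def estimate_sequence_length(height, width, num_frames, latent_height,
                             latent_width, time_compressed_rate,
                             patch_size, text_length):
    """Estimate token counts from the config values (hypothetical helper)."""
    # Spatial downsampling factor implied by the config: pixels / latent size.
    spatial_factor = (height // latent_height, width // latent_width)
    # ASSUMPTION: the first frame is kept, the rest are compressed in groups.
    latent_frames = (num_frames - 1) // time_compressed_rate + 1
    # Each latent frame is split into patch_size x patch_size patches.
    patches_per_frame = (latent_height // patch_size) * (latent_width // patch_size)
    video_tokens = latent_frames * patches_per_frame
    return {
        "spatial_factor": spatial_factor,
        "latent_frames": latent_frames,
        "patches_per_frame": patches_per_frame,
        "video_tokens": video_tokens,
        "total_tokens": video_tokens + text_length,
    }


estimate = estimate_sequence_length(
    height=480, width=720, num_frames=49,
    latent_height=60, latent_width=90,
    time_compressed_rate=4, patch_size=2, text_length=226,
)
print(estimate)
```

Under these assumptions each sample is 13 latent frames of 30x45 patches, about 17.8k tokens including the text tokens, which helps explain why `train_micro_batch_size_per_gpu: 1` is used.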

Yes. For LoRA, a learning rate of 1e-4 to 1e-3 is fine, but for full-parameter fine-tuning use about 1e-5. We will update the config files and fine-tuning instructions soon.
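To make this concrete, here is a minimal sketch that checks a config's learning rate against the values suggested in this reply and returns a copy with the value overridden. It assumes the config has already been loaded into a plain nested dict with the same `deepspeed -> optimizer -> params -> lr` layout as the config posted earlier; the helper names and the `MAX_SUGGESTED_LR` table are hypothetical, and the reply gives no lower bound for full-parameter training, so only an upper bound is checked.

```python
import copy

# Upper bounds taken from the reply above; treat them as starting points,
# not hard limits.
MAX_SUGGESTED_LR = {"lora": 1e-3, "full": 1e-5}


def lr_within_suggestion(lr, mode):
    """Return True if lr does not exceed the suggested value for this mode."""
    return lr <= MAX_SUGGESTED_LR[mode]


def with_learning_rate(config, lr):
    """Return a deep copy of the config with the optimizer lr replaced."""
    new_config = copy.deepcopy(config)
    new_config["deepspeed"]["optimizer"]["params"]["lr"] = lr
    return new_config


# Only the relevant part of the posted config is reproduced here.
posted = {
    "deepspeed": {
        "optimizer": {
            "type": "sat.ops.FusedEmaAdam",
            "params": {"lr": 0.0002, "betas": [0.9, 0.95]},
        }
    }
}

posted_lr = posted["deepspeed"]["optimizer"]["params"]["lr"]
print(lr_within_suggestion(posted_lr, "lora"))  # within the LoRA range
print(lr_within_suggestion(posted_lr, "full"))  # too high for full-parameter

full_ft = with_learning_rate(posted, 1e-5)
```

The posted `lr: 0.0002` sits inside the LoRA range but is 20 times the value suggested for full-parameter fine-tuning.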

Are there other factors besides the learning rate? I am currently using a learning rate of 1e-5, but as training progresses I still see the spatial structure of the outputs gradually degrade. Looking forward to your reply!

Is the prompt you used, "spider making a web", very different from your SFT training data? And what is the total batch size? In theory, with a dataset of only 50 videos, too much training will make the model overfit the data, so that it generates nearly identical videos.
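The overfitting concern can be put in numbers by counting how many times training passes over the dataset. This is a minimal sketch with a hypothetical helper name; the total batch size for the 50-video run is not stated at this point in the thread, so two illustrative values are shown.

```python
def passes_over_dataset(train_iters, total_batch_size, dataset_size):
    """Approximate number of epochs: samples seen divided by dataset size."""
    return train_iters * total_batch_size / dataset_size


# 50 videos, as in the original report; batch sizes are illustrative.
for iters in (500, 4000):
    for batch_size in (1, 48):
        epochs = passes_over_dataset(iters, batch_size, dataset_size=50)
        print(f"iters={iters:5d} batch={batch_size:2d} -> {epochs:7.1f} passes")
```

Even at a total batch size of 1, 4000 iterations show each of the 50 videos 80 times, and larger batch sizes scale that count proportionally.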

The total batch size is 24*2 = 48, and I am now using a dataset of about 1 million videos ("100w"), after changing the dataset part. I will test again once training has run for more iterations. Thanks a lot!

The result at 4000 steps also looks fairly normal. What exactly do you mean by the distortion problem here?

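The distortion I mentioned at the beginning meant that the spatial structure was somewhat implausible. I have now trained for a few more days and just tested again, and the results look normal. Thanks.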

I'm also trying to fine-tune. So was the implausible spatial structure fixed by lowering the learning rate plus training for longer?

So far, that seems to be the case.

Hey everyone! I have a few questions on finetuning that I would love if you could answer:

Is a dataset size of 50-100 videos okay for teaching the model a single concept? Can we go lower?
How many total training steps are required for convergence assuming I have 50 videos using training batch size of 1? Do we really need 4000+ steps?
What initialization works best with LoRA layers? Is the default A = kaiming_uniform, B = 0 the best? Can we use Gaussian or other initializations supported in libraries like peft?
Do we need the FusedEmaAdam implementation? Do we need EMA at all? Is plain torch.optim.Adam okay for training?
Even after a somewhat successful training run, results for prompts that the model was finetuned on are okay, but for any other prompt I get weird-looking outputs with artifacts.
How much memory is required to finetune the 5B model? Is it possible to do on a single A100 GPU? If not, what can be optimized? I've tried VAE slicing and tiling but it still OOMs even with training batch size of 1.
Has anyone successfully trained a LoRA with lower rank than 128 producing good results?
What training batch size are you able to use comfortably on a single 80 GB GPU when finetuning the 2B model?
Any tips/techniques on speeding up training?

Thanks to everyone in advance! I might bother you with some more questions
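On the LoRA initialization question, the point of setting B = 0 is that the low-rank update starts at exactly zero, so the adapted model initially behaves like the base model. The following dependency-free sketch illustrates this with plain Python lists. It is not the repo's or peft's implementation; the function names are hypothetical, and the bound `1 / sqrt(in_features)` is what a Kaiming-uniform init with `a = sqrt(5)` reduces to, which is assumed here to be the default being asked about.

```python
import math
import random


def init_lora(in_features, out_features, rank, seed=0):
    """Return (A, B) with A ~ U(-bound, bound) and B = 0 (illustrative only)."""
    rng = random.Random(seed)
    bound = 1.0 / math.sqrt(in_features)
    a = [[rng.uniform(-bound, bound) for _ in range(in_features)]
         for _ in range(rank)]                    # shape: rank x in_features
    b = [[0.0] * rank for _ in range(out_features)]  # shape: out_features x rank
    return a, b


def lora_delta(a, b, alpha):
    """Compute the weight update (alpha / rank) * B @ A."""
    rank = len(a)
    in_features = len(a[0])
    scale = alpha / rank
    return [[scale * sum(b_row[r] * a[r][j] for r in range(rank))
             for j in range(in_features)]
            for b_row in b]


a, b = init_lora(in_features=8, out_features=4, rank=2)
delta = lora_delta(a, b, alpha=2.0)
starts_at_zero = all(value == 0.0 for row in delta for value in row)
print(starts_at_zero)
```

Any initialization of A preserves this property as long as B starts at zero; initializing both matrices with nonzero values would perturb the pretrained weights from the first step.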

THUDM / CogVideo

Subject of generated videos is spatially distorted after full-parameter finetune #116