THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Why does inference after LoRA fine-tuning produce results like this? #113

Open SCHfighting opened 1 month ago

SCHfighting commented 1 month ago

System Info / 系統信息

https://github.com/user-attachments/assets/e8c501fb-53d5-4377-9e27-0824e864123a

(screenshot attached)

The prompt is: In the video, we see a monkey sitting on a rock by a pond. The monkey is seen in various states of repose, with its reflection visible in the water. The scene is serene and peaceful, with the monkey's fur and the surrounding foliage adding to the tranquility. The lighting is soft, and the colors are muted, creating a calm atmosphere. The monkey appears to be in a natural habitat, possibly a park or wildlife sanctuary, and the setting suggests a quiet moment in the life of this animal.

Information / 问题信息

Reproduction / 复现过程

I followed the steps in the README exactly, using several videos I collected myself and captioned with CogVLM2-Video.

Expected behavior / 期待表现

I hope you can provide clear steps for LoRA fine-tuning and for loading the LoRA model at inference time. Thanks.

tengjiayan20 commented 1 month ago

You need to modify the inference config: keep the network_config the same as in the SFT config, i.e. add lora_config to the inference config as well. To avoid misunderstanding, we will publish an updated inference config for LoRA inference soon.
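For reference, a minimal sketch of the required change: the inference config's network_config must carry the same lora_config block as the SFT config (the block below is copied from the SFT configs posted later in this thread; the rank r must match whatever was used during fine-tuning):

network_config:
  target: dit_video_concat.DiffusionTransformer
  params:
    # ... keep all transformer params identical to the SFT config ...
    modules:
      # ... pos_embed_config, patch_embed_config, adaln_layer_config, final_layer_config unchanged ...
      lora_config: ## Using Lora
        target: sat.model.finetune.lora2.LoraMixin
        params:
          r: 128  # LoRA rank; must match the value used for fine-tuning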

SCHfighting commented 1 month ago

OK, thank you!

cly2625 commented 4 weeks ago

I added the parameter, but I got this result (see attached screenshot).

sft.yaml:

args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  experiment_name: lora-disney
  mode: finetune
  load: "CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 200
  eval_iters: 1
  eval_interval: 100
  eval_batch_size: 1
  save: ckpts
  save_interval: 100
  log_interval: 20
  train_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  valid_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: False
  fp16:
    enabled: True
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: ['all'] ## Using Lora
  log_keys:
    - txt

infer.yaml:

args:
  latent_channels: 16
  mode: inference
  load: "/data01/cly/project/CogVideo/sat/ckpts/lora-disney-08-15-10-45"
  batch_size: 1
  input_type: txt
  input_file: configs/test.txt
  sampling_num_frames: 13 # Must be 13, 11 or 9
  sampling_fps: 8
  fp16: True
  output_dir: outputs_lora_04_100/
  force_inference: True

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

tengjiayan20 commented 4 weeks ago

Have you checked the loss during fine-tuning? Could it be NaN?

cly2625 commented 4 weeks ago

Have you checked the loss during fine-tuning? Could it be NaN? The loss is acceptable.

The issue is occurring in sat/SwissArmyTransformer/sat/training/model_io.py at Line 224: model._save_checkpoint(save_dir, tag, client_state=client_state, exclude_frozen_parameters=True). I added exclude_frozen_parameters=True, but I encountered problems when loading the model for inference. How can I correctly load this part of the model weights?

I followed the modifications suggested in this thread: https://github.com/THUDM/CogVideo/issues/126#issuecomment-2286688314.

tengjiayan20 commented 2 weeks ago

Have you checked the loss during fine-tuning? Could it be NaN? The loss is acceptable.

The issue is occurring in sat/SwissArmyTransformer/sat/training/model_io.py at Line 224: model._save_checkpoint(save_dir, tag, client_state=client_state, exclude_frozen_parameters=True). I added exclude_frozen_parameters=True, but I encountered problems when loading the model for inference. How can I correctly load this part of the model weights?

I followed the modifications suggested in this thread: #126 (comment).

We do not recommend adding exclude_frozen_parameters=True unless you know how to recover the full model weights. We will release the recovery method in a future update.
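For anyone who has already saved a checkpoint this way, one possible recovery approach is to overlay the LoRA-only state dict onto the frozen base state dict and save the result as a full checkpoint. Below is a minimal sketch in plain PyTorch; this is not the official method, and the paths and the "module" key layout are assumptions that may need adjusting for your checkpoints:

import torch

# Placeholder paths: point these at your actual base and LoRA-only checkpoints.
base_ckpt_path = "CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt"
lora_ckpt_path = "ckpts/lora-disney-08-15-10-45/1000/mp_rank_00_model_states.pt"
merged_path = "ckpts/lora-disney-merged/1000/mp_rank_00_model_states.pt"

base = torch.load(base_ckpt_path, map_location="cpu")
lora = torch.load(lora_ckpt_path, map_location="cpu")

# SAT/DeepSpeed checkpoints usually store weights under a "module" key (assumption).
base_sd = base.get("module", base)
lora_sd = lora.get("module", lora)

# The LoRA-only checkpoint holds just the trainable parameters; overlay them on the
# full base state dict so every key the LoRA-enabled model expects is present.
base_sd.update(lora_sd)

torch.save({"module": base_sd}, merged_path)
print(f"Merged {len(lora_sd)} trainable tensors into a full state dict of {len(base_sd)} entries")

The merged file would then be placed in a directory with the same layout as the original save directory so that load: in infer.yaml can point at it, with lora_config still present in the network_config as described above.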

octopusszzy commented 2 weeks ago

I added the parameter, but I got this result (see attached screenshot).

sft.yaml:

args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  experiment_name: lora-disney
  mode: finetune
  load: "CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 200
  eval_iters: 1
  eval_interval: 100
  eval_batch_size: 1
  save: ckpts
  save_interval: 100
  log_interval: 20
  train_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  valid_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: False
  fp16:
    enabled: True
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: ['all'] ## Using Lora
  log_keys:
    - txt

denoiser_config:
  target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
  params:
    num_idx: 1000
    quantize_c_noise: False

  weighting_config:
    target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
  scaling_config:
    target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
  discretization_config:
    target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
    params:
      shift_scale: 3.0

network_config:
  target: dit_video_concat.DiffusionTransformer
  params:
    time_embed_dim: 512
    elementwise_affine: True
    num_frames: 49
    time_compressed_rate: 4
    latent_width: 90
    latent_height: 60
    num_layers: 30
    patch_size: 2
    in_channels: 16
    out_channels: 16
    hidden_size: 1920
    adm_in_channels: 256
    num_attention_heads: 30

  transformer_args:
    checkpoint_activations: True ## using gradient checkpointing
    vocab_size: 1
    max_sequence_length: 64
    layernorm_order: pre
    skip_init: false
    model_parallel_size: 1
    is_decoder: false

  modules:
    pos_embed_config:
      target: dit_video_concat.Basic3DPositionEmbeddingMixin
      params:
        text_length: 226
        height_interpolation: 1.875
        width_interpolation: 1.875

    lora_config: ## Using Lora
      target: sat.model.finetune.lora2.LoraMixin
      params:
        r: 128

    patch_embed_config:
      target: dit_video_concat.ImagePatchEmbeddingMixin
      params:
        text_hidden_size: 4096

    adaln_layer_config:
      target: dit_video_concat.AdaLNMixin
      params:
        qk_ln: True

    final_layer_config:
      target: dit_video_concat.FinalLayerMixin

conditioner_config:
  target: sgm.modules.GeneralConditioner
  params:
    emb_models:
      - is_trainable: false
        input_key: txt
        ucg_rate: 0.1
        target: sgm.modules.encoders.modules.FrozenT5Embedder
        params:
          model_dir: "CogVideoX-2b-sat/t5-v1_1-xxl"
          max_length: 226

first_stage_config:
  target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
  params:
    cp_size: 1
    ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt"
    ignore_keys: [ 'loss' ]

  loss_config:
    target: torch.nn.Identity

  regularizer_config:
    target: vae_modules.regularizers.DiagonalGaussianRegularizer

  encoder_config:
    target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
    params:
      double_z: true
      z_channels: 16
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1, 2, 2, 4 ]
      attn_resolutions: [ ]
      num_res_blocks: 3
      dropout: 0.0
      gather_norm: True

  decoder_config:
    target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
    params:
      double_z: True
      z_channels: 16
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1, 2, 2, 4 ]
      attn_resolutions: [ ]
      num_res_blocks: 3
      dropout: 0.0
      gather_norm: false

loss_fn_config:
  target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
  params:
    offset_noise_level: 0
    sigma_sampler_config:
      target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
      params:
        uniform_sampling: True
        num_idx: 1000
        discretization_config:
          target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
          params:
            shift_scale: 3.0

sampler_config:
  target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
  params:
    num_steps: 50
    verbose: True

  discretization_config:
    target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
    params:
      shift_scale: 3.0

  guider_config:
    target: sgm.modules.diffusionmodules.guiders.DynamicCFG
    params:
      scale: 6
      exp: 5
      num_steps: 50

infer.yaml:

args:
  latent_channels: 16
  mode: inference
  load: "/data01/cly/project/CogVideo/sat/ckpts/lora-disney-08-15-10-45"
  batch_size: 1
  input_type: txt
  input_file: configs/test.txt
  sampling_num_frames: 13 # Must be 13, 11 or 9
  sampling_fps: 8
  fp16: True
  output_dir: outputs_lora_04_100/
  force_inference: True

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

denoiser_config:
  target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
  params:
    num_idx: 1000
    quantize_c_noise: False

  weighting_config:
    target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
  scaling_config:
    target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
  discretization_config:
    target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
    params:
      shift_scale: 3.0

network_config:
  target: dit_video_concat.DiffusionTransformer
  params:
    time_embed_dim: 512
    elementwise_affine: True
    num_frames: 49
    time_compressed_rate: 4
    latent_width: 90
    latent_height: 60
    num_layers: 30
    patch_size: 2
    in_channels: 16
    out_channels: 16
    hidden_size: 1920
    adm_in_channels: 256
    num_attention_heads: 30

  transformer_args:
    vocab_size: 1
    max_sequence_length: 64
    layernorm_order: pre
    skip_init: false
    model_parallel_size: 1
    is_decoder: false

  modules:
    pos_embed_config:
      target: dit_video_concat.Basic3DPositionEmbeddingMixin
      params:
        text_length: 226
        height_interpolation: 1.875
        width_interpolation: 1.875

    lora_config: ## Using Lora
      target: sat.model.finetune.lora2.LoraMixin
      params:
        r: 128

    patch_embed_config:
      target: dit_video_concat.ImagePatchEmbeddingMixin
      params:
        text_hidden_size: 4096

    adaln_layer_config:
      target: dit_video_concat.AdaLNMixin
      params:
        qk_ln: True

    final_layer_config:
      target: dit_video_concat.FinalLayerMixin

conditioner_config:
  target: sgm.modules.GeneralConditioner
  params:
    emb_models:
      - is_trainable: false
        input_key: txt
        ucg_rate: 0.1
        target: sgm.modules.encoders.modules.FrozenT5Embedder
        params:
          model_dir: "/data01/cly/project/CogVideo/sat/CogVideoX-2b-sat/t5-v1_1-xxl"
          max_length: 226

first_stage_config:
  target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
  params:
    cp_size: 1
    ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt"
    ignore_keys: [ 'loss' ]

  loss_config:
    target: torch.nn.Identity

  regularizer_config:
    target: vae_modules.regularizers.DiagonalGaussianRegularizer

  encoder_config:
    target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
    params:
      double_z: true
      z_channels: 16
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1, 2, 2, 4 ]
      attn_resolutions: [ ]
      num_res_blocks: 3
      dropout: 0.0
      gather_norm: True

  decoder_config:
    target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
    params:
      double_z: True
      z_channels: 16
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1, 2, 2, 4 ]
      attn_resolutions: [ ]
      num_res_blocks: 3
      dropout: 0.0
      gather_norm: false

loss_fn_config:
  target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
  params:
    offset_noise_level: 0
    sigma_sampler_config:
      target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
      params:
        uniform_sampling: True
        num_idx: 1000
        discretization_config:
          target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
          params:
            shift_scale: 3.0

sampler_config:
  target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
  params:
    num_steps: 50
    verbose: True

  discretization_config:
    target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
    params:
      shift_scale: 3.0

  guider_config:
    target: sgm.modules.diffusionmodules.guiders.DynamicCFG
    params:
      scale: 6
      exp: 5
      num_steps: 50

Based on the results of SAT LoRA training, I implemented a CogVideoLoraLoaderMixin in diffusers myself; after converting the saved to_q, to_k, to_v, and to_out.0 weights to the diffusers format, inference still produces a similar distribution.

glide-the commented 3 days ago

Based on the results of SAT LoRA training, I implemented a CogVideoLoraLoaderMixin in diffusers myself; after converting the saved to_q, to_k, to_v, and to_out.0 weights to the diffusers format, inference still produces a similar distribution.

Could you open a PR for this CogVideoLoraLoaderMixin?
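For context on what such a conversion has to do: the SAT transformer typically applies LoRA to a fused query-key-value projection, whereas diffusers exposes separate to_q/to_k/to_v modules, so the fused LoRA pair has to be split. A minimal sketch of that splitting step in plain PyTorch follows; the checkpoint key names and paths on both sides are placeholders for illustration, not the real SAT or diffusers names:

import torch

def split_fused_qkv_lora(lora_A: torch.Tensor, lora_B: torch.Tensor):
    """Split one LoRA pair trained on a fused QKV linear into q/k/v pairs.

    Assumes the fused weight stacks the q, k, v blocks along dim 0, so lora_B has
    shape [3 * hidden_size, r] and lora_A has shape [r, hidden_size]. The update
    delta_W = lora_B @ lora_A then splits row-wise into the three projections,
    each sharing the same lora_A.
    """
    b_q, b_k, b_v = lora_B.chunk(3, dim=0)
    return {"to_q": (lora_A, b_q), "to_k": (lora_A, b_k), "to_v": (lora_A, b_v)}

# Hypothetical usage with placeholder key names and paths.
sat_lora = torch.load("lora_only_weights.pt", map_location="cpu")  # placeholder path
converted = {}
for i in range(30):  # num_layers: 30 in the 2B config above
    a = sat_lora[f"transformer.layers.{i}.attention.query_key_value.lora_A"]  # assumed key
    b = sat_lora[f"transformer.layers.{i}.attention.query_key_value.lora_B"]  # assumed key
    for name, (la, lb) in split_fused_qkv_lora(a, b).items():
        converted[f"transformer_blocks.{i}.attn1.{name}.lora_A.weight"] = la.clone()
        converted[f"transformer_blocks.{i}.attn1.{name}.lora_B.weight"] = lb

The to_out.0 projection is not fused, so converting it is typically just a key rename.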