Open SCHfighting opened 1 month ago
You need to modify the inference config: keep its network_config the same as in the SFT config, i.e. add lora_config to the inference config. To avoid confusion, we will publish a dedicated inference config for LoRA inference soon.
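One way to check that the two configs agree is to diff their network_config sections after loading them. The sketch below is only an illustration and is not part of the repository: `diff_config` is a hypothetical helper, the inline dictionaries stand in for the parsed YAML files, and treating `checkpoint_activations` as a training-only key that may differ is an assumption.

```python
from typing import AbstractSet, Any, Dict, List


def diff_config(
    train: Dict[str, Any],
    infer: Dict[str, Any],
    ignore: AbstractSet[str] = frozenset(),
    prefix: str = "",
) -> List[str]:
    """Return dotted paths where two nested config dicts disagree."""
    problems: List[str] = []
    for key in sorted(set(train) | set(infer)):
        if key in ignore:
            continue
        path = f"{prefix}{key}"
        if key not in infer:
            problems.append(f"missing in inference config: {path}")
        elif key not in train:
            problems.append(f"missing in training config: {path}")
        elif isinstance(train[key], dict) and isinstance(infer[key], dict):
            problems.extend(diff_config(train[key], infer[key], ignore, path + "."))
        elif train[key] != infer[key]:
            problems.append(f"value differs at {path}: {train[key]!r} != {infer[key]!r}")
    return problems


# Stand-ins for the `network_config.params` sections of the two YAML files.
sft_network = {
    "num_layers": 30,
    "hidden_size": 1920,
    "transformer_args": {"checkpoint_activations": True, "layernorm_order": "pre"},
    "modules": {
        "lora_config": {
            "target": "sat.model.finetune.lora2.LoraMixin",
            "params": {"r": 128},
        },
    },
}
infer_network = {
    "num_layers": 30,
    "hidden_size": 1920,
    "transformer_args": {"layernorm_order": "pre"},
    "modules": {},  # lora_config was not copied over
}

report = diff_config(sft_network, infer_network, ignore={"checkpoint_activations"})
print(report)
```

An empty report would suggest the inference config mirrors the training one; any entry under `modules` is worth fixing before loading a LoRA checkpoint.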
OK, thank you!
I added the parameter, but this is what happened:
stf.yaml:

args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  experiment_name: lora-disney
  mode: finetune
  load: "CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 200
  eval_iters: 1
  eval_interval: 100
  eval_batch_size: 1
  save: ckpts
  save_interval: 100
  log_interval: 20
  train_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  valid_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: False
  fp16:
    enabled: True
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: ['all'] ## Using Lora
  log_keys:
    - txt

  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: False

      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: True
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30

      transformer_args:
        checkpoint_activations: True ## using gradient checkpointing
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false

      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875

        lora_config: ## Using Lora
          target: sat.model.finetune.lora2.LoraMixin
          params:
            r: 128

        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096

        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: True

        final_layer_config:
          target: dit_video_concat.FinalLayerMixin

  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "CogVideoX-2b-sat/t5-v1_1-xxl"
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt"
      ignore_keys: [ 'loss' ]

      loss_config:
        target: torch.nn.Identity

      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer

      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: True

      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: True
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: false

  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: True
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0

  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: True

      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
infer.yaml:

args:
  latent_channels: 16
  mode: inference
  load: "/data01/cly/project/CogVideo/sat/ckpts/lora-disney-08-15-10-45"
  batch_size: 1
  input_type: txt
  input_file: configs/test.txt
  sampling_num_frames: 13 # Must be 13, 11 or 9
  sampling_fps: 8
  fp16: True
  output_dir: outputs_lora_04_100/
  force_inference: True

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: False

      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: True
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30

      transformer_args:
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false

      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875

        lora_config: ## Using Lora
          target: sat.model.finetune.lora2.LoraMixin
          params:
            r: 128

        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096

        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: True

        final_layer_config:
          target: dit_video_concat.FinalLayerMixin

  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "/data01/cly/project/CogVideo/sat/CogVideoX-2b-sat/t5-v1_1-xxl"
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt"
      ignore_keys: [ 'loss' ]

      loss_config:
        target: torch.nn.Identity

      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer

      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: True

      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: True
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: false

  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: True
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0

  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: True

      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
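Several numbers in the two configs appear to be tied together by the VAE compression factors. As a rough consistency check, the sketch below assumes that the latent frame count is (num_frames - 1) / time_compressed_rate + 1 and that the VAE downsamples each spatial dimension by 8. Neither formula is stated in this thread, so treat both as assumptions; `latent_frames` and `SPATIAL_FACTOR` are names made up for the example.

```python
def latent_frames(num_frames: int, time_compressed_rate: int) -> int:
    """Latent frame count, assuming the first frame is kept as-is and the
    remaining frames are compressed in groups of `time_compressed_rate`."""
    if (num_frames - 1) % time_compressed_rate != 0:
        raise ValueError("num_frames - 1 must be divisible by time_compressed_rate")
    return (num_frames - 1) // time_compressed_rate + 1


# Values copied from the configs above.
num_frames = 49
time_compressed_rate = 4
video_size = (480, 720)  # (height, width)

SPATIAL_FACTOR = 8  # assumed VAE spatial downsampling factor

frames = latent_frames(num_frames, time_compressed_rate)
latent_hw = (video_size[0] // SPATIAL_FACTOR, video_size[1] // SPATIAL_FACTOR)
print(frames, latent_hw)
```

Under these assumptions, the result lines up with sampling_num_frames: 13, latent_height: 60 and latent_width: 90, and the other allowed values 11 and 9 would correspond to 41 and 33 input frames.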
Have you checked your loss during fine-tuning? Could it be NaN?
The loss is acceptable.
The issue occurs in `sat/SwissArmyTransformer/sat/training/model_io.py` at line 224: `model._save_checkpoint(save_dir, tag, client_state=client_state, exclude_frozen_parameters=True)`. I added `exclude_frozen_parameters=True`, but then ran into problems when loading the model for inference. How can I correctly load this part of the model weights?
I followed the modifications suggested in this thread: https://github.com/THUDM/CogVideo/issues/126#issuecomment-2286688314.
We do not recommend adding exclude_frozen_parameters=True unless you know how to recover the full model weights. We will document the recovery method in a future update.
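Until an official recovery method is published, one plausible approach is to overlay the partial checkpoint on the base weights. The sketch below is only an illustration under stated assumptions: that a checkpoint saved with exclude_frozen_parameters=True holds just the trainable tensors, that the frozen tensors can be taken from the base checkpoint, and that key names either match or can be mapped (LoRA wrappers may rename keys). `merge_state_dicts` is a hypothetical helper, and the toy dictionaries and key names are made up.

```python
from typing import Any, Callable, Dict, Optional


def merge_state_dicts(
    base: Dict[str, Any],
    partial: Dict[str, Any],
    rename: Optional[Callable[[str], str]] = None,
) -> Dict[str, Any]:
    """Overlay a partial (trainable-only) state dict on a full base one.

    `rename` optionally maps base key names to the names the fine-tuned
    model expects; whether it is needed depends on the LoRA wrapper.
    """
    merged = {(rename(k) if rename else k): v for k, v in base.items()}
    overlap = set(merged) & set(partial)
    merged.update(partial)  # trained tensors take precedence over base ones
    print(f"{len(partial) - len(overlap)} new keys, {len(overlap)} overwritten")
    return merged


# Toy stand-ins: strings instead of tensors, made-up key names.
base = {"layer.0.weight": "frozen_w0", "layer.1.weight": "frozen_w1"}
partial = {"layer.0.lora_A": "trained_A", "layer.0.lora_B": "trained_B"}

full = merge_state_dicts(base, partial)
```

With real checkpoints the dictionaries would come from the saved files, and loading the merged result with strict key checking is a reasonable way to surface any keys that are still missing or misnamed.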
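Based on the results of SAT LoRA training, I implemented a CogVideoLoraLoaderMixin in diffusers myself. After converting the format of the saved to_q, to_k, to_v and to_out.0 weights, inference produces a similar distribution.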
Could you open a PR for this CogVideoLoraLoaderMixin?
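For readers unfamiliar with what such a conversion has to preserve: a LoRA adapter adds a low-rank update to each adapted projection (here to_q, to_k, to_v and to_out.0). Below is a minimal sketch of the standard merge, W' = W + (alpha / r) * B A, using plain Python lists. The shapes, the scaling convention and the function names are illustrative assumptions and are not taken from the SAT or diffusers code.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (m x k) matrix by a (k x n) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * (B @ A), the usual LoRA merge."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy example: a 2x2 weight with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]  # shape (out_features, r)
lora_a = [[0.5, 0.5]]    # shape (r, in_features)

merged = merge_lora(w, lora_b, lora_a, alpha=1.0, r=1)
print(merged)
```

A converter between two frameworks mainly has to map each pair of low-rank matrices to the right layer name and keep the same scaling, so that the merged weights come out identical on both sides.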
System Info
https://github.com/user-attachments/assets/e8c501fb-53d5-4377-9e27-0824e864123a
The prompt is: In the video, we see a monkey sitting on a rock by a pond. The monkey is seen in various states of repose, with its reflection visible in the water. The scene is serene and peaceful, with the monkey's fur and the surrounding foliage adding to the tranquility. The lighting is soft, and the colors are muted, creating a calm atmosphere. The monkey appears to be in a natural habitat, possibly a park or wildlife sanctuary, and the setting suggests a quiet moment in the life of this animal.
Information
Reproduction
I followed the steps in the README exactly, using a number of videos I collected myself and captioning them with CogVLM2-Video.
Expected behavior
I hope clear steps can be provided for LoRA fine-tuning and for loading the LoRA model at inference time. Thank you.