Why does inference after LoRA fine-tuning produce this result? #113

Open SCHfighting opened 1 month ago

SCHfighting commented 1 month ago

The prompt is: In the video, we see a monkey sitting on a rock by a pond. The monkey is seen in various states of repose, with its reflection visible in the water. The scene is serene and peaceful, with the monkey's fur and the surrounding foliage adding to the tranquility. The lighting is soft, and the colors are muted, creating a calm atmosphere. The monkey appears to be in a natural habitat, possibly a park or wildlife sanctuary, and the setting suggests a quiet moment in the life of this animal.

tengjiayan20 commented 1 month ago

You need to modify the inference config: keep the network_config the same as in the SFT config, i.e. add lora_config to the inference config. To avoid confusion, we will release a new inference config for LoRA inference soon.

[Image: WX20240812-140332]
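The advice above amounts to making the inference config's network_config identical to the one used for fine-tuning. Here is a minimal sketch of that step, using plain dictionaries in place of parsed YAML. The nested key names (`model`, `network_config`, `params`, `lora_config`) and the value `r: 128` are illustrative assumptions, not the repository's confirmed schema.

```python
import copy


def sync_network_config(sft_cfg: dict, infer_cfg: dict) -> dict:
    """Return a copy of infer_cfg whose network_config matches sft_cfg's."""
    merged = copy.deepcopy(infer_cfg)
    sft_net = sft_cfg["model"]["network_config"]
    # Overwrite the whole network_config so that lora_config (and anything
    # else the fine-tuning run relied on) is present at inference time.
    merged.setdefault("model", {})["network_config"] = copy.deepcopy(sft_net)
    return merged


# Hypothetical configs standing in for the parsed SFT and inference YAML files.
sft_cfg = {"model": {"network_config": {"params": {"lora_config": {"r": 128}}}}}
infer_cfg = {"model": {"network_config": {"params": {}}}}

fixed = sync_network_config(sft_cfg, infer_cfg)
```

The function returns a new dictionary and leaves both inputs untouched, so the original inference config can still be used for non-LoRA checkpoints.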

SCHfighting commented 1 month ago

OK, thank you!

cly2625 commented 4 weeks ago

[Image: Snipaste_2024-08-15_11-19-45]

I added the parameter, but this is what I got.

stf.yaml:

args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  experiment_name: lora-disney
  mode: finetune
  load: "CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 200
  eval_iters: 1
  eval_interval: 100
  eval_batch_size: 1
  save: ckpts
  save_interval: 100
  log_interval: 20
  train_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  valid_data: ["/data01/cly/dataset/sat_cogvideox_cly/selected_100"]
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: False
  fp16:
    enabled: True
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: ['all'] ## Using Lora
  log_keys:

infer.yaml:

args:
  latent_channels: 16
  mode: inference
  load: "/data01/cly/project/CogVideo/sat/ckpts/lora-disney-08-15-10-45"
  batch_size: 1
  input_type: txt
  input_file: configs/test.txt
  sampling_num_frames: 13 # Must be 13, 11 or 9
  sampling_fps: 8
  fp16: True
  output_dir: outputs_lora_04_100/
  force_inference: True

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:

tengjiayan20 commented 4 weeks ago

Have you checked your loss during fine-tuning? Could it be NaN?
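One quick way to answer this question is to scan the logged loss values for non-finite entries. The sketch below assumes the losses have already been extracted from the training log into a list of floats. The function name and the sample values are hypothetical.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical logged values, one per logging interval.
logged = [0.31, 0.27, float("nan"), 0.25]
bad_step = first_non_finite(logged)
```

A result of `None` means every logged value was finite, so the cause of the bad samples is more likely to lie elsewhere, for example in how the checkpoint is saved or loaded.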

cly2625 commented 4 weeks ago

"Have you checked your loss during fine-tuning? Could it be NaN?"

The loss is acceptable.

The issue arises in sat/SwissArmyTransformer/sat/training/ at line 224: model._save_checkpoint(save_dir, tag, client_state=client_state, exclude_frozen_parameters=True). I added exclude_frozen_parameters=True, but then ran into problems loading the model for inference. How can I correctly load these model weights?

I followed the modifications suggested in this thread:

tengjiayan20 commented 2 weeks ago

We do not recommend adding exclude_frozen_parameters=True unless you know how to recover the full model weights. We will publish the recovery method in a future update.
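The general idea behind recovering full weights from a checkpoint that excludes frozen parameters is to overlay the saved trainable parameters on the base model's weights. The sketch below illustrates only that idea with plain dictionaries. It is not the maintainers' recovery method, which has not been published in this thread, and the parameter names are hypothetical. A real checkpoint may nest its state dict differently.

```python
def recover_full_state_dict(base_state: dict, trainable_state: dict) -> dict:
    """Overlay a trainable-only checkpoint on top of the base model weights."""
    full = dict(base_state)        # frozen parameters come from the base model
    full.update(trainable_state)   # trained parameters override or extend them
    return full


# Hypothetical parameter names; the values would be tensors in practice.
base = {"layer.0.weight": "frozen-W0", "layer.1.weight": "frozen-W1"}
trained = {"layer.0.lora_A": "A0", "layer.0.lora_B": "B0"}

full = recover_full_state_dict(base, trained)
```

The merged dictionary can then be loaded into a model that was built with the same LoRA configuration as the fine-tuning run, so that every key has a matching parameter.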

octopusszzy commented 2 weeks ago

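Based on the output of SAT LoRA training, I implemented my own CogVideoLoraLoaderMixin in diffusers. After converting the saved to_q, to_k, to_v, and to_out.0 weights to the matching format, inference produces a similar distribution.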
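When porting LoRA weights between frameworks, it can help to check the merge arithmetic independently of any loader. The sketch below applies the standard LoRA update W' = W + (alpha / r) * B A to small nested-list matrices in pure Python. Whether the SAT checkpoint follows this exact scaling convention and matrix orientation is an assumption to verify, and all names and values here are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for nested-list matrices."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # out_features x in_features
A = [[1.0, 2.0]]              # rank x in_features (rank = 1)
B = [[0.5], [1.0]]            # out_features x rank

merged = merge_lora(W, A, B, alpha=2.0, rank=1)
```

Comparing a merged weight computed this way against the converted weight for a single layer such as to_q can show whether a transposed matrix or a missing scale factor crept in during conversion.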

glide-the commented 3 days ago

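"Based on the output of SAT LoRA training, I implemented my own CogVideoLoraLoaderMixin in diffusers. After converting the saved to_q, to_k, to_v, and to_out.0 weights to the matching format, inference produces a similar distribution."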
