超过一分钟的长视频合成后，视频后半部分黑屏

GPU: NVIDIA H20 音频长度: 1分30秒音频格式: wav 图片格式: png 图片大小: 208K, 525x526, 25 fps, 25 tbr, 25 tbn

分支: main 配置:

configs/prompts/animation_acc.yaml

dependency models

pretrained_base_model_path: "./pretrained_weights/sd-image-variations-diffusers/" pretrained_vae_path: "./pretrained_weights/sd-vae-ft-mse/" audio_model_path: "./pretrained_weights/audio_processor/whisper_tiny.pt"

echo mimic checkpoint

denoising_unet_path: "./pretrained_weights/denoising_unet_acc.pth" reference_unet_path: "./pretrained_weights/reference_unet.pth" face_locator_path: "./pretrained_weights/face_locator.pth" motion_module_path: "./pretrained_weights/motion_module_acc.pth"

deonise model configs

inference_config: "./configs/inference/inference_v2.yaml" weight_dtype: 'fp16'

test cases

test_cases: "./assets/test_imgs/test.png":

"./assets/test_audios/test.wav"

configs/inference/inference_v2.yaml

unet_additional_kwargs: use_inflated_groupnorm: true unet_use_cross_frame_attention: false unet_use_temporal_attention: false use_motion_module: true cross_attention_dim: 384 motion_module_resolutions:

1
2
4
8 motion_module_mid_block: true motion_module_decoder_only: false motion_module_type: Vanilla motion_module_kwargs: num_attention_heads: 8 num_transformer_block: 1 attention_block_types:
- Temporal_Self
- Temporal_Self temporal_position_encoding: true temporal_position_encoding_max_len: 32 temporal_attention_dim_div: 1

noise_scheduler_kwargs: beta_start: 0.00085 beta_end: 0.012 beta_schedule: "linear" clip_sample: false steps_offset: 1

Zero-SNR params

prediction_type: "v_prediction" rescale_betas_zero_snr: True timestep_spacing: "trailing"

sampler: DDIM

antgroup / echomimic