BadToBest / EchoMimic

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning
https://badtobest.github.io/echomimic.html
Apache License 2.0

Second half of the video is black when synthesizing a long video (over one minute) #159

Open TonyEiven opened 1 month ago

TonyEiven commented 1 month ago

GPU: NVIDIA H20
Audio length: 1 min 30 s
Audio format: wav
Image format: png
Image size: 208K, 525x526, 25 fps, 25 tbr, 25 tbn

Branch: main
Configuration:

configs/prompts/animation_acc.yaml

# dependency models
pretrained_base_model_path: "./pretrained_weights/sd-image-variations-diffusers/"
pretrained_vae_path: "./pretrained_weights/sd-vae-ft-mse/"
audio_model_path: "./pretrained_weights/audio_processor/whisper_tiny.pt"

# echo mimic checkpoint
denoising_unet_path: "./pretrained_weights/denoising_unet_acc.pth"
reference_unet_path: "./pretrained_weights/reference_unet.pth"
face_locator_path: "./pretrained_weights/face_locator.pth"
motion_module_path: "./pretrained_weights/motion_module_acc.pth"

# denoise model configs
inference_config: "./configs/inference/inference_v2.yaml"
weight_dtype: 'fp16'

# test cases
test_cases:
  "./assets/test_imgs/test.png":

configs/inference/inference_v2.yaml

unet_additional_kwargs:
  use_inflated_groupnorm: true
  unet_use_cross_frame_attention: false
  unet_use_temporal_attention: false
  use_motion_module: true
  cross_attention_dim: 384
  motion_module_resolutions:

noise_scheduler_kwargs:
  beta_start: 0.00085
  beta_end: 0.012
  beta_schedule: "linear"
  clip_sample: false
  steps_offset: 1
  # Zero-SNR params
  prediction_type: "v_prediction"
  rescale_betas_zero_snr: True
  timestep_spacing: "trailing"

sampler: DDIM

nitinmukesh commented 1 month ago

I believe you are using the accelerated model for inference. In infer_audio2vid_acc.py there is this argument:

parser.add_argument("-L", type=int, default=1200)

Try setting 'L' to a much larger value, such as 2100, and check whether the second half of the video is still black.
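
A rough way to pick a value, sketched under two assumptions that the thread does not confirm: that `-L` caps the number of generated frames, and that the script renders at a fixed frame rate (24 fps is assumed here; check the script's own setting). The helper names `frames_needed` and `exceeds_cap` are hypothetical and are not part of EchoMimic.

```python
import math


def frames_needed(duration_s, fps):
    """Number of video frames required to cover the whole audio clip."""
    return math.ceil(duration_s * fps)


def exceeds_cap(duration_s, fps, max_frames):
    """True if the audio needs more frames than the cap allows.

    Assumption: max_frames plays the role of the -L argument.
    """
    return frames_needed(duration_s, fps) > max_frames


if __name__ == "__main__":
    duration_s = 90   # audio length reported in the issue (1 min 30 s)
    default_cap = 1200  # default of -L quoted above

    for fps in (24, 25):  # assumed candidate frame rates
        need = frames_needed(duration_s, fps)
        over = exceeds_cap(duration_s, fps, default_cap)
        print(f"{fps} fps: need {need} frames, exceeds default cap: {over}")
```

If these assumptions hold, 90 s of audio at 24 fps needs 2160 frames, so the default of 1200 would cover less than a minute, and a value somewhat above 2100 may be needed. With the parser line quoted above, the value can be passed on the command line as `-L 2160`.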