Closed: 1059692261 closed this issue 1 year ago
You could also just cut those segments out.
Just a joke, don't mind me.
Hi, thanks for your attention and the great observation. Methods that require a template video as input tend to face the problem you describe. In my view, it may have several causes. First, the input template video used during inference can affect the generated result. In Wav2Lip, for example, a careful look shows that the t-th frame of the input template video is used as the reference image to generate the t-th output frame, so the lip shape of the t-th input frame can leak into the generated lip shape. This is likely caused by the skip connections of their U-Net-based network. Our framework does not use this design and avoids the issue. Another cause may be an inaccurate audio-to-landmark/video mapping when the audio is silent, which could be related to the training dataset containing very few silent segments. Adding such data might help, but I am not sure it will work, you can try it~
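If you want to check how much silence your training set actually contains (or pick silent clips to add), a simple RMS-energy gate is usually enough. Below is a minimal sketch; the 16 kHz sample rate, 40 ms frame length, and the threshold are illustrative assumptions, not values from this repo:

```python
import numpy as np

def silent_frames(audio, frame_len=640, rms_thresh=0.01):
    """Flag audio frames whose RMS energy falls below a threshold.

    Hypothetical helper: `frame_len` (40 ms at 16 kHz) and
    `rms_thresh` are illustrative values, tune them for your data.
    """
    n = len(audio) // frame_len
    frames = audio[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms < rms_thresh

# Example: 1 s of a 440 Hz tone followed by 1 s of silence.
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([0.5 * np.sin(2 * np.pi * 440 * t),
                        np.zeros(sr)])
mask = silent_frames(audio)
# First half of the frames are voiced, second half silent.
```

A mask like this can be used either to measure the fraction of silence in your corpus or to deliberately include silent segments when fine-tuning.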
Thanks for the reply!
Thank you very much for open-sourcing such an excellent project! I ran inference tests with my own video and audio and found an issue similar to Wav2Lip: during silent segments where the audio stops, the character's mouth still follows the original mouth shape (the mouth may be closed, but the corners still twitch along with the original mouth movements). Do you have any ideas on how to solve this? Would adding such silent data during training improve the problem?