Generating talking person portraits with arbitrary speech audio is a crucial problem in the fields of digital humans and the metaverse. A modern talking face generation method is expected to achieve generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, the neural radiance field (NeRF) has become a popular rendering technique in this field, since it can achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video. However, several challenges remain for NeRF-based methods: 1) for lip synchronization, it is hard to generate a long facial motion sequence with high temporal consistency and audio-lip accuracy; 2) for video quality, because the renderer is trained on limited data, it is vulnerable to out-of-domain input conditions and occasionally produces bad rendering results; 3) for system efficiency, the slow training and inference speed of the vanilla NeRF severely obstruct its use in real-world applications. In this paper, we propose GeneFace++ to handle these challenges by 1) utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process; 2) proposing a landmark locally linear embedding method that regularizes outliers in the predicted motion sequence to avoid robustness issues; 3) designing a computationally efficient NeRF-based motion-to-video renderer that achieves fast training and real-time inference. With these designs, GeneFace++ becomes the first NeRF-based method to achieve stable, real-time talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in both subjective and objective evaluations. Video samples are available at https://genefaceplusplus.github.io .
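The landmark locally linear embedding step mentioned above can be illustrated with a minimal sketch. The idea, under the standard LLE reconstruction formulation, is to project a predicted landmark frame onto the local linear patch spanned by its nearest neighbors in the training landmark database, pulling out-of-domain predictions back toward the training distribution. The function name, the neighbor count `k`, and the regularization constant are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def lle_project(pred, train_db, k=10):
    """Project a predicted landmark vector onto the local linear patch of
    its k nearest training frames (illustrative sketch, not the paper's code).

    pred:     (D,) predicted landmark vector for one frame.
    train_db: (N, D) database of ground-truth training landmarks.
    """
    # 1) Find the k nearest training frames to the prediction.
    dists = np.linalg.norm(train_db - pred, axis=1)
    neighbors = train_db[np.argsort(dists)[:k]]      # (k, D)

    # 2) Solve for barycentric weights w minimizing ||pred - w @ neighbors||
    #    subject to sum(w) = 1 (the standard LLE reconstruction step).
    Z = neighbors - pred                             # shift so pred is the origin
    G = Z @ Z.T                                      # (k, k) local Gram matrix
    G += np.eye(k) * 1e-3 * np.trace(G)              # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()

    # 3) The regularized landmark is the weighted combination of neighbors,
    #    i.e., the projection of pred onto the local linear patch.
    return w @ neighbors
```

An outlier frame far from any training pose is thereby replaced by a plausible combination of observed poses, which is what makes the renderer robust to out-of-domain motion predictions.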