Generating talking person portraits with arbitrary speech audio is a crucial problem in the fields of digital humans and the metaverse. A modern talking face generation method is expected to achieve generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, the neural radiance field (NeRF) has become a popular rendering technique in this field, since it can achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video. However, several challenges remain for NeRF-based methods: 1) for lip synchronization, it is hard to generate a long facial motion sequence with high temporal consistency and audio-lip accuracy; 2) for video quality, because the renderer is trained on limited data, it is vulnerable to out-of-domain input conditions and occasionally produces bad rendering results; 3) for system efficiency, the slow training and inference speed of the vanilla NeRF severely obstruct its use in real-world applications. In this paper, we propose GeneFace++ to handle these challenges by 1) utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process; 2) proposing a landmark locally linear embedding method that regularizes outliers in the predicted motion sequence to avoid robustness issues; 3) designing a computationally efficient NeRF-based motion-to-video renderer that achieves fast training and real-time inference. With these designs, GeneFace++ becomes the first NeRF-based method to achieve stable, real-time talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in both subjective and objective evaluations. Video samples are available at https://genefaceplusplus.github.io .
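The landmark locally linear embedding step mentioned above can be illustrated with a minimal sketch. The idea, under the standard LLE reconstruction formulation, is to project a predicted landmark frame onto the local linear patch spanned by its nearest neighbors in the training landmark database, pulling out-of-domain predictions back toward the training distribution. The function name, the neighbor count `k`, and the regularization constant are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def lle_project(pred, train_db, k=10):
    """Project a predicted landmark vector onto the local linear patch of
    its k nearest training frames (illustrative sketch, not the paper's code).

    pred:     (D,) predicted landmark vector for one frame.
    train_db: (N, D) database of ground-truth training landmarks.
    """
    # 1) Find the k nearest training frames to the prediction.
    dists = np.linalg.norm(train_db - pred, axis=1)
    neighbors = train_db[np.argsort(dists)[:k]]      # (k, D)

    # 2) Solve for barycentric weights w minimizing ||pred - w @ neighbors||
    #    subject to sum(w) = 1 (the standard LLE reconstruction step).
    Z = neighbors - pred                             # shift so pred is the origin
    G = Z @ Z.T                                      # (k, k) local Gram matrix
    G += np.eye(k) * 1e-3 * np.trace(G)              # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()

    # 3) The regularized landmark is the weighted combination of neighbors,
    #    i.e., the projection of pred onto the local linear patch.
    return w @ neighbors
```

An outlier frame far from any training pose is thereby replaced by a plausible combination of observed poses, which is what makes the renderer robust to out-of-domain motion predictions.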