Hello! The work is impressive. I wonder whether it would be feasible to use the model with real-time generated TTS to produce realistic facial animation on a 3D face model in Unity.
Hi, we didn't test the model in a real-time setting. In principle, the model first obtains the global context of the audio (about 4 s in our experiments) and then autoregressively synthesizes the 3D facial animation. That means performance may drop if you only provide a small window of audio in a real-time setting. This is a limitation that needs further exploration. Previous works such as VOCA and MeshTalk may be better suited to real-time applications, as they adopt small audio windows.
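To make the latency implication concrete, here is a minimal buffering sketch (not this repo's API): `synthesize_animation`, the 16 kHz sample rate, the 30 fps output rate, and the mesh size are all illustrative assumptions. It accumulates streaming TTS audio and only invokes the model once a full ~4 s window is available, which shows why the animation would lag the audio by at least that much in a real-time pipeline.

```python
import numpy as np

SAMPLE_RATE = 16000                 # assumed input sample rate
CONTEXT_SECONDS = 4.0               # ~4 s of global context, per the answer above
CONTEXT_SAMPLES = int(SAMPLE_RATE * CONTEXT_SECONDS)

def synthesize_animation(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the model's audio-to-animation call.
    Hypothetical output: (num_frames, num_vertices, 3) vertex positions."""
    num_frames = int(len(audio) / SAMPLE_RATE * 30)   # assume 30 fps animation
    return np.zeros((num_frames, 5023, 3))            # dummy mesh frames

_buffer = np.zeros(0, dtype=np.float32)

def on_tts_chunk(chunk: np.ndarray) -> None:
    """Accumulate streaming TTS audio and run the model only once a full
    ~4 s context is buffered, so the model sees the context it expects."""
    global _buffer
    _buffer = np.concatenate([_buffer, chunk.astype(np.float32)])
    while len(_buffer) >= CONTEXT_SAMPLES:
        frames = synthesize_animation(_buffer[:CONTEXT_SAMPLES])
        _buffer = _buffer[CONTEXT_SAMPLES:]
        # hand `frames` off to the renderer, e.g. stream them to Unity
        print(f"generated {frames.shape[0]} animation frames")

# Example: feed ten 0.5 s chunks of silence from a fake TTS stream.
for _ in range(10):
    on_tts_chunk(np.zeros(SAMPLE_RATE // 2, dtype=np.float32))
```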