Fictionarry / TalkingGaussian

[ECCV'24] TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
https://fictionarry.github.io/TalkingGaussian/

Code Release Estimate #3

Open nazmicancalik opened 3 weeks ago

nazmicancalik commented 3 weeks ago

Hey guys,

Great work! I wanted to kindly ask for an update on the training code and the weights, if possible. Would love to recreate your work :) Looking forward!

Best

Fictionarry commented 3 weeks ago

Hi, the code is updated now :)

johndpope commented 3 weeks ago

nice.

Is there a concrete reason for using DeepSpeech vs. Wav2Vec?

https://arxiv.org/pdf/2404.10667 - it seems like Microsoft has solved this lipsync problem with CAPP (though it is not open source). I messaged Microsoft asking them to open-source the CAPP model (the email is on the VASA whitepaper):

> • Audio-pose alignment. Measuring the alignment between the generated head poses and input audio is not trivial and there are no well-established metrics. A few recent studies [[72](https://ar5iv.labs.arxiv.org/html/2404.10667#bib.bib72), [50](https://ar5iv.labs.arxiv.org/html/2404.10667#bib.bib50)] employed the Beat Align Score [[43](https://ar5iv.labs.arxiv.org/html/2404.10667#bib.bib43)] to evaluate audio-pose alignment. However, this metric is not optimal because the concept of a “beat” in the context of natural speech and human head motion is ambiguous. In this work, we introduce a new data-driven metric called Contrastive Audio and Pose Pretraining (CAPP) score. Inspired by CLIP [[38](https://ar5iv.labs.arxiv.org/html/2404.10667#bib.bib38)], we jointly train a pose sequence encoder and an audio sequence encoder and predict whether the input pose sequence and audio are paired. The audio encoder is initialized from a pretrained Wav2Vec2 network [[2](https://ar5iv.labs.arxiv.org/html/2404.10667#bib.bib2)] and the pose encoder is a randomly initialized 6-layer transformer network. The input window size is 3 seconds. Our CAPP model is trained on 2K hours of real-life audio and pose sequences, and demonstrates a robust capability to assess the degree of synchronization between audio inputs and generated poses (see Sec. [4.3](https://ar5iv.labs.arxiv.org/html/2404.10667#S4.SS3)).
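For intuition, here is a minimal sketch of the CLIP-style contrastive objective the quote describes, with hypothetical encoder output shapes and names (not the authors' actual CAPP implementation):

```python
import torch
import torch.nn.functional as F

def capp_style_loss(audio_emb: torch.Tensor, pose_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, pose_emb: (batch, dim) outputs of the audio and pose sequence encoders."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    pose_emb = F.normalize(pose_emb, dim=-1)
    logits = audio_emb @ pose_emb.t() / temperature            # pairwise cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: the matched audio/pose window is the positive, all others are negatives.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```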

I ran the vanilla install with a new environment - the grid encoder complains it needs https://github.com/NVIDIA/cub, which has been superseded. Also getting headaches with Gaussian rendering not working with CUDA 12: RuntimeError: The detected CUDA version (12.5) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions.

Fictionarry commented 2 weeks ago

> I ran the vanilla install with a new environment - the grid encoder complains it needs https://github.com/NVIDIA/cub, which has been superseded. Also getting headaches with Gaussian rendering not working with CUDA 12: RuntimeError: The detected CUDA version (12.5) mismatches the version that was used to compile PyTorch (11.3).

Hi, DeepSpeech, Wav2Vec and HuBERT are all used as basic audio feature extractors to obtain robust and generalizable audio representations, since they are pre-trained on other audio tasks. We use DeepSpeech in our main experiments just for a fair comparison, as most previous methods also use it. Besides, it seems that CAPP is designed to address audio-driven head pose generation, not lip-sync.
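If you want to try a different extractor, here is a minimal sketch of pulling Wav2Vec2 features with HuggingFace transformers (an illustrative assumption; the repo's own preprocessing scripts may differ in checkpoint choice and frame alignment):

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Hypothetical checkpoint choice; any pretrained Wav2Vec2 model works the same way.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def extract_audio_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: mono waveform sampled at 16 kHz, shape (num_samples,)."""
    inputs = processor(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values)
    # (1, num_frames, 768) hidden states, roughly 50 feature frames per second.
    return out.last_hidden_state
```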

The deprecation of CUB in recent CUDA versions may not be easy to work around, as CUB is also bundled with the Gaussian Splatting components. You may need to switch to an older CUDA version (e.g. 11.7) in both the global and conda environments to build the extensions.
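As a quick sanity check before rebuilding, you can compare the CUDA version PyTorch was compiled with against the nvcc on your PATH (a minimal sketch, not part of the repo):

```python
import subprocess
import torch

# CUDA version PyTorch was built with, e.g. "11.7".
print("PyTorch built with CUDA:", torch.version.cuda)

# nvcc that will be used to compile the gridencoder / diff-gaussian-rasterization extensions.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)

# If the two versions differ (e.g. 11.3 vs 12.5), the build fails with the RuntimeError above;
# install a matching CUDA toolkit in the conda environment and point CUDA_HOME at it before building.
```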