Official PyTorch implementation for the paper:
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior, CVPR 2023.
Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong
We propose CodeTalker by casting speech-driven facial animation as a code query task in a finite proxy space of the learned codebook. Given the raw audio and a 3D neutral face template, our CodeTalker can produce vivid and realistic 3D facial motions with subtle expressions and accurate lip movements.
Other necessary packages:
pip install -r requirements.txt
IMPORTANT: Please make sure to modify the site-packages/torch/nn/modules/conv.py file by commenting out the self.padding_mode != 'zeros' line to allow for replicated padding for ConvTranspose1d.
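The location of site-packages differs between environments, so the following minimal sketch may help find the file to edit. It only locates conv.py and does not modify it; the helper name find_torch_conv_py is hypothetical and not part of this repository.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_torch_conv_py() -> Optional[Path]:
    """Return the expected path of torch/nn/modules/conv.py, or None if torch is absent.

    Hypothetical helper for illustration only; it does not edit the file.
    """
    # find_spec looks the package up without importing it.
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = Path(list(spec.submodule_search_locations)[0])
    return package_dir / "nn" / "modules" / "conv.py"


if __name__ == "__main__":
    path = find_torch_conv_py()
    if path is None:
        print("torch is not installed in this environment")
    else:
        print(path)
```

Run it inside the same environment used for training, since each environment has its own copy of the file.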
Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl and subj_seq_to_idx.pkl in the folder vocaset/.
Download "FLAME_sample.ply" from voca and put it in vocaset/.
Read the vertices/audio data and convert them to .npy/.wav files stored in vocaset/vertices_npy and vocaset/wav:
cd vocaset
python process_voca_data.py
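Before running the processing script, it can help to confirm that every input file listed above is in place. The sketch below is one way to do that; the helper name missing_files is hypothetical, and the file list is copied from the instructions in this section.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the VOCASET instructions in this README.
REQUIRED_VOCASET_FILES = (
    "data_verts.npy",
    "raw_audio_fixed.pkl",
    "templates.pkl",
    "subj_seq_to_idx.pkl",
    "FLAME_sample.ply",
)


def missing_files(folder, required: Iterable[str] = REQUIRED_VOCASET_FILES) -> List[str]:
    """Return the required file names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("vocaset")
    if missing:
        print("missing files:", ", ".join(missing))
    else:
        print("all VOCASET inputs are present")
```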
Follow the BIWI/README.md to preprocess the BIWI dataset and put the .npy/.wav files into BIWI/vertices_npy and BIWI/wav, and the templates.pkl into BIWI/.
Download the pretrained models from biwi_stage1.pth.tar & biwi_stage2.pth.tar and vocaset_stage1.pth.tar & vocaset_stage2.pth.tar. Put the pretrained models under the BIWI and VOCASET folders, respectively. Given the audio signal, run one of the following:
sh scripts/demo.sh vocaset
sh scripts/demo.sh BIWI
This script will automatically generate the rendered videos in the demo/output folder. You can also put your own test audio file (.wav format) under the demo/wav folder and specify the arguments in the DEMO section of config/<dataset>/demo.yaml accordingly (e.g., demo_wav_path, condition, subject).
Training and testing share a similar command:
sh scripts/<train.sh|test.sh> <exp_name> config/<vocaset|BIWI>/<stage1|stage2>.yaml <vocaset|BIWI> <s1|s2>
Please replace <exp_name> with your own experiment name and <vocaset|BIWI> with the name of your target dataset, i.e., vocaset or BIWI. Change the exp_dir in both scripts/train.sh and scripts/test.sh if needed. The default commands below serve as examples.
sh scripts/train.sh CodeTalker_s1 config/vocaset/stage1.yaml vocaset s1
Make sure the paths of pre-trained models are correct, i.e., vqvae_pretrained_path and wav2vec2model_path in config/<vocaset|BIWI>/stage2.yaml.
sh scripts/train.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2
sh scripts/test.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2
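The command template in this section has four variable parts, and the config path must agree with the dataset and stage arguments. As a rough sketch, the command can be assembled and validated like this; build_command is a hypothetical helper, not part of the repository, and it only builds the argument list without running anything.

```python
from typing import List

DATASETS = ("vocaset", "BIWI")
# Maps the short stage argument to the config file stem used in this README.
STAGES = {"s1": "stage1", "s2": "stage2"}


def build_command(mode: str, exp_name: str, dataset: str, stage: str) -> List[str]:
    """Build the train/test command as an argument list (hypothetical helper)."""
    if mode not in ("train", "test"):
        raise ValueError(f"mode must be 'train' or 'test', got {mode!r}")
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {tuple(STAGES)}, got {stage!r}")
    config = f"config/{dataset}/{STAGES[stage]}.yaml"
    return ["sh", f"scripts/{mode}.sh", exp_name, config, dataset, stage]


if __name__ == "__main__":
    print(" ".join(build_command("train", "CodeTalker_s1", "vocaset", "s1")))
```

The returned list can be passed to subprocess.run from the repository root if you prefer to launch experiments from Python.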
Modify the paths in scripts/render.sh and run:
sh scripts/render.sh
We provide the reference code for Lip Vertex Error & Upper-face Dynamics Deviation. Remember to change the paths in scripts/cal_metric.sh, and run:
sh scripts/cal_metric.sh
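For orientation, Lip Vertex Error is commonly described as the maximal L2 error over the lip vertices in each frame, averaged over all frames. The NumPy sketch below illustrates that definition only; it is not the repository's reference code, which may differ in details (for example, by using squared distances). The function name and the lip_idx argument are assumptions, and the lip vertex indices have to come from your dataset's lip region definition.

```python
import numpy as np


def lip_vertex_error(pred, gt, lip_idx) -> float:
    """Mean over frames of the largest per-vertex L2 error among lip vertices.

    pred, gt: arrays of shape (frames, vertices, 3).
    lip_idx:  indices of the lip vertices (hypothetical input for this sketch).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    diff = pred[:, lip_idx, :] - gt[:, lip_idx, :]   # (frames, lips, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)       # (frames, lips)
    return float(per_vertex.max(axis=1).mean())


if __name__ == "__main__":
    gt = np.zeros((2, 3, 3))
    pred = gt.copy()
    pred[0, 0] = [3.0, 4.0, 0.0]  # lip vertex, error 5 in frame 0
    pred[1, 1] = [0.0, 0.0, 1.0]  # lip vertex, error 1 in frame 1
    print(lip_vertex_error(pred, gt, lip_idx=[0, 1]))
```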
Create the dataset directory <dataset_dir> in the CodeTalker directory.
Place your vertices data (.npy files) and audio data (.wav files) in the <dataset_dir>/vertices_npy and <dataset_dir>/wav folders, respectively.
Save the templates of all subjects to a templates.pkl file and put it in <dataset_dir>, as done for the BIWI and vocaset datasets. Export an arbitrary template to .ply format and put it in <dataset_dir>/.
Create the corresponding config files in config/<dataset_dir> and modify the arguments in the config files.
Check all the code segments related to dataset information.
Follow the training/testing/visualization pipeline as done for the BIWI and vocaset datasets.
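When preparing a custom dataset, a quick consistency check on the two data folders can catch missing files early. The sketch below assumes that each vertices file and its audio file share the same base name (for example, seq01.npy and seq01.wav); that pairing convention is an assumption to verify against the data loader, and unmatched_sequences is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List


def unmatched_sequences(dataset_dir) -> Dict[str, List[str]]:
    """Report base names present in only one of vertices_npy/ and wav/."""
    root = Path(dataset_dir)
    npy_stems = {p.stem for p in (root / "vertices_npy").glob("*.npy")}
    wav_stems = {p.stem for p in (root / "wav").glob("*.wav")}
    return {
        "missing_wav": sorted(npy_stems - wav_stems),  # vertices without audio
        "missing_npy": sorted(wav_stems - npy_stems),  # audio without vertices
    }


if __name__ == "__main__":
    # "my_dataset" is a placeholder for your own <dataset_dir>.
    report = unmatched_sequences("my_dataset")
    print(report)
```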
If you find the code useful for your work, please star this repo and consider citing:
@inproceedings{xing2023codetalker,
title={Codetalker: Speech-driven 3d facial animation with discrete motion prior},
author={Xing, Jinbo and Xia, Menghan and Zhang, Yuechen and Cun, Xiaodong and Wang, Jue and Wong, Tien-Tsin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={12780--12790},
year={2023}
}
data_loader if needed.
We heavily borrow the code from FaceFormer, Learn2Listen, and VOCA. We thank the authors for sharing their code and huggingface-transformers for its wav2vec2 implementation. We also gratefully acknowledge the ETHZ-CVL for providing the B3D(AC)2 dataset and MPI-IS for releasing the VOCASET dataset. Any third-party packages are owned by their respective authors and must be used under their respective licenses.