
CodeTalker

Official PyTorch implementation for the paper:

CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior, CVPR 2023.

Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong

We propose CodeTalker by casting speech-driven facial animation as a code query task in a finite proxy space of the learned codebook. Given the raw audio and a 3D neutral face template, our CodeTalker can produce vivid and realistic 3D facial motions with subtle expressions and accurate lip movements.

Changelog

2023.06.16 Provide a Colab online demo.
2023.04.03 Release code and model weights!

Environment

Linux
Python 3.6+
PyTorch 1.9.1
CUDA 11.1 (GPU with at least 11GB VRAM)

Other necessary packages:

pip install -r requirements.txt

ffmpeg
MPI-IS/mesh

IMPORTANT: Modify the site-packages/torch/nn/modules/conv.py file by commenting out the self.padding_mode != 'zeros' line, so that ConvTranspose1d accepts replicate padding, as shown here.
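Finding the file and the line to comment out can be tedious when several environments are installed. The following is a minimal sketch, not part of the repository, that locates conv.py for the active interpreter and lists the lines mentioning the padding-mode check. The function names and the search string are assumptions chosen for illustration; the exact wording of the check may differ between PyTorch versions.

```python
import importlib.util
from pathlib import Path
from typing import List, Optional, Tuple


def locate_conv_py() -> Optional[Path]:
    """Return the expected path of torch/nn/modules/conv.py, or None if torch is not installed."""
    spec = importlib.util.find_spec("torch")  # does not import torch itself
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "nn" / "modules" / "conv.py"


def find_padding_checks(path: Path, needle: str = "padding_mode != 'zeros'") -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line of the file that contains `needle`."""
    hits = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip()))
    return hits


if __name__ == "__main__":
    conv_py = locate_conv_py()
    if conv_py is None or not conv_py.exists():
        print("torch was not found in this environment")
    else:
        print(conv_py)
        for number, text in find_padding_checks(conv_py):
            print(f"{number}: {text}")
```

The script only reports candidate lines; the edit itself is still done by hand, which keeps the change to the installed package deliberate.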

Dataset Preparation

VOCASET

Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl and subj_seq_to_idx.pkl in the folder vocaset/. Download "FLAME_sample.ply" from VOCA and put it in vocaset/ as well. Then read the vertices/audio data and convert them to .npy/.wav files stored in vocaset/vertices_npy and vocaset/wav:

cd vocaset
python process_voca_data.py
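Before running the conversion, it can help to confirm that every file listed above is in place. The sketch below is an illustrative helper, not part of the repository; the function name `missing_vocaset_files` is hypothetical, and the file list is copied from the instructions above.

```python
from pathlib import Path
from typing import List

# File names taken from the VOCASET instructions in this README.
REQUIRED_FILES = [
    "data_verts.npy",
    "raw_audio_fixed.pkl",
    "templates.pkl",
    "subj_seq_to_idx.pkl",
    "FLAME_sample.ply",
]


def missing_vocaset_files(root: str = "vocaset") -> List[str]:
    """Return the required file names that are not present under `root`."""
    base = Path(root)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_vocaset_files()
    if missing:
        print("Missing from vocaset/: " + ", ".join(missing))
    else:
        print("All required VOCASET files are present.")
```

Run it from the repository root so that the relative path vocaset/ resolves correctly.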

BIWI

Follow BIWI/README.md to preprocess the BIWI dataset, then put the .npy/.wav files into BIWI/vertices_npy and BIWI/wav, and templates.pkl into BIWI/.

Demo

Download the pretrained models biwi_stage1.pth.tar & biwi_stage2.pth.tar and vocaset_stage1.pth.tar & vocaset_stage2.pth.tar, and put them under the BIWI and vocaset folders, respectively. Given an audio signal,

to animate a mesh in FLAME topology, run:
```
sh scripts/demo.sh vocaset
```
to animate a mesh in BIWI topology, run:
```
sh scripts/demo.sh BIWI
```
This script automatically generates the rendered videos in the demo/output folder. You can also put your own test audio file (.wav format) under the demo/wav folder and set the arguments in the DEMO section of config/<dataset>/demo.yaml accordingly (e.g., demo_wav_path, condition, subject).
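When using your own audio, a quick look at the file header can catch a wrongly exported file before the demo runs. The sketch below uses only Python's standard `wave` module to report basic properties of a .wav file. The helper name `describe_wav` is hypothetical, and this README does not state which sample rate or channel count the demo expects, so the output is informational only.

```python
import sys
import wave
from typing import Dict


def describe_wav(path: str) -> Dict[str, float]:
    """Read the header of an uncompressed PCM .wav file and return its basic properties."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }


if __name__ == "__main__":
    # Usage: python describe_wav.py demo/wav/<your_file>.wav
    for wav_path in sys.argv[1:]:
        print(wav_path, describe_wav(wav_path))
```

Note that `wave` only handles uncompressed PCM files and raises `wave.Error` for other encodings.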

Training / Testing

Training and testing share a similar command:

sh scripts/<train.sh|test.sh> <exp_name> config/<vocaset|BIWI>/<stage1|stage2>.yaml <vocaset|BIWI> <s1|s2>

Replace <exp_name> with your own experiment name and <vocaset|BIWI> with the name of your target dataset, i.e., vocaset or BIWI. Change the exp_dir in both scripts/train.sh and scripts/test.sh if needed. The commands below use the default settings as an example.

Training for Discrete Motion Prior

sh scripts/train.sh CodeTalker_s1 config/vocaset/stage1.yaml vocaset s1

Training for Speech-Driven Motion Synthesis

Make sure the paths of pre-trained models are correct, i.e., vqvae_pretrained_path and wav2vec2model_path in config/<vocaset|BIWI>/stage2.yaml.
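One way to double-check these two entries is to print them before launching stage 2. The sketch below is a deliberately simple line scan rather than a real YAML parser (a library such as PyYAML would be the more robust choice). The helper name `read_flat_keys` is hypothetical, and the sketch assumes each key sits on its own `key: value` line.

```python
import re
from pathlib import Path
from typing import Dict, Optional, Sequence

# Key names taken from this README.
KEYS = ("vqvae_pretrained_path", "wav2vec2model_path")


def read_flat_keys(config_text: str, keys: Sequence[str] = KEYS) -> Dict[str, Optional[str]]:
    """Return the value of each `key: value` line found in the text, or None if absent."""
    found: Dict[str, Optional[str]] = {key: None for key in keys}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = re.match(r"^\s*([A-Za-z0-9_]+)\s*:\s*(.+?)\s*$", line)
        if match and match.group(1) in found:
            found[match.group(1)] = match.group(2).strip("'\"")
    return found


if __name__ == "__main__":
    config = Path("config/vocaset/stage2.yaml")
    if config.is_file():
        for key, value in read_flat_keys(config.read_text(encoding="utf-8")).items():
            print(f"{key}: {value}")
    else:
        print(f"{config} not found; run this from the repository root")
```

A value of None means the key was not found on any line, which is worth investigating before training.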

sh scripts/train.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2

Testing

sh scripts/test.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2
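The command template at the start of this section can also be assembled programmatically, which helps avoid mismatched config paths and stage flags when scripting several runs. The sketch below is illustrative only; `build_command` is a hypothetical helper, and it assumes the config layout config/<dataset>/<stage1|stage2>.yaml shown above.

```python
from typing import List

SCRIPTS = {"train": "scripts/train.sh", "test": "scripts/test.sh"}
DATASETS = ("vocaset", "BIWI")
STAGES = {"s1": "stage1", "s2": "stage2"}


def build_command(mode: str, exp_name: str, dataset: str, stage: str) -> List[str]:
    """Build the argument list for a training or testing run."""
    if mode not in SCRIPTS:
        raise ValueError(f"mode must be one of {sorted(SCRIPTS)}, got {mode!r}")
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {sorted(STAGES)}, got {stage!r}")
    config = f"config/{dataset}/{STAGES[stage]}.yaml"
    return ["sh", SCRIPTS[mode], exp_name, config, dataset, stage]


if __name__ == "__main__":
    # Print the three default commands used in this section.
    for mode, name, stage in [("train", "CodeTalker_s1", "s1"),
                              ("train", "CodeTalker_s2", "s2"),
                              ("test", "CodeTalker_s2", "s2")]:
        print(" ".join(build_command(mode, name, "vocaset", stage)))
```

The returned list can be passed directly to `subprocess.run` if you prefer to launch the scripts from Python.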

Visualization with Audio

Modify the paths in scripts/render.sh and run:

sh scripts/render.sh

Evaluation on BIWI

We provide the reference code for Lip Vertex Error & Upper-face Dynamics Deviation. Remember to change the paths in scripts/cal_metric.sh, and run:

sh scripts/cal_metric.sh

Play with Your Own Data

Data Preparation

Create the dataset directory <dataset_dir> in the CodeTalker directory.
Place your vertices data (.npy files) and audio data (.wav files) in the <dataset_dir>/vertices_npy and <dataset_dir>/wav folders, respectively.
Save the templates of all subjects to a templates.pkl file and put it in <dataset_dir>, as done for the BIWI and vocaset datasets. Export an arbitrary template to .ply format and put it in <dataset_dir>/.
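A layout check along the lines of the steps above can catch sequences that have vertices but no audio, or the reverse. The sketch below is a hypothetical helper, not part of the repository. It assumes that a vertices file and its audio file share the same base name (for example seq01.npy and seq01.wav), which you should confirm against the data loader before relying on it.

```python
from pathlib import Path
from typing import Dict, List


def unpaired_sequences(dataset_dir: str) -> Dict[str, List[str]]:
    """Compare base names in vertices_npy/ and wav/ and report the ones without a partner."""
    base = Path(dataset_dir)
    vertices = {p.stem for p in (base / "vertices_npy").glob("*.npy")}
    audio = {p.stem for p in (base / "wav").glob("*.wav")}
    return {
        "missing_wav": sorted(vertices - audio),  # vertices present, audio absent
        "missing_npy": sorted(audio - vertices),  # audio present, vertices absent
    }


def has_templates(dataset_dir: str) -> bool:
    """Return True if templates.pkl exists directly under the dataset directory."""
    return (Path(dataset_dir) / "templates.pkl").is_file()
```

For example, `unpaired_sequences("<dataset_dir>")` returning two empty lists, together with `has_templates("<dataset_dir>")` returning True, suggests the directory follows the layout described above.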

Training, Testing & Visualization

Create the corresponding config files in config/<dataset_dir> and modify the arguments in them.
Check all the code segments related to dataset information.
Follow the training/testing/visualization pipeline as done for the BIWI and vocaset datasets.

Citation

If you find the code useful for your work, please star this repo and consider citing:

@inproceedings{xing2023codetalker,
  title={Codetalker: Speech-driven 3d facial animation with discrete motion prior},
  author={Xing, Jinbo and Xia, Menghan and Zhang, Yuechen and Cun, Xiaodong and Wang, Jue and Wong, Tien-Tsin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={12780--12790},
  year={2023}
}

Notes

Although our codebase allows for training with multiple GPUs, we did not test this, and the training batch size is hardcoded to one. You may need to modify the data_loader accordingly.

Acknowledgement

We heavily borrow code from FaceFormer, Learn2Listen, and VOCA, and thank their authors for sharing it, as well as huggingface-transformers for its wav2vec2 implementation. We also gratefully acknowledge ETHZ-CVL for providing the B3D(AC)2 dataset and MPI-IS for releasing the VOCASET dataset. Any third-party packages are owned by their respective authors and must be used under their respective licenses.