We propose the Disentangled Audio-Visual System (DAVS)
to address arbitrary-subject talking face generation, which aims to synthesize a sequence of face images
that corresponds to given speech semantics, conditioned on either unconstrained speech audio or video.
This repo is barely maintained because this version of the code is out of date. If you are interested in the topic of Talking Face Generation, feel free to try the CODE of our CVPR2021 PAPER!
Download the pre-trained model checkpoint
Create the default folder "checkpoints" and put the checkpoint in it, or note the checkpoint's location as CHECKPOINT_PATH
Samples for testing can be found in the folder named 0572_0019_0003. This is a pre-processed sample from the VoxCeleb dataset.
Run the testing script to generate videos from video:
python test_all.py --test_root ./0572_0019_0003/video --test_type video --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
Run the testing script to generate videos from audio:
python test_all.py --test_root ./0572_0019_0003/audio --test_type audio --test_audio_video_length 99 --test_resume_path CHECKPOINT_PATH
Talking Effect on Human Characters
Talking Effect on Non-human Characters (Trained on Human Faces Only)
The face detection tool used in the demo videos can be found at RSA. It returns a MAT-file with 5 key point locations in a row for each image. Other face alignment methods, such as dlib, are also applicable. The key points we use for face alignment are the two eye centers and the average of the two mouth corners. Given each image's PATH and its face POINTS, you can find our face alignment procedure at preprocess/face_align.py.
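To make the three-point alignment idea concrete, here is a minimal sketch in Python. It is not the code in preprocess/face_align.py. The landmark order (left eye, right eye, nose, left mouth corner, right mouth corner), the interleaved x, y layout of the row, and the `TEMPLATE` target coordinates are all assumptions made for illustration.

```python
import numpy as np

def alignment_points(landmarks):
    """Reduce 5 landmarks to the 3 points used for alignment.

    ASSUMPTION: the row holds (x, y) pairs in the order left eye, right eye,
    nose, left mouth corner, right mouth corner. Check your detector's output.
    """
    pts = np.asarray(landmarks, dtype=np.float64).reshape(5, 2)
    mouth_center = (pts[3] + pts[4]) / 2.0  # average of the two mouth corners
    return np.stack([pts[0], pts[1], mouth_center])

def affine_from_three_points(src, dst):
    """Return the 2x3 affine matrix M such that M @ [x, y, 1] maps src to dst."""
    src_h = np.hstack([src, np.ones((3, 1))])  # homogeneous coordinates, 3x3
    # Solve src_h @ M.T = dst; exact for three non-collinear points.
    return np.linalg.solve(src_h, dst).T

# HYPOTHETICAL target positions inside a 256x256 crop (not taken from the repo).
TEMPLATE = np.array([[89.0, 100.0], [167.0, 100.0], [128.0, 180.0]])

landmarks = [100, 120, 180, 118, 140, 160, 110, 200, 172, 198]
src = alignment_points(landmarks)
M = affine_from_three_points(src, TEMPLATE)

# Applying M to the source points should land exactly on the template.
mapped = np.hstack([src, np.ones((3, 1))]) @ M.T
```

A 2x3 matrix of this form is what an image warping routine such as OpenCV's `cv2.warpAffine` expects, so it could be used to produce the aligned crop.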
Our audio preprocessing is the same as, and borrowed from, the MATLAB code of SyncNet. We then save the MFCC features into bin files.
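If you want to write or inspect such bin files from Python, the round trip might look like the sketch below. The float64 element type, the headerless raw layout, and the (20, 12) feature shape are assumptions for illustration only; verify them against the files produced by your own preprocessing.

```python
import os
import tempfile

import numpy as np

def save_features(features, path):
    """Write a feature array as raw float64 values in row-major order, no header."""
    np.ascontiguousarray(features, dtype=np.float64).tofile(path)

def load_features(path, shape):
    """Read a raw float64 file and restore the caller-supplied shape."""
    return np.fromfile(path, dtype=np.float64).reshape(shape)

# HYPOTHETICAL feature block: the shape is a placeholder, not the repo's value.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((20, 12))

with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "2.bin")
    save_features(mfcc, bin_path)
    file_size = os.path.getsize(bin_path)      # 8 bytes per float64 value
    restored = load_features(bin_path, mfcc.shape)
```

Note that MATLAB's `fwrite` stores matrices in column-major order, so a file written from MATLAB may need to be read with the transposed shape and then transposed back.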
data
├── train, val, test
│   ├── 0, 1, 2 ... 499 (one folder for each class)
│   │   ├── 0, 1, 2 ... #videos per class
│   │   │   ├── align_face256
│   │   │   │   ├── 0, 1, ... 28.jpg
│   │   │   ├── mfcc20
│   │   │   │   ├── 2, 3 ... 26.bin
where each video is extracted into frames and aligned using our protocol, and each audio track is processed and saved using MATLAB.
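As a quick sanity check of a dataset prepared this way, the layout can be walked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and pairing a frame with the MFCC file that shares its index is an assumption, not necessarily how train.py loads the data.

```python
import tempfile
from pathlib import Path

def iter_video_dirs(data_root, split="train"):
    """Yield (class_id, video_id, video_dir) for every video folder in a split."""
    split_dir = Path(data_root) / split
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for video_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            yield class_dir.name, video_dir.name, video_dir

def paired_samples(video_dir):
    """Return (index, frame_path, mfcc_path) for indices present in both folders.

    ASSUMPTION: a frame and an MFCC file belong together when their numeric
    file names match.
    """
    video_dir = Path(video_dir)
    frames = {int(p.stem): p for p in (video_dir / "align_face256").glob("*.jpg")}
    mfccs = {int(p.stem): p for p in (video_dir / "mfcc20").glob("*.bin")}
    shared = sorted(frames.keys() & mfccs.keys())
    return [(i, frames[i], mfccs[i]) for i in shared]

# Build a tiny dummy tree that mirrors the layout above, then walk it.
with tempfile.TemporaryDirectory() as root:
    video_dir = Path(root) / "train" / "0" / "0"
    (video_dir / "align_face256").mkdir(parents=True)
    (video_dir / "mfcc20").mkdir()
    for i in range(29):                      # frames 0.jpg ... 28.jpg
        (video_dir / "align_face256" / f"{i}.jpg").touch()
    for i in range(2, 27):                   # features 2.bin ... 26.bin
        (video_dir / "mfcc20" / f"{i}.bin").touch()

    videos = list(iter_video_dirs(root, "train"))
    pairs = paired_samples(videos[0][2])
    pair_indices = [i for i, _, _ in pairs]
```

With the file ranges shown in the tree, only indices 2 through 26 have both a frame and an MFCC file, so a check like this can flag videos where either folder is incomplete.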
python train.py
The use of this software is RESTRICTED to non-commercial research and educational purposes.
@inproceedings{zhou2019talking,
title = {Talking Face Generation by Adversarially Disentangled Audio-Visual Representation},
author = {Zhou, Hang and Liu, Yu and Liu, Ziwei and Luo, Ping and Wang, Xiaogang},
booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
year = {2019},
}
The structure of this codebase is borrowed from pix2pix.