Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs:
https://synclabs.so

Wav2Lip: Accurately Lip-syncing Videos In The Wild

Wav2Lip is hosted for free at Sync Labs

Are you looking to integrate this into a product? We have a turn-key hosted API with new and improved lip-syncing models here: https://synclabs.so/

For any other commercial / enterprise requests, please contact us at pavan@synclabs.so and prady@synclabs.so

To reach the authors directly, email prajwal@synclabs.so or rudrabha@synclabs.so.

This code is part of the paper: A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild published at ACM Multimedia 2020.


📑 Original Paper | 📰 Project Page | 🌀 Demo | ⚡ Live Testing | 📔 Colab Notebook
Paper | Project Page | Demo Video | Interactive Demo | Colab Notebook / Updated Colab Notebook



Highlights


Disclaimer

All results from this open-source code or our demo website should be used for research/academic/personal purposes only. As the models are trained on the LRS2 dataset, any form of commercial use is strictly prohibited. For commercial requests, please contact us directly!

Prerequisites

Getting the weights

Model | Description | Link to the model
Wav2Lip | Highly accurate lip-sync | Link
Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | Link
Expert Discriminator | Weights of the expert discriminator | Link
Visual Quality Discriminator | Weights of the visual disc trained in a GAN setup | Link

Lip-syncing videos using the pre-trained models (Inference)

You can lip-sync any video to any audio:

python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> 

The result is saved to results/result_voice.mp4 by default. You can change the output path with a command-line argument, as with several other available options. The audio source can be any file containing audio data that FFmpeg supports: *.wav, *.mp3, or even a video file, from which the code automatically extracts the audio.
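When lip-syncing several clips, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper names `build_inference_command` and `run_inference` and the example paths are hypothetical, and only the three flags shown above are used.

```python
import subprocess
import sys


def build_inference_command(checkpoint_path, face, audio):
    """Return the argv list for the inference command shown above.

    Hypothetical helper: it uses only the three flags documented in this README.
    """
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", str(checkpoint_path),
        "--face", str(face),
        "--audio", str(audio),
    ]


def run_inference(checkpoint_path, face, audio):
    """Run inference.py (assumed to be in the current directory); raise on failure."""
    subprocess.run(build_inference_command(checkpoint_path, face, audio), check=True)


if __name__ == "__main__":
    # Placeholder paths; substitute your own checkpoint, video, and audio files.
    cmd = build_inference_command("checkpoints/wav2lip.pth", "input.mp4", "speech.wav")
    print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.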

Tips for better results:

Preparing LRS2 for training

Our models are trained on LRS2. See the section "Training on datasets other than LRS2" below for a few suggestions regarding training on other datasets.

LRS2 dataset folder structure
data_root (mvlrs_v1)
├── main, pretrain (we use only main folder in this work)
|   ├── list of folders
|   │   ├── five-digit numbered video IDs ending with (.mp4)
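Before preprocessing, it can help to confirm that your copy of the dataset matches this layout. The sketch below is an illustration only and not part of the repository: the helper name `list_lrs2_videos` is hypothetical, and it assumes the videos sit exactly one folder level below the main directory, as the tree above shows.

```python
from pathlib import Path


def list_lrs2_videos(main_dir):
    """Return sorted (folder, video_id) pairs for every .mp4 one level below main_dir.

    Hypothetical helper; assumes the main/<folder>/<video_id>.mp4 layout shown above.
    """
    main_dir = Path(main_dir)
    return sorted(
        (video.parent.name, video.stem)
        for video in main_dir.glob("*/*.mp4")
    )


if __name__ == "__main__":
    # "data_root/main" is the same path passed to preprocess.py in this README.
    videos = list_lrs2_videos("data_root/main")
    print(f"Found {len(videos)} videos")
```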

Place the LRS2 filelists (train, val, test) .txt files in the filelists/ folder.

Preprocess the dataset for fast training
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/

Additional options, such as batch_size and the number of GPUs to use in parallel, can also be set.

Preprocessed LRS2 folder structure
preprocessed_root (lrs2_preprocessed)
├── list of folders
|   ├── Folders with five-digit numbered video IDs
|   │   ├── *.jpg
|   │   ├── audio.wav
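A quick sanity check of the preprocessed output can catch videos where preprocessing failed part-way. The following sketch is not part of the repository: the helper name `find_incomplete_videos` is hypothetical, and it assumes only the layout shown above, where each video folder holds *.jpg frames and an audio.wav file.

```python
from pathlib import Path


def find_incomplete_videos(preprocessed_root):
    """Return video directories that lack audio.wav or contain no .jpg frames.

    Hypothetical helper; assumes preprocessed_root/<folder>/<video_id>/ as shown above.
    """
    root = Path(preprocessed_root)
    incomplete = []
    for video_dir in sorted(root.glob("*/*")):
        if not video_dir.is_dir():
            continue  # ignore stray files at the video-folder level
        has_audio = (video_dir / "audio.wav").is_file()
        has_frames = any(video_dir.glob("*.jpg"))
        if not (has_audio and has_frames):
            incomplete.append(video_dir)
    return incomplete


if __name__ == "__main__":
    for video_dir in find_incomplete_videos("lrs2_preprocessed"):
        print(f"Incomplete: {video_dir}")
```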

Train!

There are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).

Training the expert discriminator

You can download the pre-trained weights if you want to skip this step. To train it:

python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>

Training the Wav2Lip models

You can either train the model without the additional visual quality discriminator (< 1 day of training) or use the discriminator (~2 days). For the former, run:

python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>

To train with the visual quality discriminator, run hq_wav2lip_train.py instead. The arguments for both files are similar. In both cases, you can also resume training. Run python wav2lip_train.py --help for more details. You can also set additional, less commonly used hyper-parameters at the bottom of the hparams.py file.
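The two training steps can be scripted so that they always run in the right order. This is a minimal sketch under stated assumptions, not part of the repository: the helper name `build_training_commands` and all paths are hypothetical, and it assumes hq_wav2lip_train.py accepts the same three flags as wav2lip_train.py (the README says only that the arguments are similar, so check --help first).

```python
import sys


def build_training_commands(data_root, expert_ckpt_dir, wav2lip_ckpt_dir,
                            expert_checkpoint, use_visual_quality_disc=False):
    """Return [expert_cmd, wav2lip_cmd] as argv lists, in the order they should run.

    Hypothetical helper. Assumption: hq_wav2lip_train.py takes the same flags
    as wav2lip_train.py; verify with --help before relying on this.
    """
    train_script = "hq_wav2lip_train.py" if use_visual_quality_disc else "wav2lip_train.py"
    expert_cmd = [
        sys.executable, "color_syncnet_train.py",
        "--data_root", data_root,
        "--checkpoint_dir", expert_ckpt_dir,
    ]
    wav2lip_cmd = [
        sys.executable, train_script,
        "--data_root", data_root,
        "--checkpoint_dir", wav2lip_ckpt_dir,
        "--syncnet_checkpoint_path", expert_checkpoint,
    ]
    return [expert_cmd, wav2lip_cmd]


if __name__ == "__main__":
    # Placeholder paths; the expert checkpoint file name is an assumption.
    commands = build_training_commands(
        "lrs2_preprocessed/", "expert_ckpts/", "wav2lip_ckpts/",
        "expert_ckpts/expert.pth", use_visual_quality_disc=True,
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True). If you use the pre-trained expert discriminator weights, skip the first command and point expert_checkpoint at the downloaded file.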

Training on datasets other than LRS2

Training on other datasets might require modifications to the code. Please read the following before you raise an issue:

When raising an issue on this topic, please let us know that you are aware of all these points.

We have an HD model trained on a dataset allowing commercial usage. The size of the generated face will be 192 x 288 in our new model.

Evaluation

Please check the evaluation/ folder for the instructions.

License and Citation

This repository can only be used for personal/research/non-commercial purposes. For commercial requests, please contact us directly at rudrabha@synclabs.so or prajwal@synclabs.so. We have a turn-key hosted API with new and improved lip-syncing models here: https://synclabs.so/. The size of the generated face will be 192 x 288 in our new models. Please cite the following paper if you use this repository:

@inproceedings{10.1145/3394171.3413532,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413532},
doi = {10.1145/3394171.3413532},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {484–492},
numpages = {9},
keywords = {lip sync, talking face generation, video generation},
location = {Seattle, WA, USA},
series = {MM '20}
}

Acknowledgments

Parts of the code structure are inspired by this TTS repository. We thank the author for this wonderful code. The code for Face Detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models. We thank zabique for the tutorial Colab notebook.
