Intelligible Lip-to-Speech Synthesis with Speech Units

Official PyTorch implementation for the following paper:

Intelligible Lip-to-Speech Synthesis with Speech Units
Jeongsoo Choi, Minsu Kim, Yong Man Ro
Interspeech 2023
[Paper] [Project]

Installation

conda create -y -n lip2speech python=3.10
conda activate lip2speech

git clone -b main --single-branch https://github.com/choijeongsoo/lip2speech-unit.git
cd lip2speech-unit

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
git checkout afc77bd
pip install -e ./
cd ..

Data Preparation

Video and Audio

Speech Units

Speaker Embedding

Mel-spectrogram

We provide sample data in the 'datasets/lrs3' directory.
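For reference, a quick way to inspect the provided sample so you can mirror its layout when preparing your own video/audio, speech units, speaker embeddings, and mel-spectrograms (the command below is only an illustration; the actual contents are whatever ships in 'datasets/lrs3'):

```sh
# List every file in the bundled LRS3 sample data;
# prepare your own dataset following the same directory layout.
find datasets/lrs3 -type f | sort
```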

Model Checkpoints

Lip Reading Sentences 3 (LRS3)

| 1st stage | 2nd stage | STOI | ESTOI | PESQ | WER(%) |
|:---------------:|:---------------:|:----:|:----:|:----:|:----:|
| [Multi-target Lip2Speech](https://drive.google.com/file/d/1sFtoczuEmQaQXszCadCnNn6Itiohn5bN/view?usp=sharing) | [Multi-input Vocoder](https://drive.google.com/file/d/1WdbOFwUy-0eGvK2vT691ZsbqRAN9_Tgw/view?usp=sharing) | 0.552 | 0.354 | 1.31 | 50.4 |
| [Multi-target Lip2Speech](https://drive.google.com/file/d/1sFtoczuEmQaQXszCadCnNn6Itiohn5bN/view?usp=sharing) | [Multi-input Vocoder + augmentation](https://drive.google.com/file/d/13zimLyyXluQ2RuXbBk2b3S9LnBnfLptj/view?usp=sharing) | 0.543 | 0.351 | 1.28 | 50.2 |
| [Multi-target Lip2Speech + AV-HuBERT](https://drive.google.com/file/d/1oS80l6zpIfMTVKwvaHUSOC9ByjzGibSp/view?usp=sharing) | [Multi-input Vocoder + augmentation](https://drive.google.com/file/d/13zimLyyXluQ2RuXbBk2b3S9LnBnfLptj/view?usp=sharing) | 0.578 | 0.393 | 1.31 | 29.8 |

Lip Reading Sentences 2 (LRS2)

| 1st stage | 2nd stage | STOI | ESTOI | PESQ | WER(%) |
|:---------------:|:---------------:|:----:|:----:|:----:|:----:|
| [Multi-target Lip2Speech](https://drive.google.com/file/d/1aTv0e-TjD9AsVeijomCw_zAZxzE8Lhv-/view?usp=sharing) | [Multi-input Vocoder](https://drive.google.com/file/d/1tzI-LdOauWr6VC3zMHuL-HZcQu_QTCqX/view?usp=sharing) | | | | |
| [Multi-target Lip2Speech](https://drive.google.com/file/d/1aTv0e-TjD9AsVeijomCw_zAZxzE8Lhv-/view?usp=sharing) | [Multi-input Vocoder + augmentation](https://drive.google.com/file/d/1WEZM0ICZdnafaC8ASwzIKMp_6fgUlYrs/view?usp=sharing) | 0.565 | 0.395 | 1.32 | 44.8 |
| [Multi-target Lip2Speech + AV-HuBERT](https://drive.google.com/file/d/1meL4ZrSLgFEe0xh1yvejQxXunE88dvDf/view?usp=sharing) | [Multi-input Vocoder + augmentation](https://drive.google.com/file/d/1WEZM0ICZdnafaC8ASwzIKMp_6fgUlYrs/view?usp=sharing) | 0.585 | 0.412 | 1.34 | 35.7 |

We use the pre-trained AV-HuBERT Large (LRS3 + VoxCeleb2 (En)) model available from the official AV-HuBERT repository.

For inference, download the checkpoints and place them in the 'checkpoints' directory.
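For example, the checkpoints can be fetched from Google Drive with gdown, using the file IDs from the links in the tables above (the output filenames below are hypothetical; use whatever names your configs and scripts expect):

```sh
pip install gdown
mkdir -p checkpoints

# Multi-target Lip2Speech (LRS3), 1st stage -- file ID taken from the table above.
gdown "https://drive.google.com/uc?id=1sFtoczuEmQaQXszCadCnNn6Itiohn5bN" -O checkpoints/multi_target_lip2speech_lrs3.pt   # hypothetical filename

# Multi-input Vocoder (LRS3), 2nd stage.
gdown "https://drive.google.com/uc?id=1WdbOFwUy-0eGvK2vT691ZsbqRAN9_Tgw" -O checkpoints/multi_input_vocoder_lrs3.pt        # hypothetical filename
```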

Training

Run scripts/${DATASET}/train.sh in the 'multi_target_lip2speech' and 'multi_input_vocoder' directories (see the sketch below).
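A minimal sketch, assuming DATASET=lrs3 (use lrs2 for LRS2) and that each stage is trained from its own directory:

```sh
DATASET=lrs3  # or lrs2

# 1st stage: multi-target lip-to-speech model
cd multi_target_lip2speech
bash scripts/${DATASET}/train.sh

# 2nd stage: multi-input vocoder
cd ../multi_input_vocoder
bash scripts/${DATASET}/train.sh
```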

Inference

Run scripts/${DATASET}/inference.sh in the 'multi_target_lip2speech' and 'multi_input_vocoder' directories (see the sketch below).
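Similarly, a minimal sketch for inference with the downloaded checkpoints in 'checkpoints' (again assuming DATASET=lrs3):

```sh
DATASET=lrs3  # or lrs2

# 1st stage: run the lip-to-speech model
cd multi_target_lip2speech
bash scripts/${DATASET}/inference.sh

# 2nd stage: run the vocoder on the 1st-stage outputs
cd ../multi_input_vocoder
bash scripts/${DATASET}/inference.sh
```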

Acknowledgement

This repository is built using Fairseq, AV-HuBERT, ESPnet, and speech-resynthesis. We appreciate the authors of these projects for open-sourcing their code.

Citation

If our work is useful for your research, please cite the following paper:

@article{choi2023intelligible,
      title={Intelligible Lip-to-Speech Synthesis with Speech Units},
      author={Jeongsoo Choi and Minsu Kim and Yong Man Ro},
      journal={arXiv preprint arXiv:2305.19603},
      year={2023},
}