We provide code for reproducing the main results of *Jointly Learning Visual and Auditory Speech Representations from Raw Data* (RAVEn) and *BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition*. Our implementation is based on PyTorch Lightning.
```
conda env create -f environment.yml
```

Change the environment prefix in `environment.yml` to match the location of your miniconda3 installation, if necessary.
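If the `prefix:` line needs changing, it can be rewritten in place. A minimal sketch, operating on a throwaway copy so the original file is untouched; the paths below are placeholders, not the ones shipped in this repo:

```shell
# Create an example file standing in for environment.yml (placeholder contents).
printf 'name: raven\nprefix: /opt/miniconda3/envs/raven\n' > environment.example.yml
# Point the prefix at your own miniconda3 location (placeholder path).
sed -i 's|^prefix: .*|prefix: /home/user/miniconda3/envs/raven|' environment.example.yml
grep '^prefix:' environment.example.yml
```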
Extract the mouth regions from the videos using the corresponding facial landmarks:

```
python preprocessing/extract_mouths.py --src_dir ${SOURCE_DIR} --tgt_dir ${TARGET_DIR} --landmarks_dir ${LANDMARKS_DIR}
```
Below are the checkpoints of the Base and Large models pre-trained with RAVEn on LRS3+Vox2-en.
Model | Modality | Checkpoint |
---|---|---|
Base | Video | Download |
Base | Audio | Download |
Large | Video | Download |
Large | Audio | Download |
Below are the checkpoints of the Base, Base+, and Large models pre-trained with BRAVEn.
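Since the implementation is based on PyTorch Lightning, the checkpoints above presumably follow its standard layout (a dict containing a `state_dict` entry); that layout is an assumption, so check the keys of the file you download. A minimal sketch of loading and inspecting one, using a tiny stand-in file:

```python
import torch

# Build a dummy checkpoint with the assumed PyTorch Lightning layout;
# real RAVEn/BRAVEn checkpoints will contain many more entries.
dummy = {"state_dict": {"encoder.proj.weight": torch.zeros(4, 4)}}
torch.save(dummy, "dummy_ckpt.pth")

# map_location="cpu" allows loading on machines without a GPU.
ckpt = torch.load("dummy_ckpt.pth", map_location="cpu")
print(sorted(ckpt["state_dict"].keys()))  # → ['encoder.proj.weight']
```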
Model | Modality | Checkpoint |
---|---|---|
Base (LRS3) | Video | Download |
Base (LRS3) | Audio | Download |
Base+ (LRS3+Vox2) | Video | Download |
Base+ (LRS3+Vox2) | Audio | Download |
Large (LRS3+Vox2+AVS) | Video | Download |
Large (LRS3+Vox2+AVS) | Audio | Download |
Below are the checkpoints corresponding to Tables 1 and 2 for VSR and ASR on LRS3. Models are provided for both low- and high-resource labelled data settings. In the high-resource setting, the models are fine-tuned on the full LRS3 dataset (433 hours). In the low-resource setting, they are fine-tuned on a subset ("trainval") of LRS3 (30 hours).
In some cases, the models were re-trained, so their WERs may differ slightly from those reported in the papers (which are also reproduced below).
The paths of the slurm bash scripts used for inference are listed in the tables below. Note that the scripts may need to be modified according to your cluster environment.
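Adapting a script to a cluster typically means editing the `#SBATCH` directives at its top; a hypothetical header of that kind is sketched below, where the partition name, GPU count, and time limit are placeholders, not values taken from this repo:

```shell
#!/bin/bash
#SBATCH --job-name=vsr_eval       # placeholder job name
#SBATCH --partition=gpu           # cluster-specific partition (assumption)
#SBATCH --gres=gpu:1              # GPU type/count depends on your cluster
#SBATCH --time=04:00:00           # adjust to the expected runtime
```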
The language model we used in this work can be found here.
RAVEn, VSR, low-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 47.0 | Download | scripts/vsr/lrs3_trainval/base_lrs3.sh |
Base | LRS3+Vox2-en | 40.2 | Download | scripts/vsr/lrs3_trainval/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 32.5 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 23.8 | same as last row | scripts/vsr/lrs3_trainval/large_lrs3vox2_self_lm.sh |
BRAVEn, VSR, low-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 43.4 | Download | scripts/vsr/lrs3_trainval/base_lrs3_braven.sh |
Base Plus | LRS3+Vox2-en | 35.1 | Download | scripts/vsr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 30.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 21.3 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.0 | same as last row | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |
RAVEn, VSR, high-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 39.1 | Download | scripts/vsr/lrs3/base_lrs3.sh |
Base | LRS3+Vox2-en | 33.1 | Download | scripts/vsr/lrs3/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 27.8 | Download | scripts/vsr/lrs3/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 24.4 | Download | scripts/vsr/lrs3/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 23.1 | same as last row | scripts/vsr/lrs3/large_lrs3vox2_self_lm.sh |
BRAVEn, VSR, high-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 36.0 | Download | scripts/vsr/lrs3/base_lrs3_braven.sh |
Base Plus | LRS3+Vox2-en | 28.8 | Download | scripts/vsr/lrs3/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 26.6 | Download | scripts/vsr/lrs3/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 23.6 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 20.9 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.1 | same as last row | scripts/vsr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |
RAVEn, ASR, low-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 4.7 | Download | scripts/asr/lrs3_trainval/base_lrs3.sh |
Base | LRS3+Vox2-en | 3.8 | Download | scripts/asr/lrs3_trainval/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 2.7 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 1.9 | same as last row | scripts/asr/lrs3_trainval/large_lrs3vox2_self_lm.sh |
BRAVEn, ASR, low-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 4.0 | Download | scripts/asr/lrs3_trainval/base_lrs3_braven.sh |
Base Plus | LRS3+Vox2-en | 3.0 | Download | scripts/asr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 2.1 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 1.9 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.7 | same as last row | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |
RAVEn, ASR, high-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 2.2 | Download | scripts/asr/lrs3/base_lrs3.sh |
Base | LRS3+Vox2-en | 1.9 | Download | scripts/asr/lrs3/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 1.4 | same as last row | scripts/asr/lrs3/large_lrs3vox2_self_lm.sh |
BRAVEn, ASR, high-resource setting:

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 1.9 | Download | scripts/asr/lrs3/base_lrs3_braven.sh |
Base Plus | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.1 | same as last row | scripts/asr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |
Code for pre-training and fine-tuning is coming soon.
If you find this repo useful for your research, please consider citing the following:
```bibtex
@article{haliassos2022jointly,
  title={Jointly Learning Visual and Auditory Speech Representations from Raw Data},
  author={Haliassos, Alexandros and Ma, Pingchuan and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  journal={arXiv preprint arXiv:2212.06246},
  year={2022}
}

@inproceedings{haliassos2024braven,
  title={BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition},
  author={Haliassos, Alexandros and Zinonos, Andreas and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={11431--11435},
  year={2024},
  organization={IEEE}
}
```