
(B)RAVEn: A PyTorch Lightning Implementation

Introduction

This is the official implementation of RAVEn (ICLR 2023) and BRAVEn (ICASSP 2024). We provide code to reproduce the main results of "Jointly Learning Visual and Auditory Speech Representations from Raw Data" (RAVEn) and "BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition" (BRAVEn). Our implementation is based on PyTorch Lightning.

Preparation

Installation

Run `conda env create -f environment.yml` to create the conda environment. If necessary, change the environment prefix in environment.yml to match the location of your miniconda3 installation.
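
A minimal sketch of the installation steps is below. The environment name is an assumption; use the `name:` field from environment.yml.

```bash
# Create the conda environment from the provided file.
conda env create -f environment.yml

# Activate it. "raven" is an assumed name; check the "name:" field in environment.yml.
conda activate raven
```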

Data

  1. Download the datasets used in the papers (LRS3, VoxCeleb2, and, for the BRAVEn Large models, AVSpeech) from their respective websites.
  2. Compute 68 landmarks per frame using, e.g., RetinaFace and 2-D FAN, or download pre-computed landmarks, e.g., from this repo. Each landmark file should have the same name as its corresponding video, but with a .npy extension.
  3. Crop the mouths with the following command (see the example invocation below the list):
    python preprocessing/extract_mouths.py --src_dir ${SOURCE_DIR} --tgt_dir ${TARGET_DIR} --landmarks_dir ${LANDMARKS_DIR}
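
For concreteness, here is a hedged end-to-end example of steps 2-3. The directory names are hypothetical; the flags are the ones shown above.

```bash
# Hypothetical layout: one .npy landmark file per video, mirroring the video tree.
#   /data/lrs3/videos/trainval/spk00001/00001.mp4
#   /data/lrs3/landmarks/trainval/spk00001/00001.npy   (68 landmarks per frame)

# Crop the mouth regions into a parallel directory tree.
python preprocessing/extract_mouths.py \
    --src_dir /data/lrs3/videos \
    --tgt_dir /data/lrs3/mouths \
    --landmarks_dir /data/lrs3/landmarks
```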

RAVEn pre-trained models

Below are the checkpoints of the Base and Large models pre-trained with RAVEn on LRS3+Vox2-en.

| Model | Modality | Checkpoint |
|-------|----------|------------|
| Base  | Video    | Download   |
| Base  | Audio    | Download   |
| Large | Video    | Download   |
| Large | Audio    | Download   |

BRAVEn pre-trained models

Below are the checkpoints of the Base, Base+, and Large models pre-trained with BRAVEn.

| Model                 | Modality | Checkpoint |
|-----------------------|----------|------------|
| Base (LRS3)           | Video    | Download   |
| Base (LRS3)           | Audio    | Download   |
| Base+ (LRS3+Vox2)     | Video    | Download   |
| Base+ (LRS3+Vox2)     | Audio    | Download   |
| Large (LRS3+Vox2+AVS) | Video    | Download   |
| Large (LRS3+Vox2+AVS) | Audio    | Download   |

Testing

VSR

RAVEn low-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 47.0    | Download         | scripts/vsr/lrs3_trainval/base_lrs3.sh |
| Base             | LRS3+Vox2-en         | 40.2    | Download         | scripts/vsr/lrs3_trainval/base_lrs3vox2.sh |
| Large            | LRS3+Vox2-en         | 32.5    | Download         | scripts/vsr/lrs3_trainval/large_lrs3vox2.sh |
| Large w/ ST      | LRS3+Vox2-en         | 24.8    | Download         | scripts/vsr/lrs3_trainval/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en         | 23.8    | same as last row | scripts/vsr/lrs3_trainval/large_lrs3vox2_self_lm.sh |
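
The bash scripts listed in these tables wrap the evaluation commands. As a sketch (an assumption: the data and checkpoint paths referenced inside each script will likely need editing for your setup), a run might look like:

```bash
# Download the corresponding checkpoint first, then adjust the paths inside the
# script (data directory, checkpoint location) before running it.
bash scripts/vsr/lrs3_trainval/base_lrs3.sh
```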

BRAVEn low-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 43.4    | Download         | scripts/vsr/lrs3_trainval/base_lrs3_braven.sh |
| Base Plus        | LRS3+Vox2-en         | 35.1    | Download         | scripts/vsr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en         | 30.8    | Download         | scripts/vsr/lrs3_trainval/large_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en+AVS     | 24.8    | Download         | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
| Large w/ ST      | LRS3+Vox2-en+AVS     | 21.3    | Download         | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS     | 20.0    | same as last row | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |

RAVEn high-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 39.1    | Download         | scripts/vsr/lrs3/base_lrs3.sh |
| Base             | LRS3+Vox2-en         | 33.1    | Download         | scripts/vsr/lrs3/base_lrs3vox2.sh |
| Large            | LRS3+Vox2-en         | 27.8    | Download         | scripts/vsr/lrs3/large_lrs3vox2.sh |
| Large w/ ST      | LRS3+Vox2-en         | 24.4    | Download         | scripts/vsr/lrs3/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en         | 23.1    | same as last row | scripts/vsr/lrs3/large_lrs3vox2_self_lm.sh |

BRAVEn high-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 36.0    | Download         | scripts/vsr/lrs3/base_lrs3_braven.sh |
| Base Plus        | LRS3+Vox2-en         | 28.8    | Download         | scripts/vsr/lrs3/baseplus_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en         | 26.6    | Download         | scripts/vsr/lrs3/large_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en+AVS     | 23.6    | Download         | scripts/vsr/lrs3/large_lrs3vox2avs_braven.sh |
| Large w/ ST      | LRS3+Vox2-en+AVS     | 20.9    | Download         | scripts/vsr/lrs3/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS     | 20.1    | same as last row | scripts/vsr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |

ASR

RAVEn low-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 4.7     | Download         | scripts/asr/lrs3_trainval/base_lrs3.sh |
| Base             | LRS3+Vox2-en         | 3.8     | Download         | scripts/asr/lrs3_trainval/base_lrs3vox2.sh |
| Large            | LRS3+Vox2-en         | 2.7     | Download         | scripts/asr/lrs3_trainval/large_lrs3vox2.sh |
| Large w/ ST      | LRS3+Vox2-en         | 2.3     | Download         | scripts/asr/lrs3_trainval/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en         | 1.9     | same as last row | scripts/asr/lrs3_trainval/large_lrs3vox2_self_lm.sh |

BRAVEn low-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 4.0     | Download         | scripts/asr/lrs3_trainval/base_lrs3_braven.sh |
| Base Plus        | LRS3+Vox2-en         | 3.0     | Download         | scripts/asr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en         | 2.3     | Download         | scripts/asr/lrs3_trainval/large_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en+AVS     | 2.1     | Download         | scripts/asr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
| Large w/ ST      | LRS3+Vox2-en+AVS     | 1.9     | Download         | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS     | 1.7     | same as last row | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |

RAVEn high-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 2.2     | Download         | scripts/asr/lrs3/base_lrs3.sh |
| Base             | LRS3+Vox2-en         | 1.9     | Download         | scripts/asr/lrs3/base_lrs3vox2.sh |
| Large            | LRS3+Vox2-en         | 1.4     | Download         | scripts/asr/lrs3/large_lrs3vox2.sh |
| Large w/ ST      | LRS3+Vox2-en         | 1.4     | Download         | scripts/asr/lrs3/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en         | 1.4     | same as last row | scripts/asr/lrs3/large_lrs3vox2_self_lm.sh |

BRAVEn high-resource

| Model            | Pre-training dataset | WER (%) | Checkpoint       | Bash script |
|------------------|----------------------|---------|------------------|-------------|
| Base             | LRS3                 | 1.9     | Download         | scripts/asr/lrs3/base_lrs3_braven.sh |
| Base Plus        | LRS3+Vox2-en         | 1.4     | Download         | scripts/asr/lrs3/baseplus_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en         | 1.2     | Download         | scripts/asr/lrs3/large_lrs3vox2_braven.sh |
| Large            | LRS3+Vox2-en+AVS     | 1.2     | Download         | scripts/asr/lrs3/large_lrs3vox2avs_braven.sh |
| Large w/ ST      | LRS3+Vox2-en+AVS     | 1.2     | Download         | scripts/asr/lrs3/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS     | 1.1     | same as last row | scripts/asr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |

Code for pre-training and fine-tuning coming soon...

Citation

If you find this repo useful for your research, please consider citing the following:

@article{haliassos2022jointly,
  title={Jointly Learning Visual and Auditory Speech Representations from Raw Data},
  author={Haliassos, Alexandros and Ma, Pingchuan and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  journal={arXiv preprint arXiv:2212.06246},
  year={2022}
}
@inproceedings{haliassos2024braven,
  title={BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition},
  author={Haliassos, Alexandros and Zinonos, Andreas and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={11431--11435},
  year={2024},
  organization={IEEE}
}