MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
MuAViC provides audio-visual speech data in nine languages for speech recognition, together with six English-to-X and six X-to-English speech-to-text translation directions.
The raw data is collected from TED/TEDx talk recordings.
**Audio-visual speech recognition (AVSR)**

| Language | Code | Train Hours (H+P) | Train Speakers |
|:---:|:---:|:---:|:---:|
| English | En | 436 + 0 | 4.7K |
| Arabic | Ar | 16 + 0 | 95 |
| German | De | 10 + 0 | 53 |
| Greek | El | 25 + 0 | 113 |
| Spanish | Es | 178 + 0 | 987 |
| French | Fr | 176 + 0 | 948 |
| Italian | It | 101 + 0 | 487 |
| Portuguese | Pt | 153 + 0 | 810 |
| Russian | Ru | 49 + 0 | 238 |
**Audio-visual speech-to-text translation (AVST), English to X**

| Direction | Code | Train Hours (H+P) | Train Speakers |
|:---:|:---:|:---:|:---:|
| English-Greek | En-El | 17 + 420 | 4.7K |
| English-Spanish | En-Es | 21 + 416 | 4.7K |
| English-French | En-Fr | 21 + 416 | 4.7K |
| English-Italian | En-It | 20 + 417 | 4.7K |
| English-Portuguese | En-Pt | 18 + 419 | 4.7K |
| English-Russian | En-Ru | 20 + 417 | 4.7K |
**Audio-visual speech-to-text translation (AVST), X to English**

| Direction | Code | Train Hours (H+P) | Train Speakers |
|:---:|:---:|:---:|:---:|
| Greek-English | El-En | 8 + 17 | 113 |
| Spanish-English | Es-En | 64 + 114 | 987 |
| French-English | Fr-En | 45 + 131 | 948 |
| Italian-English | It-En | 48 + 53 | 487 |
| Portuguese-English | Pt-En | 53 + 100 | 810 |
| Russian-English | Ru-En | 8 + 41 | 238 |
We provide scripts to generate the audio/video data and AV-HuBERT training manifests for MuAViC.
First, clone this repo to get the scripts:
```bash
git clone https://github.com/facebookresearch/muavic.git
```
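Optionally, the dependencies below can be installed into a fresh conda environment first. This is not part of the original instructions; the environment name and Python version here are arbitrary choices:

```bash
# Optional: create and activate an isolated environment (name and version are arbitrary)
conda create -n muavic python=3.9
conda activate muavic
```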
Install required packages:
```bash
conda install -c conda-forge ffmpeg==4.2.2
conda install -c conda-forge sox
pip install -r requirements.txt
```
Then get the audio-visual speech recognition and translation data via:
```bash
python get_data.py --root-path ${ROOT} --src-lang ${SRC_LANG}
```
where the speech language `${SRC_LANG}` is one of `en`, `ar`, `de`, `el`, `es`, `fr`, `it`, `pt` and `ru`.
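For example, to prepare the French portion with `./data` as the root directory (the path is just an illustration):

```bash
# Download and preprocess the French (fr) audio-visual data under ./data
ROOT=./data
python get_data.py --root-path ${ROOT} --src-lang fr
```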
Generated data will be saved to `${ROOT}/muavic`:

- `${ROOT}/muavic/${SRC_LANG}/audio` for processed audio files
- `${ROOT}/muavic/${SRC_LANG}/video` for processed video files
- `${ROOT}/muavic/${SRC_LANG}/*.tsv` for AV-HuBERT AVSR training manifests
- `${ROOT}/muavic/${SRC_LANG}/${TGT_LANG}/*.tsv` for AV-HuBERT AVST training manifests
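The `*.tsv` manifests above can be sanity-checked from the shell. The sketch below assumes the usual AV-HuBERT manifest layout (first line is the data root, then tab-separated rows of id, video path, audio path, video frame count, audio frame count), a `train.tsv` file name, and a 25 fps video frame rate; verify all of these against your generated files:

```bash
# Sketch: summarize one generated manifest (file name, columns, and frame rate are assumptions)
TSV=${ROOT}/muavic/fr/train.tsv
head -n 1 ${TSV}                           # data root recorded in the manifest
tail -n +2 ${TSV} | wc -l                  # number of utterances
tail -n +2 ${TSV} | awk -F'\t' '{s += $4} END {printf "approx. hours: %.1f\n", s / 25 / 3600}'
```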
In the following table, we provide all end-to-end trained models mentioned in our paper:
| Task | Languages | Best Checkpoint | Dictionary | Tokenizer |
|---|---|---|---|---|
| AVSR | ar | best_ckpt.pt | dict | tokenizer |
| | de | best_ckpt.pt | dict | tokenizer |
| | el | best_ckpt.pt | dict | tokenizer |
| | en | best_ckpt.pt | dict | tokenizer |
| | es | best_ckpt.pt | dict | tokenizer |
| | fr | best_ckpt.pt | dict | tokenizer |
| | it | best_ckpt.pt | dict | tokenizer |
| | pt | best_ckpt.pt | dict | tokenizer |
| | ru | best_ckpt.pt | dict | tokenizer |
| | ar,de,el,es,fr,it,pt,ru | best_ckpt.pt | dict | tokenizer |
| AVST | en-el | best_ckpt.pt | dict | tokenizer |
| | en-es | best_ckpt.pt | dict | tokenizer |
| | en-fr | best_ckpt.pt | dict | tokenizer |
| | en-it | best_ckpt.pt | dict | tokenizer |
| | en-pt | best_ckpt.pt | dict | tokenizer |
| | en-ru | best_ckpt.pt | dict | tokenizer |
| | el-en | best_ckpt.pt | dict | tokenizer |
| | es-en | best_ckpt.pt | dict | tokenizer |
| | fr-en | best_ckpt.pt | dict | tokenizer |
| | it-en | best_ckpt.pt | dict | tokenizer |
| | pt-en | best_ckpt.pt | dict | tokenizer |
| | ru-en | best_ckpt.pt | dict | tokenizer |
| | {el,es,fr,it,pt,ru}-en | best_ckpt.pt | dict | tokenizer |
To try out our state-of-the-art audio-visual models with different audio and video inputs, including video recorded through a webcam or an uploaded video, check out our demo:
https://github.com/facebookresearch/muavic/assets/15960959/d03df3b0-488c-443c-ba3b-452b1a5765d8
You can read more about our model in the README file in the demo folder.
For training audio-visual models, we use the AV-HuBERT framework.
Clone and install AV-HuBERT in the root directory:
```bash
$ # Clone the "muavic" branch of av_hubert's repo
$ git clone -b muavic https://github.com/facebookresearch/av_hubert.git
$ # Set the fairseq submodule version
$ cd av_hubert
$ git submodule init
$ git submodule update
$ # Install av-hubert's requirements
$ pip install -r requirements.txt
$ # Install fairseq
$ cd fairseq
$ pip install --editable ./
```
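As a quick sanity check (not part of the original instructions), you can confirm that the editable fairseq install is importable:

```bash
$ python -c "import fairseq; print(fairseq.__version__)"
```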
Download an AV-HuBERT pre-trained model from here.
Open the training script (`scripts/train.sh`) and replace these variables:
```bash
# language direction (e.g. "en" or "en-fr")
LANG=
# path where output trained models will be located
OUT_PATH=
# path to the downloaded pre-trained model
PRETRAINED_MODEL_PATH=
```
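For instance, an English-to-French setup might look like the following; every path here is a placeholder, not one created by the scripts:

```bash
LANG=en-fr                                              # English-to-French AVST
OUT_PATH=./experiments/en-fr_avst                       # output directory (placeholder)
PRETRAINED_MODEL_PATH=./pretrained/large_vox_iter5.pt   # downloaded AV-HuBERT checkpoint (placeholder)
```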
Run the training script:
```bash
$ bash scripts/train.sh
```
**Note:** All audio-visual models found here used the `large_vox_iter5.pt` pre-trained model.
To evaluate your trained model (or one of our trained models listed above) against MuAViC, follow these steps:
Open the decoding script (`scripts/decode.sh`) and replace these variables:
```bash
# language direction (e.g. "en" or "en-fr")
LANG=???
# data split (e.g. "test" or "valid")
GROUP=???
# inference modality (choices: "audio", "video", "audio,video")
MODALITIES=???
# path to the trained model
MODEL_PATH=???
# path where decoding results and scores will be located
OUT_PATH=???
```
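For example, decoding an English-to-French model on the test set with both modalities might be configured as follows; the model and output paths are placeholders:

```bash
LANG=en-fr                                                # English-to-French AVST
GROUP=test                                                # data split
MODALITIES=audio,video                                    # use both audio and video at inference
MODEL_PATH=./experiments/en-fr_avst/checkpoints/checkpoint_best.pt   # trained model (placeholder)
OUT_PATH=./experiments/en-fr_avst/decode                  # decoding results (placeholder)
```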
Run the decoding script:
```bash
$ bash scripts/decode.sh
```
License: CC-BY-NC 4.0
Citation:

```bibtex
@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}
```