This is an implementation of the NAACL 2022 paper "Cross-modal Contrastive Learning for Speech Translation" (read the paper here). The implementation is based on the fairseq codebase.
CONTRIBUTION: You are also more than welcome to test our code on your machines and report feedback on results, bugs, and performance!
The motivation of our ConST model is to learn similar representations for semantically similar speech and text.
ConST (1) inherits the advantages of multi-task learning (as shown in our previous paper XSTNet (with code)), and (2) employs a contrastive learning approach to bridge the gap between the low-level speech representation and the text embedding.
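To make the objective concrete, below is a minimal PyTorch sketch of an InfoNCE-style cross-modal contrastive loss over a batch of paired speech/text sentence representations. The function name, the mean-pooling assumption, and the temperature value are illustrative, not the exact code of this repo:

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_repr, text_repr, temperature=0.05):
    # speech_repr, text_repr: (batch, dim) sentence-level representations,
    # e.g. mean-pooled encoder outputs of the paired speech and transcript.
    speech_repr = F.normalize(speech_repr, dim=-1)
    text_repr = F.normalize(text_repr, dim=-1)
    # (batch, batch) cosine-similarity matrix; the diagonal holds the positive pairs.
    logits = speech_repr @ text_repr.t() / temperature
    targets = torch.arange(speech_repr.size(0), device=speech_repr.device)
    # InfoNCE: pull each speech/text pair together, push it apart from the
    # other in-batch pairs, which serve as negatives.
    return F.cross_entropy(logits, targets)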
We report case-sensitive detokenized BLEU, computed with the sacreBLEU toolkit (see the example after the table).
Model | En-De | En-Es | En-Fr | En-It | En-Nl | En-Pt | En-Ro | En-Ru | Avg. |
---|---|---|---|---|---|---|---|---|---|
ConST-base | 25.7 | 30.4 | 36.8 | 26.3 | 30.6 | 32.0 | 24.8 | 17.3 | 28.0 |
ConST-expand | 28.3 | 32.0 | 38.3 | 27.2 | 31.7 | 33.1 | 25.6 | 18.9 | 29.4 |
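For reference, a hedged example of how such scores can be computed with sacreBLEU's Python API; the hypothesis and reference strings are placeholders:

import sacrebleu

hyps = ["Das ist ein Test."]              # detokenized system outputs
refs = [["Das ist ein Test."]]            # one list per reference stream
bleu = sacrebleu.corpus_bleu(hyps, refs)  # case-sensitive by default
print(round(bleu.score, 1))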
Experience our end-to-end speech-to-text translation system on Hugging Face Spaces now! Record a sentence in English and translate it into other languages! You are a TRANSLATOR!
HERE IS THE WEBSITE:
https://huggingface.co/spaces/ReneeYe/ConST-speech2text-translator
P.S. Since Hugging Face Spaces only provides CPUs, it takes 12-20 seconds to run inference and generate the translation.
The models are trained with PyTorch. You may download all of them from 🤗 Hugging Face (see the snippet after the table).
Datasets | Model | SPM & Vocab |
---|---|---|
En-De | Download | SPM model; Vocab |
En-Es | Download | SPM model; Vocab |
En-Fr | Download | SPM model; Vocab |
En-It | Download | SPM model; Vocab |
En-Nl | Download | SPM model; Vocab |
En-Pt | Download | SPM model; Vocab |
En-Ro | Download | SPM model; Vocab |
En-Ru | Download | SPM model; Vocab |
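If the checkpoints are hosted as Hugging Face model repos, they can also be fetched programmatically with huggingface_hub. The repo_id and filename below are hypothetical placeholders; substitute the actual entries linked in the table above:

from huggingface_hub import hf_hub_download

# NOTE: repo_id and filename are assumptions for illustration only.
ckpt_path = hf_hub_download(repo_id="ReneeYe/ConST-en2de", filename="checkpoint.pt")
print(ckpt_path)  # local cache path of the downloaded checkpoint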
git clone git@github.com:ReneeYe/ConST.git
cd ConST
pip3 install -r requirements.txt
pip3 install --editable ./
The instructions for data pre-processing are here. To train the model, taking En-De as an example, run:
bash ConST/scripts/train_en2x.sh de checkpoint/model_saved
The first argument is the target language code and the second is the directory where checkpoints will be saved.
We strongly recommend averaging the checkpoints around the best one (i.e. the checkpoint with the highest BLEU on the dev set):
python3 ConST/scripts/average_checkpoints.py --inputs checkpoint/model_saved \
--num-update-checkpoints 10 --checkpoint-upper-bound ${step-to-get-the-best-dev} \
--output ${path-to-averaged-ckpt}
Then generate translations and evaluate the model:
fairseq-generate data/ --gen-subset tst-COMMON_st --task speech_to_text --prefix-size 1 \
--max-tokens 4000000 --max-source-positions 4000000 --beam 10 \
--config-yaml config_st.yaml --path ${path-to-averaged-ckpt} \
--scoring sacrebleu
@InProceedings{ye2022cross,
  author    = {Rong Ye and Mingxuan Wang and Lei Li},
  booktitle = {Proc. of NAACL},
  title     = {Cross-modal Contrastive Learning for Speech Translation},
  year      = {2022}
}