YuanGongND / gopt

Code for the ICASSP 2022 paper "Transformer-Based Multi-Aspect Multi-Granularity Non-native English Speaker Pronunciation Assessment".
BSD 3-Clause "New" or "Revised" License
152 stars 28 forks source link

GOPT: Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

News

January 2023, YIFAN WANG contributed a step-by-step tutorial on how to do inference with your own data, see [Here]. However, a bug has been reported and needs to be addressed before use.

Introduction

Illustration of GOPT.

Illustration of GOPT.

This repository contains the official implementation and pretrained model (in PyTorch) of the Goodness Of Pronunciation Feature-Based Transformer (GOPT) proposed in the ICASSP 2022 paper Transformer-Based Multi-Aspect Multi-Granularity Non-native English Speaker Pronunciation Assessment (Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass; MIT & PAII).

GOPT is the first model to simultaneously consider multiple pronunciation quality aspects (accuracy, fluency, prosody, etc) along with multiple granularities (phoneme, word, utterance). With a public automatic speech recognition (ASR) model, it achieves 0.612 phone-level Pearson correlation coefficient (PCC), 0.549 word-level PCC, and 0.742 sentence-level PCC, all are the best results on SpeechOcean762. PWC PWC PWC

We intend to make our results easy to reproduce, specifically, we provide our Kaldi intermediate outputs so that you can reproduce our result without Kaldi in almost one-click (just download our Kaldi output and ./run.sh, or even more simpler, run the Google Colab script Open In Colab).

Citing

Please cite our paper if you find this repository useful.

@INPROCEEDINGS{gong_gopt,
  author={Gong, Yuan and Chen, Ziyi and Chu, Iek-Heng and Chang, Peng and Glass, James},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment}, 
  year={2022},
  pages={7262-7266},
  doi={10.1109/ICASSP43922.2022.9746743}}

Data

The SpeechOcean762 dataset used in ths paper is an open dataset licenced with CC BY 4.0. It can be downloaded from this link.

(Google Colab Scripts) Train and evaluate GOPT with speechocean 762 dataset

We provide Google Colab Script Open In Colab for quick test.

What you need:

What you don't need:

Please Note

(Local Scripts) Train and evaluate GOPT with speechocean 762 dataset

The following is a step-by-step instruction of training and evaluating GOPT with the speechocean 762 dataset.

If you are not familiar with Kaldi, or you are not interested in GOPT feature generation, we provide our intermediate GOP features and this recipe is Kaldi-Free (please see below for details). Otherwise if you want to use your own ASR model, you can NOT skip step 1 and 2.

Step 0. Prepare the environment.

Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.

# use absolute path of this repo
gopt_path=your_gopt_path
cd $gopt_path
python3 -m venv venv-gopt
source venv-gopt/bin/activate
pip install -r requirements.txt 

Step 1. Prepare the speechocean762 dataset and generate the Godness of Pronunciation (GOP) features.

(This step is Kaldi dependent and require familiarity with Kaldi. You can skip this step and step 2 by using our output of this step (download via the dropbox link or 腾讯微云链接, please see [here] for details.))

Downlod the speechocean762 dataset from [here]. Use your own Kaldi ASR model or public Kaldi ASR model (e.g., the Librispeech ASR Chain Model we used) and run Kaldi GOP recipe following its instruction. After the run finishes, you should see the performance of the baseline model with the ASR model you use.

Then, extract the GOP features from the intermediate files of the Kaldi GOP recipe run.

kaldi_path=your_kaldi_path
cd $gopt_path
mkdir -p data/raw_kaldi_gop/librispeech
cp src/extract_kaldi_gop/{extract_gop_feats.py,extract_gop_feats_word.py} ${kaldi_path}/egs/gop_speechocean762/s5/local/
cd ${kaldi_path}/egs/gop_speechocean762/s5
python local/extract_gop_feats.py
python local/extract_gop_feats_word.py
cd $gopt_path
cp -r ${kaldi_path}/egs/gop_speechocean762/s5/gopt_feats/* data/raw_kaldi_gop/librispeech

For questions regarding the Kaldi recipe (e.g., how to generate GOP feature for a single wav file), please kindly check the issues of the Kaldi GOP recipe at [here].

Step 2. Convert GOP features and labels to sequences

(You can skip this step and step 1 by using our output of this step (download via the dropbox link or 腾讯微云链接, please see [here] for details.))

The Kaldi output GOP features and labels are at phone level. To model pronunciation assessment as a sequence-to-sequence problem, we need to convert the feature to shape like [#utterance, seq_len, feat_dim]. Specifically, we pad all utterance into 50 tokens (phones) with -1, i.e., seq_len=50. The padded tokens are masked out for any metric calculation.

Use the following scripts for this step:

mkdir data/seq_data_librispeech
cd src/prep_data
python gen_seq_data_phn.py
python gen_seq_data_word.py
python gen_seq_data_utt.py

Step 3. Run Training and Evaluation

The entry point of the training and evaluation scripts is gopt/src/run.sh, which calls gopt/src/traintest.py, which then calls gopt/src/models/gopt.py. Just run the following code snippet.

cd gopt/src
(slurm user) sbatch run.sh
(local user) ./run.sh

Results, best model, and predictions will be saved in the exp_dir specified in gopt/src/run.sh.

Pretrained Models

We provide three pretrained models and corresponding training logs. They are in gopt/pretrained_models/.

Phn MSE Phn PCC Word Acc PCC Word Str PCC Word Total PCC Utt Acc PCC Utt Comp PCC Utt Flu PCC Utt Pros PCC Utt Total PCC
GOPT (Librispeech) 0.084 0.616 0.536 0.326 0.552 0.718 0.109 0.756 0.764 0.743
GOPT (PAII-A) 0.069 0.679 0.595 0.150 0.606 0.727 -0.044 0.692 0.695 0.731
GOPT (PAII-B) 0.071 0.664 0.592 0.174 0.602 0.722 0.122 0.721 0.723 0.740

Test Your Own Model with Our Speechocean762 Traning Pipeline

It is extremely easy to train and test your model with our speechocean762 training pipeline and compare it with GOPT. You don't even need Kaldi or any data processing if you plan to use the same ASR models with us.

Specifically, your model need to be in pytorch and take input and generate output in the following format:

Add your model to gopt/src/models/, modify gopt/src/models/__init__.py and gopt/src/traintest.py to include your model. Then just follow the instructions. You can skip step 1 and 2 by using our intermediate data files.

Contact

If you have a question, please bring up an issue (preferred) or send me an email yuangong@mit.edu. For questions regarding the Kaldi recipe (e.g., how to generate GOP feature for a single wav file), please kindly check the issues of the Kaldi GOP recipe at [here].