How to infer with my own Data

YuanGongND / gopt

Code for the ICASSP 2022 paper "Transformer-Based Multi-Aspect Multi-Granularity Non-native English Speaker Pronunciation Assessment".

BSD 3-Clause "New" or "Revised" License


How to infer with my own Data #2

Closed aidenpearce001 closed 1 year ago

aidenpearce001 commented 2 years ago

How can I run inference with custom data?

YuanGongND commented 2 years ago

Hi there,

You basically need to follow the local script: first use the Kaldi so762 (speechocean762) recipe, with the same public Librispeech AM that we used, to generate the GOP features for your own data. This step requires some familiarity with Kaldi, but it really amounts to replacing the original test set with your own data.

Then you can load our pretrained GOPT model trained with the librispeech AM and do inference. This is quite straightforward.

-Yuan

TheoSeo93 commented 2 years ago

I'm also curious about this. It seems like I have to run the shell script from the so762 recipe to try some inference from custom data. But does it also mean that I can start from recording some pronunciations for myself and extract the feature files by using the Kaldi recipe? If so, do I just replace the .wav files with my own recordings in the so762 dataset and run the recipe?

Thank you Theo Seo

YuanGongND commented 2 years ago

> It seems like I have to run the shell script from the so762 recipe to try some inference from custom data.

Yes, you have to run the Kaldi so762 recipe to get the GOP features. FYI, GOP features rely on an ASR model, so Kaldi is probably the best tool for this.

> But does it also mean that I can start from recording some pronunciations for myself and extract the feature files by using the Kaldi recipe? If so, do I just replace the .wav files with my own recordings in the so762 dataset and run the recipe?

Yes, but you need to make the format consistent with the original so762 dataset. In particular, your wav files need to either have the same content as the original wav files, or you need to prepare the canonical transcripts (i.e., the ground-truth text) for them. Once your dataset is ready, you should be able to generate the GOP features.
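
As a rough illustration of the dataset-preparation step (not taken from the GOPT or Kaldi code), the sketch below writes the four standard Kaldi data-directory files (wav.scp, text, utt2spk, spk2utt) for a handful of recordings. The function name write_kaldi_data_dir, the IDs, and the paths are hypothetical, and upper-casing the transcripts assumes the lexicon uses upper-case words. The so762 recipe may expect additional files, so compare the result against the original test set directory before running it.

```python
from collections import defaultdict
from pathlib import Path


def write_kaldi_data_dir(recordings, out_dir):
    """Write wav.scp, text, utt2spk and spk2utt for custom recordings.

    recordings: iterable of (utt_id, spk_id, wav_path, transcript) tuples.
    All names here are illustrative; nothing is specific to the so762 recipe.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Kaldi expects these files to be sorted by their first column.
    rows = sorted(recordings, key=lambda r: r[0])
    spk2utt = defaultdict(list)

    with open(out / "wav.scp", "w") as wav_scp, \
            open(out / "text", "w") as text, \
            open(out / "utt2spk", "w") as utt2spk:
        for utt_id, spk_id, wav_path, transcript in rows:
            wav_scp.write(f"{utt_id} {wav_path}\n")
            # Assumption: the lexicon is upper-case, so transcripts are too.
            text.write(f"{utt_id} {transcript.upper()}\n")
            utt2spk.write(f"{utt_id} {spk_id}\n")
            spk2utt[spk_id].append(utt_id)

    # spk2utt is the inverse mapping: one line per speaker.
    with open(out / "spk2utt", "w") as f:
        for spk_id in sorted(spk2utt):
            f.write(f"{spk_id} {' '.join(spk2utt[spk_id])}\n")

    return out
```

For example, write_kaldi_data_dir([("spk01_utt001", "spk01", "/path/to/a.wav", "we call it bear")], "data/my_test") would create a data/my_test directory containing one line per file. Prefixing each utterance ID with its speaker ID keeps the utterance and speaker orderings consistent, which Kaldi's data validation generally expects.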

-Yuan

TheoSeo93 commented 2 years ago

Thanks for the reply!

> Yes, but you need to make the format consistent with the original so762 dataset. In particular, your wav files need to either have the same content as the original wav files, or you need to prepare the canonical transcripts (i.e., the ground-truth text) for them. Once your dataset is ready, you should be able to generate the GOP features.

I have another question. If I want to score my own recordings with the pretrained model, all I should need is the GOP features extracted by the recipe. The thing is, I'm still confused about GOP feature extraction. It seems like the dataset should also include human-annotated pronunciation scores to match the format of so762. Or is that not necessary?

Thanks again. Theo Seo

YuanGongND commented 2 years ago

> The thing is, I'm still confused about GOP feature extraction.

Kaldi certainly has a learning curve; you could raise an issue in the Kaldi repo.

> It seems like the dataset should also include human-annotated pronunciation scores to match the format of so762. Or is that not necessary?

No, I don't think so. Those scores are only used for evaluation; the Kaldi recipe does not need to know the ground-truth scores to make a prediction. Specifically, I think you can stop at https://github.com/kaldi-asr/kaldi/blob/9af2c5c16389e141f527ebde7ee432a0c1df9fb9/egs/gop_speechocean762/s5/run.sh#L152-L185 and ignore the last two stages if your goal is just to get the GOP features.

Rtut654 commented 1 year ago

@TheoSeo93 Hi! Have you succeeded in running inference on your own data?