NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Using G2P (Grapheme2Phoneme) for ASR #3258

Closed ahkarami closed 2 years ago

ahkarami commented 2 years ago

Hi, thank you for your great repo. I have two questions:

1. Is it better to use G2P for English ASR, or to just use the raw text? If yes, which G2P model do you suggest for this?
2. Is it better to use G2P for ASR in other languages (for example, Arabic ASR)?

Best

redoctopus commented 2 years ago

As far as I'm aware, we have not formally tried predicting phonemes and comparing those models' accuracy against grapheme-predicting models for any language. If anyone on the team knows better, please correct me.

I have trained a few QuartzNet models on mostly unambiguous phonemes from CMUdict (where heteronyms/homographs were stripped out or disambiguated), but those numbers wouldn't be accurate for general use.
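For illustration only, here is a minimal sketch of what "stripping out" ambiguous entries could look like. It is not the preprocessing actually used for those QuartzNet models: it assumes CMUdict's plain-text format (word followed by ARPAbet phonemes, alternate pronunciations marked with a `(n)` suffix), uses a tiny inline sample instead of the real dictionary, and simply drops every word with more than one listed pronunciation rather than disambiguating it. The names `load_pronunciations` and `unambiguous_only` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny inline sample in CMUdict-style format (assumption: the real file
# would be read from disk instead). Alternates carry a "(n)" suffix.
SAMPLE = """\
;;; comment lines start with three semicolons
CAT  K AE1 T
READ  R EH1 D
READ(1)  R IY1 D
DOG  D AO1 G
"""

_VARIANT = re.compile(r"\(\d+\)$")  # matches the "(1)", "(2)", ... suffix


def load_pronunciations(text):
    """Map each word to the list of pronunciations listed for it."""
    prons = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phonemes = line.split()
        word = _VARIANT.sub("", word)  # fold READ(1) into READ
        prons[word].append(tuple(phonemes))
    return prons


def unambiguous_only(prons):
    """Keep only words with exactly one distinct pronunciation."""
    return {w: p[0] for w, p in prons.items() if len(set(p)) == 1}


lexicon = unambiguous_only(load_pronunciations(SAMPLE))
print(sorted(lexicon))  # READ is dropped because it has two pronunciations
```

Note that this crude filter also discards words whose alternates are only minor variants, not just true heteronyms, which is one reason results on such a lexicon would not carry over to general use.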

I'd hazard a guess that, if your G2P model was accurate, a phoneme prediction model would perform a little better than a grapheme (raw text) prediction model due to the lack of ambiguity, at least for English. But then you'd have to translate back from phonemes to words (assuming you're not just using the phonemes for something else) and figure out how you want to deal with accents/regional pronunciations, which is another story.
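To make the "translate back from phonemes to words" step concrete, here is a rough sketch under stated assumptions: a hypothetical toy lexicon (stress markers omitted, entries chosen by hand), inverted so that a phoneme sequence looks up its candidate words. It is not a NeMo API, and it handles only isolated words, not segmentation of a continuous phoneme stream.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> ARPAbet phonemes (stress omitted).
LEXICON = {
    "two": ("T", "UW"),
    "too": ("T", "UW"),
    "to": ("T", "UW"),
    "cat": ("K", "AE", "T"),
}


def invert(lexicon):
    """Build a phoneme-sequence -> candidate-words mapping."""
    inverse = defaultdict(list)
    for word, phonemes in lexicon.items():
        inverse[phonemes].append(word)
    # Sort candidates so the result is deterministic.
    return {p: sorted(ws) for p, ws in inverse.items()}


INVERSE = invert(LEXICON)


def candidates(phonemes):
    """Return every word whose pronunciation matches the phoneme sequence."""
    return INVERSE.get(tuple(phonemes), [])


print(candidates(["K", "AE", "T"]))  # a single candidate
print(candidates(["T", "UW"]))       # several homophones share this sequence
```

Homophones such as these mean the reverse mapping is one-to-many, so choosing the right word would still need context (for example, a language model), and accent or regional variants would add further entries per word.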

ahkarami commented 2 years ago

Thanks for your great explanation. Best