MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[BUG] Unexpected G2P output for simple words, and multiple candidates should sort by probability, not alphabetic order #743

Open · iamanigeeit opened 7 months ago

iamanigeeit commented 7 months ago

Hello,

I have been experimenting with the pretrained English (US) MFA G2P model in Python.

from montreal_forced_aligner.g2p.generator import PyniniGenerator
from montreal_forced_aligner.models import G2PModel, ModelManager
language = "english_us_mfa"

manager = ModelManager()
manager.download_model("g2p", language)

model_path = G2PModel.get_pretrained_path(language)
g2p = PyniniGenerator(g2p_model_path=model_path, num_pronunciations=1)
g2p.setup()

Problems with simple words like my or hehe:

>>> g2p.rewriter('my')
['mʲ i']
>>> g2p.rewriter('hehe')
['h ə']

Sometimes this can be solved by increasing num_pronunciations:

>>> g2p = PyniniGenerator(g2p_model_path=model_path, num_pronunciations=2)
>>> g2p.setup()
>>> g2p.rewriter('my time')
['m aj tʰ aj m', 'm ɑ tʰ aj m', 'm ə tʰ aj m', 'mʲ i tʰ aj m']

The first one happens to be correct, but only because the candidates are sorted lexicographically, per

https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/blob/45ef83b07bacd4c1cd256d1bce2aca658b1c9e45/montreal_forced_aligner/g2p/generator.py#L184
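As a sketch of what sorting by probability could look like (my assumption, not MFA's actual API): pynini's rewrite library can already return the n lowest-weight paths, so candidates could be ranked by model score instead of alphabetically. The FST path below is hypothetical, and token-type/symbol-table handling is glossed over:

import pynini
from pynini.lib import rewrite

# Hypothetical: the pynini FST extracted from the pretrained G2P model archive
fst = pynini.Fst.read("english_us_mfa.fst")

# The n lowest-weight (most probable) candidates, rather than an alphabetical sort
candidates = rewrite.top_rewrites("my", fst, nshortest=4)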

So the first candidate is wrong in other cases:

>>> g2p.rewriter('you do')
['j ə d ow', 'j ə d ʉː', 'j ʉː d ow', 'j ʉː d ʉː']

And the cross product makes a sentence with $n$ words and $k$ candidates per word take $O(k^n)$, surely not what we want when we only need the best match:

https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/blob/45ef83b07bacd4c1cd256d1bce2aca658b1c9e45/montreal_forced_aligner/g2p/generator.py#L390
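For concreteness (my arithmetic, not a measurement): with $k = 4$ candidates per word, a 10-word sentence already expands to $4^{10} \approx 10^6$ combinations, even though only 10 per-word shortest-path searches are needed for the best match.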

Maybe the most straightforward method is to take top_rewrite for each word and just join them?
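Roughly like this, as a sketch (reusing the fst loaded above; top_rewrite is pynini's single-best helper, and the whitespace tokenization is my assumption):

from pynini.lib import rewrite

def g2p_sentence(sentence: str) -> str:
    # One shortest-path search per word: n searches in total instead of
    # enumerating the k**n cross product of per-word candidates.
    return " ".join(rewrite.top_rewrite(word, fst) for word in sentence.split())

That would keep generation linear in sentence length.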

mmcauliffe commented 7 months ago

So in general I would not recommend using G2P for common words. The grapheme sequence "my" is overwhelmingly going to be pronounced as "mʲ i" because of the sheer number of words that end in "my", like "alchemy", "anatomy", etc., and those words are weighted the same as the word "my". The use case for G2P is generating pronunciations for low-frequency words, not the simple words you're using above: those are all covered by a pronunciation dictionary and behave quite differently from longer, less frequent words. So you'd get better results by using dictionary lookups against a pronunciation dictionary and only using G2P as a fallback.
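A minimal sketch of that dictionary-first strategy (the .dict path is hypothetical, and the simple word-then-phones line format is assumed, ignoring the optional pronunciation-probability columns MFA dictionaries can carry):

def load_lexicon(path):
    # One entry per line: word followed by space-separated phones;
    # keep only the first listed pronunciation per word.
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *phones = line.split()
            lexicon.setdefault(word, " ".join(phones))
    return lexicon

lexicon = load_lexicon("english_us_mfa.dict")  # hypothetical path

def pronounce(word, g2p):
    # Dictionary lookup first; fall back to G2P only for OOV words.
    return lexicon.get(word) or g2p.rewriter(word)[0]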

iamanigeeit commented 3 months ago

@mmcauliffe Thanks for the reply. I understand everything is geared towards aligning audio rather than other applications. Is there a way to do normal G2P in MFA, or do I have to customize G2P to check the dictionary for known words?

This seems to be a natural extension for MFA, since the phoneme definitions are universal (?) and based on actual acoustics. Of the multilingual G2P models available, MFA is definitely better than espeak-ng, while NeMo G2P requires downloading the entire framework. Not sure about epitran (will try next).