Open rafaelvalle opened 4 years ago
I’m also trying to find the same information. Having run the algorithm a couple of times, it seems the choice is not random. Is there some model for this, or does mfa choose the right pronunciation based on the acoustic model somehow?
Do you mean the case where the word A has multiple transcriptions? In the above example, it should always pick A because A(1) is not considered a pronunciation variant (different orthography).
For the case of
A EY1
A AH1
The pronunciation weights are equal, so over the course of training, the acoustic model will start carrying the bulk for deciding between them. You can also specify probabilities between different word forms, i.e.:
A 0.25 EY1
A 1 AH1
Note that they don't sum to one. The convention is to give the highest-probability pronunciation a value of 1, so that a word isn't penalized for having many variants, which would reduce accuracy.
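To make the format concrete, here is a minimal sketch of reading entries like the ones above, assuming whitespace-separated fields and an optional probability in the second column. The function names are hypothetical and this is not MFA's own parser, just an illustration of the "highest variant gets 1" convention.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Parse 'WORD [PROB] PHONE ...' lines into {word: [(prob, phones), ...]}.

    Assumption: a second field that parses as a float is a probability;
    otherwise the entry gets the default weight of 1.0.
    """
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, rest = fields[0], fields[1:]
        try:
            prob, phones = float(rest[0]), rest[1:]
        except ValueError:
            prob, phones = 1.0, rest
        lexicon[word].append((prob, tuple(phones)))
    return dict(lexicon)


def rescale_to_max_one(variants):
    """Rescale weights so the most likely variant has weight 1."""
    top = max(prob for prob, _ in variants)
    return [(prob / top, phones) for prob, phones in variants]


lexicon = parse_lexicon(["A 0.25 EY1", "A 1 AH1", "A(1) EY1"])
print(lexicon["A"])
print(rescale_to_max_one([(1.0, ("EY1",)), (4.0, ("AH1",))]))
```

Note that A(1) ends up under its own key here, which matches the point above that it is a different orthography rather than a variant of A.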
You can also estimate them from a speech corpus with the mfa train_dictionary command (https://montreal-forced-aligner.readthedocs.io/en/latest/training_dictionary.html#training-dictionary). It will align the corpus and estimate the probabilities from the counts of pronunciations that the aligner picked when aligning.
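As a rough illustration of estimating probabilities from counts, the sketch below tallies which pronunciation was picked for each word token and scales the counts so the most frequent variant gets 1. The input format and function name are assumptions for the example; the real command may smooth or otherwise adjust the counts.

```python
from collections import Counter, defaultdict


def estimate_pronunciation_probs(aligned_choices):
    """Estimate per-word pronunciation weights from alignment picks.

    aligned_choices: iterable of (word, pronunciation) pairs, one per
    aligned word token (a hypothetical input format).
    Returns {word: {pronunciation: weight}} with the top variant at 1.0.
    """
    counts = Counter(aligned_choices)
    by_word = defaultdict(dict)
    for (word, pron), n in counts.items():
        by_word[word][pron] = n
    probs = {}
    for word, variants in by_word.items():
        top = max(variants.values())
        probs[word] = {pron: n / top for pron, n in variants.items()}
    return probs


# Toy data: the aligner picked AH1 four times and EY1 once for the word A.
choices = [("A", "AH1")] * 4 + [("A", "EY1")]
print(estimate_pronunciation_probs(choices))
```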
Can you elaborate on the bit 'acoustic model will start carrying the bulk for deciding between them'? The acoustic models are GMM-HMM models, right? So a Gaussian mixture model for the audio part and a hidden Markov model for picking the most likely sequence of phonemes, which decides which pronunciation variant of the word to use? Thanks.
How does the binary 'lib/align' handle entries in the dictionary with multiple values? For example: