MontrealCorpusTools / mfa-models

Collection of pretrained models for the Montreal Forced Aligner
Creative Commons Attribution 4.0 International

Dictionary format is unclear. #8

Closed AndreyBocharnikov closed 2 years ago

AndreyBocharnikov commented 2 years ago

Hello and thank you for your work.

I was working with russian_mfa.dict (downloaded via mfa model download dictionary russian_mfa), and its format is unclear. A typical line is ноутбуков 1 0.0 0.0 0.0 n̪ o ʊ d̪ b u k ə f. I understand that the word comes at the very beginning of the line and its phonemes at the end, and that the first number is some probability, but I can't figure out what the other three numbers mean. I looked at the documentation here, but there is nothing about the format :(

This matters because the output of mfa g2p russian_mfa oov.txt oov_phonemes.txt has the following format: жбанков ('ʐ', 'b', 'a', 'n̪', 'k', 'ə', 'f'). It's unclear how to merge the existing dictionary with the OOV words, because the formats are different.

Could you please explain the format of russian_mfa.dict, or point me to where I can read about it? Best wishes
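For reference, the g2p output line quoted above can be flattened into a plain "word, then phones" line with a few lines of Python. This is only a sketch: it assumes every output line is a word followed by a Python-style tuple literal, exactly as in the example, and the function name g2p_line_to_entry and the tab separator are illustrative choices, not part of MFA.

```python
import ast


def g2p_line_to_entry(line: str) -> str:
    """Turn "word ('p1', 'p2', ...)" into "word<TAB>p1 p2 ..."."""
    # Split off the word at the first run of whitespace; the rest is the tuple text.
    word, tuple_text = line.strip().split(maxsplit=1)
    # literal_eval parses the tuple literal without executing arbitrary code.
    phones = ast.literal_eval(tuple_text)
    return word + "\t" + " ".join(phones)


example = "жбанков ('ʐ', 'b', 'a', 'n̪', 'k', 'ə', 'f')"
print(g2p_line_to_entry(example))
```

The result is a word followed by space-separated phones, without any probability columns.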

mmcauliffe commented 2 years ago

The released dictionaries have pronunciation probabilities and silence probabilities encoded. As you noted, the first number is the pronunciation probability, normalized by the maximum count for a given word. The second number is the probability that silence will follow this pronunciation, and the final two numbers are corrective factors for how likely a pronunciation is to have silence before it. You can read more details about it in the dictionary format page of the MFA docs. I've also updated the model cards to note the format and added links to the non-probabilistic versions of each dictionary.
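To make the column layout described above concrete, here is a rough Python sketch that splits one probabilistic line into its parts and writes it back without the numeric columns. It assumes whitespace-separated fields, a single-token word, and exactly four numbers before the phones, as in the ноутбуков example from the question. The names Entry, parse_probabilistic_line, and to_plain_line are made up for illustration and are not MFA code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    word: str
    pronunciation_prob: float            # first number
    silence_after_prob: float            # second number
    silence_before_corrections: Tuple[float, float]  # final two numbers
    phones: List[str]


def parse_probabilistic_line(line: str) -> Entry:
    """Split "word p1 p2 p3 p4 phone phone ..." into named fields."""
    fields = line.split()
    numbers = [float(x) for x in fields[1:5]]
    return Entry(
        word=fields[0],
        pronunciation_prob=numbers[0],
        silence_after_prob=numbers[1],
        silence_before_corrections=(numbers[2], numbers[3]),
        phones=fields[5:],
    )


def to_plain_line(entry: Entry) -> str:
    """Drop the numeric columns, keeping only the word and its phones."""
    return entry.word + "\t" + " ".join(entry.phones)


example = "ноутбуков 1 0.0 0.0 0.0 n̪ o ʊ d̪ b u k ə f"
entry = parse_probabilistic_line(example)
print(entry.pronunciation_prob, entry.phones)
print(to_plain_line(entry))
```

Stripping the numbers this way gives lines in the same word-plus-phones shape as a flattened g2p output, which may be easier to combine with OOV entries than mixing the two formats; the linked non-probabilistic dictionaries serve the same purpose without any conversion.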