MontrealCorpusTools / mfa-models

Collection of pretrained models for the Montreal Forced Aligner
Creative Commons Attribution 4.0 International

UA+RU dicts should have accents #26

Open hypnaceae opened 9 months ago


Ukrainian and Russian have many homographs that are disambiguated in speech by syllable stress, and in text by context or by stress diacritics.

Example:

до́ма: [ˈdomə]
дома́: [dɐˈma]

This is represented in the MFA dict as:

дома    0.99    0.55    0.56    1.1     d̪ o m ə
дома    0.1     0.44    1.18    0.93    d̪ ɐ m a

It would make sense to include accent markers in dict entries for compatibility with TTS systems that use auto-accenting for disambiguation at runtime, which is all of them as far as I'm aware. Supplying accents would remove the ambiguity from the dict, so for homographs MFA would no longer need to rely on probabilistic identification at runtime.

Like so:

до́ма    0.99    0.55    0.56    1.1     d̪ o m ə
дома́    0.1     0.44    1.18    0.93    d̪ ɐ m a

Or so:

до+ма   0.99    0.55    0.56    1.1     d̪ o m ə
дома+   0.1     0.44    1.18    0.93    d̪ ɐ m a
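
The two notations are interchangeable, so whichever one the dict uses can be converted to the other. A minimal sketch, assuming the diacritic is the combining acute accent (U+0301) and that `+` follows the stressed vowel; the function names are hypothetical and not part of MFA:

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # assumption: stress is marked with U+0301


def acute_to_plus(word: str) -> str:
    """Turn an acute stress mark into a '+' after the stressed vowel."""
    # NFD splits any precomposed letters so the accent is a separate code point.
    decomposed = unicodedata.normalize("NFD", word)
    # NFC afterwards restores letters such as й and ё to their usual form.
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_ACUTE, "+"))


def plus_to_acute(word: str) -> str:
    """Turn a '+' after the stressed vowel into a combining acute accent."""
    return unicodedata.normalize("NFC", word.replace("+", COMBINING_ACUTE))


def strip_stress(word: str) -> str:
    """Remove stress marks in either notation, giving the plain spelling."""
    decomposed = unicodedata.normalize("NFD", word)
    plain = decomposed.replace(COMBINING_ACUTE, "").replace("+", "")
    return unicodedata.normalize("NFC", plain)
```

The normalization steps matter because й, ё and ї also decompose into a base letter plus a combining mark, and only the acute accent should be touched.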

Caveat: this would require transcriptions to have accents, so the aligner code would need an extra check: if the transcription is not accented, ignore the accents in the dict and fall back to the probabilities (i.e. the current behaviour). It is also not trivial for a third party to add accents back into the dict properly. Ideally this would be done during dict generation, hence this issue.
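
To illustrate the fallback check, here is a rough sketch of the lookup logic. It is not MFA's actual code: `build_index` and `lookup` are hypothetical names, entries are simplified to (word, probability, phones), and stress is assumed to be marked with the combining acute accent (U+0301).

```python
import unicodedata
from collections import defaultdict

COMBINING_ACUTE = "\u0301"  # assumption: stress is marked with U+0301


def strip_stress(word):
    # Decompose, drop the acute accent, recompose (keeps й, ё, ї intact).
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_ACUTE, ""))


def build_index(entries):
    """Index accented dict entries by exact form and by unaccented form."""
    exact = defaultdict(list)
    unaccented = defaultdict(list)
    for word, prob, phones in entries:
        exact[unicodedata.normalize("NFC", word)].append((prob, phones))
        unaccented[strip_stress(word)].append((prob, phones))
    return exact, unaccented


def lookup(token, exact, unaccented):
    """Return candidate pronunciations for one transcript token."""
    key = unicodedata.normalize("NFC", token)
    has_accent = COMBINING_ACUTE in unicodedata.normalize("NFD", token)
    if has_accent and key in exact:
        # Accented transcript: the stress mark picks the pronunciation.
        return exact[key]
    # Unaccented transcript: return every variant with its probability,
    # i.e. the current behaviour.
    return unaccented.get(strip_stress(token), [])


entries = [
    ("до\u0301ма", 0.99, "d̪ o m ə"),
    ("дома\u0301", 0.1, "d̪ ɐ m a"),
]
exact, unaccented = build_index(entries)
accented_hit = lookup("дома\u0301", exact, unaccented)
plain_hit = lookup("дома", exact, unaccented)
```

With this, an accented token yields a single pronunciation, while a plain token still gets both candidates, so unaccented corpora would keep working unchanged.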