MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Have some way to model stress in Japanese MFA #810

Open leandro-gracia-gil opened 1 month ago

leandro-gracia-gil commented 1 month ago

A typical example where stress is important in Japanese is 橋 (bridge, read as はし, hashi) vs 箸 (chopsticks, also read as はし, hashi). The former (橋) has stress on the second syllable, while the latter (箸) has stress on the first syllable.

Currently, the Japanese MFA G2P model seems to produce "h a ɕ i" for both. The Japanese MFA v3.0.0 dictionary is similar, although its most likely pronunciation for 箸 lists the last phoneme as voiceless ("h a ɕ i̥"). The first syllable, which is the one carrying the stress, is the same in both, though.

Is there any way to support modelling these differences in stress in the generated IPA phonemes, or in some other way? I see there are primary and secondary stress symbols in IPA, but I don't know enough to judge if this would be a good approach or not.

mmcauliffe commented 3 weeks ago

So it's a bit tricky to really do this in Japanese, since it's a pitch-accent language. The difference between the two words is which syllable has higher versus lower pitch, but each word will have the same loudness/length for the two syllables (controlling for things like speaker, speech rate, word frequency, focus, etc.). In a stress language like English, stress has large effects on vowel quality: there are vowels that only appear in stressed syllables and vowels that only appear in unstressed syllables. So word pairs like "proJECT" (i.e. to project an image on a screen) vs "PROJect" (i.e. a project that you work on) are going to have different vowels in the first syllable, in addition to the syllable-level differences in length, loudness, and pitch.

With that said, for the MFA models in particular, no pitch/voicing features are currently used in acoustic models, though devoiced vowels are used as you've mentioned. Those are generated through the phonological rules at acoustic model training time (see the Japanese phonological rule config), with the G2P model generating more "citation" forms.

You can use the Wiktionary entries for 橋 and 箸 to get the relevant pitch accent, and then you can apply the devoicing rules to the non-high-pitch syllables? I don't think I've seen any literature that mentions that high vowel devoicing is dependent on pitch accent, but it makes sense that if a syllable has high pitch, it wouldn't be devoiced. Additionally, you can see the calculated probability of application for the devoicing rules in the rules.yaml file from the japanese_mfa acoustic model (a rough sketch of applying one of these rules follows the excerpt):

  - following_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    non_silence_before_correction: 0.0
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.32
    replacement: i̥
    segment: i
    silence_after_probability: 0.82
    silence_before_correction: 0.01
  - following_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    non_silence_before_correction: 0.03
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.89
    replacement: ''
    segment: i
    silence_after_probability: 1.65
    silence_before_correction: -0.01
  - following_context: $
    non_silence_before_correction: 0.08
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.1
    replacement: i̥
    segment: i
    silence_after_probability: 0.23
    silence_before_correction: -0.33
  - following_context: $
    non_silence_before_correction: 0.08
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.23
    replacement: ''
    segment: i
    silence_after_probability: 0.46
    silence_before_correction: -0.26
  - following_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    non_silence_before_correction: -0.01
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.29
    replacement: ɯ̥
    segment: ɯ
    silence_after_probability: 1.36
    silence_before_correction: 0.05
  - following_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    non_silence_before_correction: -0.01
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.69
    replacement: ''
    segment: ɯ
    silence_after_probability: 1.0
    silence_before_correction: 0.14
  - following_context: $
    non_silence_before_correction: -0.02
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.25
    replacement: ɯ̥
    segment: ɯ
    silence_after_probability: 23.0
    silence_before_correction: 0.02
  - following_context: $
    non_silence_before_correction: 0.04
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.24
    replacement: ''
    segment: ɯ
    silence_after_probability: 5.5
    silence_before_correction: -0.19
  - following_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    non_silence_before_correction: -0.01
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.47
    replacement: ɨ̥
    segment: ɨ
    silence_after_probability: 0.94
    silence_before_correction: 0.03
  - following_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    non_silence_before_correction: -0.01
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.82
    replacement: ''
    segment: ɨ
    silence_after_probability: 0.76
    silence_before_correction: 0.1
  - following_context: $
    non_silence_before_correction: -0.02
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.35
    replacement: ɨ̥
    segment: ɨ
    silence_after_probability: 8.25
    silence_before_correction: -0.04
  - following_context: $
    non_silence_before_correction: 0.05
    preceding_context: '[skctɕɸçhp][ʲsɕ]?ː?'
    probability: 0.34
    replacement: ''
    segment: ɨ
    silence_after_probability: 3.38
    silence_before_correction: -0.17
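
For concreteness, here's a rough sketch (not MFA's internal implementation; the apply_rule helper is made up for illustration) of what applying one of these rules deterministically would look like, ignoring the probability and correction fields and assuming the file parses to a plain list of rule dicts like the excerpt:

```python
# Rough sketch of applying one devoicing rule, ignoring its probability and
# correction fields. Assumes the file parses to a plain list of rule dicts
# like the excerpt above, and that the pronunciation is a plain string of
# phone symbols with no separators.
import re
import yaml

with open("rules.yaml", encoding="utf-8") as f:
    rules = yaml.safe_load(f)

def apply_rule(phones: str, rule: dict) -> str:
    # Match the segment only when it sits between the rule's contexts; the
    # preceding context is captured and restored in the replacement because
    # Python's re module can't do variable-width lookbehind.
    pattern = re.compile(
        "(" + rule["preceding_context"] + ")"
        + re.escape(rule["segment"])
        + "(?=" + rule["following_context"] + ")"
    )
    return pattern.sub(lambda m: m.group(1) + rule["replacement"], phones)

# Word-final /i/ devoicing after a voiceless consonant, as in 箸 "haɕi":
rule = next(r for r in rules
            if r["segment"] == "i" and r["replacement"] == "i̥"
            and r["following_context"] == "$")
print(apply_rule("haɕi", rule))  # -> haɕi̥
```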

If your end goal is to analyze differences in pronunciation between the different pitch-accent patterns, it might be worth just generating a dedicated resource from Wiktionary that maps words to their pattern, though you'd have to make some modifications to a scraping script like WikiPron, since the IPA transcription doesn't contain any pitch accent information.
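
As a starting point, something like this could pull the pattern out (the accent_positions helper is hypothetical, and it assumes the entry wikitext uses the {{ja-pron|...|acc=N}} template convention that Japanese entries on English Wiktionary typically follow):

```python
# Hypothetical sketch: pull the pitch-accent number (downstep position) out
# of a Wiktionary entry's wikitext. A real scraper would also need to fetch
# the wikitext and handle entries with multiple readings.
import re

def accent_positions(wikitext: str) -> list[int]:
    """Return every acc=N value found in {{ja-pron|...}} templates."""
    return [int(n) for n in
            re.findall(r"\{\{ja-pron\|[^{}]*?\bacc=(\d+)", wikitext)]

print(accent_positions("{{ja-pron|はし|acc=1}}"))  # 箸 -> [1]
print(accent_positions("{{ja-pron|はし|acc=2}}"))  # 橋 -> [2]
```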

leandro-gracia-gil commented 3 weeks ago

Thank you very much for the very detailed answer! It seems that, at the very least, I have been mixing up the concepts of pitch accent and stress.

My end goal would be to get different phoneme representations for these two cases from alignment results and from dictionaries, or to evaluate alternatives if that's not possible. That's why I was thinking about IPA stress symbols. But if I understood you correctly, it seems that for Japanese this would rather be pitch accent, which has no IPA representation.

Thanks for the WikiPron link, it sounds very useful for gathering pronunciation data from Wiktionary. Maybe I can combine this information with the IPA phonemes to get what I need, although I would need to figure out exactly how to map and combine these two pieces of information (e.g., ha̠ɕi and háꜜshì).
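
To sketch what I mean (the pitch_pattern helper below is just my guess at encoding the standard Tokyo-Japanese rule, not anything MFA provides):

```python
# Hypothetical sketch: derive an H/L tone per mora from the Wiktionary
# accent number, using the standard Tokyo-Japanese pattern (initial mora is
# low unless it is the accented one, high up to and including the accented
# mora, low after the downstep; accent 0 = heiban, low then high throughout).
def pitch_pattern(moras: list[str], accent: int) -> list[str]:
    tones = []
    for i in range(1, len(moras) + 1):
        if accent == 0:
            tones.append("L" if i == 1 else "H")
        elif i <= accent:
            tones.append("L" if i == 1 and accent != 1 else "H")
        else:
            tones.append("L")
    return tones

print(pitch_pattern(["ha", "ɕi"], 1))  # 箸: ['H', 'L']
print(pitch_pattern(["ha", "ɕi"], 2))  # 橋: ['L', 'H']
```

The per-mora tones could then be attached to the corresponding IPA phones from the dictionary or alignment output.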

Feel free to close this feature request if you don't have any further comments. Again, thank you very much for the elaborate answer!

leandro-gracia-gil commented 3 weeks ago

By the way, do other languages in MFA that use stress, like English, currently represent this information in the IPA phonemes they use in dictionaries and returned alignments (the PROJect vs proJECT case you mentioned)?