MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License
1.26k stars, 242 forks

[Question] How are tones handled? #816

Open lars76 opened 3 weeks ago

lars76 commented 3 weeks ago

I noticed that the default Mandarin acoustic model uses phone groups to combine tones. Does this mean that tones have no effect on the alignment?

mmcauliffe commented 3 weeks ago

Grouping tonal variants follows the recommendation in https://kaldi-asr.org/doc/tree_externals.html. The effect is that the tonal variants of a phone share training data, which addresses data sparsity. That way no vowel-tone combination ends up with too little data to model the vowel properly, and vowel modeling is the primary aspect contributing to alignment. But while the variants share the same root of the decision tree, each phone can still end up with different leaves (leading to different sequences of PDFs), so the tone can certainly have an impact on the alignment.
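To make the grouping idea concrete, here is a minimal Python sketch of how tonal variants could be collected under a shared tree root. This is not MFA's actual code: the helper names (`base_phone`, `group_tonal_variants`, `roots_lines`) are hypothetical, and it assumes tones are written as trailing Chao tone letters on each phone. The output lines imitate the `shared split ...` layout of a Kaldi `roots.txt` file, where phones on one line share a root.

```python
from collections import defaultdict

# Assumption: tones are marked with trailing Chao tone letters (e.g. "a˥˩").
TONE_LETTERS = "˥˦˧˨˩"


def base_phone(phone: str) -> str:
    """Strip trailing tone letters; fall back to the phone itself if nothing is left."""
    return phone.rstrip(TONE_LETTERS) or phone


def group_tonal_variants(phones):
    """Map each toneless base phone to the list of its tonal variants."""
    groups = defaultdict(list)
    for phone in phones:
        groups[base_phone(phone)].append(phone)
    return dict(groups)


def roots_lines(groups):
    """Render one line per group, imitating the layout of a Kaldi roots.txt file."""
    return [f"shared split {' '.join(variants)}" for variants in groups.values()]


if __name__ == "__main__":
    phones = ["a˥˥", "a˥˩", "i˨˩˦", "s"]
    for line in roots_lines(group_tonal_variants(phones)):
        print(line)
```

Because the variants only share a root, the tree can still ask questions that separate them further down, which is why each tonal variant can end up with its own leaves.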

One thing to note is that the tonal variants should generally differ only in their pitch features, not in their MFCC features: the MFCCs have only 13 coefficients and won't include any information from harmonics in the spectrum. By default, MFA models don't use raw f0, but rather a log-pitch normalized over a 1.5 second window (see the process-kaldi-pitch-feats binary docs for more details), along with a probability-of-voicing feature.
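As a rough illustration of what "normalized log-pitch" means, the sketch below subtracts a sliding-window mean from the log of f0. It is a simplification, not the Kaldi implementation: the function name is hypothetical, the mean here is unweighted (Kaldi weights frames by their probability of voicing), and the 75-frame left and right contexts are an assumption chosen so that the window covers about 1.5 seconds at a 10 ms frame shift.

```python
import math


def normalized_log_pitch(f0_hz, left=75, right=75):
    """Subtract a moving-window mean from log-f0 (simplified, unweighted sketch).

    f0_hz: per-frame pitch estimates in Hz (all values must be > 0).
    left/right: context in frames; 75 + 1 + 75 frames is roughly 1.5 s at 10 ms shift.
    """
    log_f0 = [math.log(f) for f in f0_hz]
    n = len(log_f0)
    normalized = []
    for t in range(n):
        # Clip the window at the edges of the utterance.
        window = log_f0[max(0, t - left):min(n, t + right + 1)]
        normalized.append(log_f0[t] - sum(window) / len(window))
    return normalized


if __name__ == "__main__":
    rising = [180.0 + 2.0 * i for i in range(20)]  # toy rising contour
    print(normalized_log_pitch(rising)[:3])
```

One consequence of this kind of normalization is that scaling every frame's f0 by a constant (say, a speaker with a higher overall pitch) leaves the feature unchanged, so the model sees the relative contour rather than the absolute pitch level.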