[Question] How are tones handled?

Grouping tonal variants follows the recommendation in https://kaldi-asr.org/doc/tree_externals.html. The effect is that tonal variants of a vowel share data with each other, which addresses data sparsity. This keeps any particular vowel-tone combination from having too little data to model the vowel properly, and vowel quality is the primary aspect contributing to alignment. But while the variants share the same root of the decision tree, each phone can end up with different leaves (leading to different sequences of PDFs), so the tone can certainly have an impact on the alignment.
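To make the grouping concrete, here is a minimal sketch of how tonal variants could be collected under one tree root. The phone labels with trailing tone digits (`a1`, `a2`, ...) and the helper `group_tonal_variants` are hypothetical, and the `shared split` line format is modelled on the roots file described in the Kaldi page linked above; it is not taken from MFA's code.

```python
import re
from collections import defaultdict


def group_tonal_variants(phones):
    """Group phones that differ only in a trailing tone digit under one root.

    Assumption: tone is written as a trailing digit (e.g. "a1", "a2").
    Real phone sets may mark tone differently.
    """
    groups = defaultdict(list)
    for phone in phones:
        base = re.sub(r"\d+$", "", phone)  # strip the (hypothetical) tone digit
        groups[base].append(phone)
    # One line per root: every phone on a line starts from the same tree root,
    # and "split" lets the tree still ask questions that separate them.
    return [f"shared split {' '.join(variants)}" for variants in groups.values()]


phones = ["a1", "a2", "a3", "i1", "i2", "k"]
root_lines = group_tonal_variants(phones)
for line in root_lines:
    print(line)
```

Because the variants on a line are allowed to split, the tree can still give `a1` and `a3` different leaves when the data supports it, which is how tone can end up affecting alignment.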

One thing to note is that the tonal variants should generally differ only in their pitch features, not in their MFCC features, which have only 13 coefficients and won't include any information from harmonics in the spectrum. By default, MFA models don't use raw f0, but rather a log-pitch normalized over a 1.5 second window (see the process-kaldi-pitch-feats binary docs for more details), along with a probability-of-voicing feature.
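As a rough illustration of why normalized log-pitch is used instead of raw f0, the sketch below subtracts a sliding-window mean from log f0. It is a simplification and not Kaldi's implementation: the function name, the 10 ms frame shift, and the plain unweighted mean are all assumptions (as far as I know, Kaldi weights the mean by voicing probability, which is left out here).

```python
import math


def normalize_log_pitch(f0_hz, frame_shift=0.01, window=1.5):
    """Subtract a centered moving average from log f0.

    Assumptions: one f0 value per frame, 10 ms frame shift, unweighted mean.
    """
    half = int(round(window / frame_shift)) // 2  # frames on each side
    log_f0 = [math.log(f) for f in f0_hz]
    normalized = []
    for t in range(len(log_f0)):
        lo = max(0, t - half)
        hi = min(len(log_f0), t + half + 1)
        local_mean = sum(log_f0[lo:hi]) / (hi - lo)
        normalized.append(log_f0[t] - local_mean)
    return normalized


# The same rising-falling contour produced at two different pitch ranges.
low_voice = [100.0, 110.0, 120.0, 130.0, 120.0]
high_voice = [2.0 * f for f in low_voice]

low_norm = normalize_log_pitch(low_voice)
high_norm = normalize_log_pitch(high_voice)
```

Both contours normalize to the same values, so the feature reflects the shape of the tone relative to the surrounding speech and not the speaker's absolute pitch range.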

MontrealCorpusTools / Montreal-Forced-Aligner
