MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

How to use pretrained models of MFA in different Aligner #56

Open wkranti opened 6 years ago

wkranti commented 6 years ago

I want to know whether we can use a pretrained model for a language other than English in a different aligner that uses the same Kaldi toolkit but an ANN as the acoustic model.

mmcauliffe commented 6 years ago

MFA doesn't use ANNs for the acoustic models yet; someone is currently researching that. The models are instead triphone HMM-GMM, which could theoretically be used in other applications that use Kaldi, but only if those applications use HMM-GMM acoustic models as well.

wkranti commented 6 years ago

Turns out HMM-DNN outperforms both HMM-GMM and HMM-ANN according to http://ieeexplore.ieee.org/document/6936704/

nryant commented 6 years ago

@wkranti Neural network acoustic models provide large reductions in WER for speech-to-text, but I've yet to get consistent large improvements in segmentation accuracy from them for forced alignment. Presumably because:

  • the search space is already massively constrained by knowing the transcription
  • the neural network acoustic models tend to overtrain to the GMM-HMM bootstrapped labels

Mainly the latter, I feel: when I've performed experiments using corpora where a ground-truth frame labeling is known, the difference in performance is stark. For instance, on TIMIT even a simple fully-connected feed-forward architecture using monophone models vastly outperforms a GMM-HMM architecture using triphones or boundary models (all scores are percent of boundaries within a 20 ms tolerance):

  • triphone GMM-HMM: 90.5%
  • boundary GMM-HMM: 92.1%
  • monophone DNN-HMM: 92.6%
  • boundary DNN-HMM: 93.9%
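
For reference, here is a minimal sketch (not from the thread) of how such a 20 ms boundary metric can be computed, assuming the reference and predicted alignments share the same phone sequence so boundaries pair up one-to-one; the function and variable names are made up:

```python
def boundary_accuracy(ref_times, hyp_times, tol=0.020):
    """Percent of predicted boundaries within `tol` seconds of the reference.

    ref_times, hyp_times: paired boundary times in seconds, same length,
    assuming identical phone sequences in both alignments.
    """
    assert len(ref_times) == len(hyp_times)
    hits = sum(abs(r - h) <= tol for r, h in zip(ref_times, hyp_times))
    return 100.0 * hits / len(ref_times)

# e.g. boundary_accuracy([0.12, 0.31, 0.47], [0.13, 0.30, 0.50]) -> 66.7
```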

I suppose you could try bootstrapping the neural network acoustic models like Google did a few years ago (can't remember the paper), but I've never made a serious attempt at this.

mmcauliffe commented 6 years ago

Thanks for chiming in, Neville! When you say boundary models, are those just predicting whether or not there's a boundary without modelling phones?

nryant commented 6 years ago

Nope; it's an approach to modeling in which special boundary models are inserted between each pair of phones. The usual recipe with Kaldi would be to create a new FST, call it B, that composes with the lexicon to insert a boundary symbol between each pair of phones. This boundary symbol is mapped to a 1-state model without self-transitions. You start with a single boundary model for all phone-phone pairs, then split it using phonetic decision trees. For careful read speech like TIMIT this gives a modest boost in phone-level accuracy over monophone models and somewhat less of a boost over triphone models.
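
To make that recipe concrete, here is a hypothetical sketch (not from the thread) that writes out such a boundary-insertion FST B in OpenFst text format. The phone list, the "#B" symbol, and the epsilon label are placeholders, and the output would still need to be compiled with fstcompile against matching symbol tables before composing with the lexicon FST L:

```python
# Emit arcs of a transducer B that requires exactly one boundary symbol between
# consecutive phones on its input side and deletes it (maps to <eps>) on its
# output side, so that B composed with L accepts phone+boundary sequences.

phones = ["sil", "aa", "ae", "b", "d"]  # placeholder phone set
BOUNDARY = "#B"                         # placeholder boundary symbol
EPS = "<eps>"

def boundary_fst_text(phones, boundary=BOUNDARY, eps=EPS):
    lines = []
    for p in phones:
        lines.append(f"0 1 {p} {p}")       # state 0 -> 1: consume a phone unchanged
    lines.append(f"1 0 {boundary} {eps}")  # state 1 -> 0: boundary between phones
    lines.append("0")                      # final state: empty sequence allowed
    lines.append("1")                      # final state: sequence ends in a phone
    return "\n".join(lines)

if __name__ == "__main__":
    print(boundary_fst_text(phones))
```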

Though, it's still not clear to me if boundary models are a good fit for conversational speech. I suspect they don't handle reductions as well, especially when using a 10 ms step; at least, this is my impression from qualitative assessment of results on conversational speech. I've not had time to work on forced alignment for quite some time, so I haven't done a careful comparison on a decent-sized corpus of conversational speech with ground-truth labeling.

wkranti commented 6 years ago

Thanks @nryant and @mmcauliffe for your insight.

whusym commented 5 years ago

Hi @nryant, I'm trying to replicate the same effect you describe above (i.e., that a DNN yields a significant improvement). Could you point me to where I can find your detailed work on this? Thanks!

nryant commented 5 years ago

Hi @whusym,

The models and related scripts are part of an internal codebase at LDC developed circa 2014 using the Kaldi nnet2 framework. Abstracting from some of the complexities introduced by a misguided effort to support pronunciation modeling and multiple types of boundary models, the approach was as follows:

So, a fairly standard hybrid architecture. When trained/tested on TIMIT there was generally no benefit to neural network acoustic models as they tended to learn the same biases and produce the same errors as the original GMM-HMM model. On the other hand, if initialized from a ground-truth state sequence (obtained by performing an equal alignment separately for each TIMIT phone, anchored by reference phone onset/offset), you do get an advantage.
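
For illustration, here is a minimal sketch (not part of the original codebase) of the equal-alignment initialization described above: each reference phone interval is divided evenly among that phone's HMM states to produce a frame-level state target sequence. The 10 ms frame step and 3-state-per-phone topology are assumptions for the example:

```python
FRAME_STEP = 0.010  # assumed 10 ms frame step

def equal_align(phone_intervals, states_per_phone=3):
    """phone_intervals: list of (phone, start_sec, end_sec) from reference labels.

    Returns one (phone, state_index) target per frame, with each phone's frames
    spread evenly across its HMM states, anchored by the reference onset/offset.
    """
    targets = []
    for phone, start, end in phone_intervals:
        n_frames = max(1, int(round((end - start) / FRAME_STEP)))
        for i in range(n_frames):
            state = min(states_per_phone - 1, i * states_per_phone // n_frames)
            targets.append((phone, state))
    return targets

# e.g. equal_align([("sil", 0.0, 0.06), ("aa", 0.06, 0.15)])
```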

Most likely, what is needed is to dispense with the GMM-HMM training and bootstrap directly using the neural network acoustic models, either like Google did in this 2014 InterSpeech paper:

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42653.pdf

or using lattice-free MMI:

http://www.danielpovey.com/files/2018_interspeech_end2end.pdf

I'm actually getting ready to revisit the question of forced alignment in a more modern context and will have a look at the latter approach to see if it bears fruit, especially with massive amounts of training data.