wkranti opened this issue 6 years ago
MFA doesn't use ANNs for the acoustic models yet; someone is currently researching that. The models are instead triphone HMM-GMM, which could theoretically be used in other applications built on Kaldi, but only if those applications also use HMM-GMM acoustic models.
Turns out HMM-DNN outperforms both HMM-GMM and HMM-ANN according to http://ieeexplore.ieee.org/document/6936704/
@wkranti Neural network acoustic models provide large reductions in WER for speech-to-text, but I've yet to get consistently large improvements in segmentation accuracy from them for forced alignment. Presumably because:
- the search space is already massively constrained by knowing the transcription
- the neural network acoustic models tend to overtrain to the GMM-HMM bootstrapped labels
Mainly the latter, I feel: when I've run experiments on corpora where a ground-truth frame-level labeling is known, the difference in performance is stark. For instance, on TIMIT even a simple fully-connected feed-forward architecture using monophone models vastly outperforms a GMM-HMM architecture using triphones or boundary models (all scores are the percentage of boundaries placed within a 20 ms tolerance):
- triphone GMM-HMM: 90.5%
- boundary GMM-HMM: 92.1%
- monophone DNN-HMM: 92.6%
- boundary DNN-HMM: 93.9%
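A minimal sketch of this metric, assuming both alignments contain the same boundaries in the same order (function and variable names are illustrative, not from any existing tool):

```python
# Illustrative sketch: percentage of reference boundaries for which the
# hypothesized boundary lands within a given tolerance (20 ms by default).
def boundary_accuracy(ref_times, hyp_times, tol=0.020):
    """ref_times, hyp_times: lists of boundary times in seconds,
    one hypothesized boundary per reference boundary."""
    assert len(ref_times) == len(hyp_times)
    hits = sum(1 for r, h in zip(ref_times, hyp_times) if abs(r - h) <= tol)
    return 100.0 * hits / len(ref_times)

# e.g. boundary_accuracy([0.10, 0.25, 0.40], [0.11, 0.26, 0.46]) -> ~66.7
```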
I suppose you could try bootstrapping the neural network acoustic models like Google did a few years ago (can't remember the paper), but I've never made a serious attempt at this.
Thanks for chiming in, Neville! When you say boundary models, are those just predicting whether or not there's a boundary without modelling phones?
Nope, this is an approach in which special boundary models are inserted between adjacent phones. The usual recipe, using Kaldi, would be to create a new FST, call it B, that composes with the lexicon to insert a boundary symbol between each pair of phones. This boundary symbol is mapped to a 1-state model without self-transitions. You start with a single boundary model shared across all phone-phone pairs, then split it using phonetic decision trees. For careful read speech like TIMIT this gives a modest boost in phone-level accuracy over monophone models and somewhat less of a boost over triphone models.
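A minimal sketch of such a boundary-insertion transducer B in OpenFst text format (illustrative only, not taken from any Kaldi recipe or the codebase discussed here; the phone inventory and the `<bnd>` symbol name are stand-ins):

```python
# B accepts a phone sequence on its input side and emits the same sequence
# with a <bnd> symbol between adjacent phones, so composing the lexicon L
# with B on the phone side yields word -> phone <bnd> phone <bnd> ...
phones = ["AA", "AE", "K", "T"]   # stand-in phone inventory
BND = "<bnd>"                     # hypothetical boundary symbol

def boundary_fst_text():
    lines = []
    # state 0: start and final (accepts the empty sequence)
    # state 1: final, reached after passing a phone through
    # state 2: reached after emitting the boundary symbol
    for p in phones:
        lines.append(f"0 1 {p} {p}")   # first phone, unchanged
        lines.append(f"2 1 {p} {p}")   # later phones, after a boundary
    lines.append(f"1 2 <eps> {BND}")   # insert <bnd> before the next phone
    lines.append("0")                  # final states
    lines.append("1")
    return "\n".join(lines)

if __name__ == "__main__":
    # Compile with something like:
    #   python make_b.py | fstcompile --isymbols=phones.txt --osymbols=phones.txt > B.fst
    # where phones.txt is a symbol table containing <eps>, <bnd>, and the phones.
    print(boundary_fst_text())
```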
That said, it's still not clear to me whether boundary models are a good fit for conversational speech. I suspect they don't handle reductions as well, especially with a 10 ms frame step; at least, that's my impression from qualitative assessment of results on conversational speech. I haven't had time to work on forced alignment for quite a while, so I haven't done a careful comparison on a decent-sized corpus of conversational speech with ground-truth labeling.
Thanks @nryant and @mmcauliffe for your insight.
Hi @nryant, I'm trying to replicate the same effect (i.e., that a DNN yields a significant improvement). Could you point me to where I can find your detailed work on this? Thanks!
Hi @whusym,
The models and related scripts are part of an internal codebase at LDC developed circa 2014 using the Kaldi nnet2 framework. Abstracting from some of the complexities introduced by a misguided effort to support pronunciation modeling and multiple types of boundary models, the approach was as follows:
So, a fairly standard hybrid architecture. When trained/tested on TIMIT there was generally no benefit to neural network acoustic models as they tended to learn the same biases and produce the same errors as the original GMM-HMM model. On the other hand, if initialized from a ground-truth state sequence (obtained by performing an equal alignment separately for each TIMIT phone, anchored by reference phone onset/offset), you do get an advantage.
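For concreteness, the "equal alignment" initialization can be sketched roughly as follows (the frame step, number of states per phone, and interval format are assumptions for illustration, not details from the LDC code):

```python
# Illustrative sketch: derive a ground-truth frame-level state labeling from
# reference phone boundaries by splitting each phone's frames evenly among
# its HMM states.
FRAME_STEP = 0.010      # assumed 10 ms frame shift
STATES_PER_PHONE = 3    # assumed 3-state left-to-right topology

def equal_align(phone_intervals):
    """phone_intervals: list of (phone, start_sec, end_sec) from the
    reference transcription. Returns one (phone, state_index) per frame."""
    labels = []
    for phone, start, end in phone_intervals:
        n_frames = max(1, int(round((end - start) / FRAME_STEP)))
        for i in range(n_frames):
            # map position within the phone to one of its states
            state = min(STATES_PER_PHONE - 1, i * STATES_PER_PHONE // n_frames)
            labels.append((phone, state))
    return labels

if __name__ == "__main__":
    # toy example: two phones with reference onsets/offsets in seconds
    print(equal_align([("sil", 0.00, 0.05), ("ae", 0.05, 0.14)]))
```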
Most likely, what is needed is to dispense with the GMM-HMM training and bootstrap directly using the neural network acoustic models, either like Google did in this 2014 InterSpeech paper:
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42653.pdf
or using lattice-free MMI:
http://www.danielpovey.com/files/2018_interspeech_end2end.pdf
I'm actually getting ready to revisit the question of forced alignment in a more modern context and will have a look at the latter approach to see if it bears fruit, especially with massive amounts of training data.
I want to know whether we can use a pretrained model for a language other than English in a different aligner built on the same Kaldi toolkit but using an ANN as the acoustic model?