Brief Introduction

Before the Deep Learning (DL) era of speech recognition, HMM and GMM were the two must-learn technologies for speech recognition.
Relevant Concepts:
Pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures.
Gaussian Mixture Model (GMM) is a probabilistic model for representing normally distributed subpopulations within an overall population. Mixture models in general don't require knowing which subpopulation a data point belongs to, allowing the model to learn the subpopulations automatically. (A minimal Python sketch follows these concepts.)
Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable (i.e. hidden) states. In simpler Markov models (like a Markov chain), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters, while in the hidden Markov model, the state is not directly visible, but the output dependent on the state is visible.
A hidden Markov model can be considered a generalization of a mixture model where the hidden variables, which control the mixture component to be selected for each observation, are related through a Markov process rather than independent of each other.
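To make the GMM idea concrete, here is a minimal Python sketch (not part of the original notes) that fits a two-component mixture to unlabeled synthetic data with scikit-learn; the data, the component count, and all parameter values are illustrative assumptions.

```python
# A minimal sketch: learn two "subpopulations" from unlabeled 2-D points.
# Assumes scikit-learn is installed; the data are synthetic, not speech features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two unlabeled subpopulations with different means.
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=1.0, size=(200, 2)),
])

# Fit a 2-component GMM; no labels are given, the components are learned.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

print(gmm.means_)                   # learned component means
print(gmm.predict(data[:5]))        # most likely component for each point
print(gmm.score_samples(data[:5]))  # per-sample log-likelihood
```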
Steps (in speech recognition):
The language model estimates the likelihood of a word sequence.
A pronunciation model maps words to phones, either through a lookup table (a pronunciation dictionary) or by using a corpus that is already transcribed at the phoneme level.
The acoustic model describes the sequence of acoustic feature vectors given a sequence of phones rather than words.
The distribution of feature vectors for a phone can be modeled with a Gaussian Mixture Model (GMM), whose parameters are learned from training data.
The transitions between phones, and the observations they emit, can be modeled with a Hidden Markov Model (HMM); a sketch combining both models follows this list.
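As a rough illustration of the last two steps, the sketch below trains a single-phone GMM-HMM on synthetic feature frames with the third-party hmmlearn library; the toy pronunciation table, the 13-dimensional random "features", and every model size are assumptions made for the example, not values taken from the papers listed below.

```python
# A minimal sketch of a per-phone acoustic model: HMM states with GMM emissions.
# Assumes the third-party `hmmlearn` package; the frames are random stand-ins
# for real MFCC feature vectors.
import numpy as np
from hmmlearn.hmm import GMMHMM

# Toy pronunciation table mapping words to phone sequences (illustrative only).
pronunciation = {"speech": ["s", "p", "iy", "ch"], "model": ["m", "aa", "d", "ax", "l"]}
print(pronunciation["speech"])

rng = np.random.default_rng(0)
# Pretend these are 13-dimensional feature frames from five utterances of one phone.
sequences = [rng.normal(size=(40, 13)) for _ in range(5)]
X = np.concatenate(sequences)              # all frames stacked together
lengths = [len(seq) for seq in sequences]  # number of frames per utterance

# One phone model: 3 hidden states (begin/middle/end of the phone),
# each emitting from a 2-component Gaussian mixture.
phone_model = GMMHMM(n_components=3, n_mix=2, covariance_type="diag",
                     n_iter=20, random_state=0)
phone_model.fit(X, lengths)

test = rng.normal(size=(40, 13))
print(phone_model.score(test))    # log-likelihood of the feature sequence
print(phone_model.predict(test))  # most likely hidden-state sequence
```

In a full recognizer, one such model would be trained per phone (or per context-dependent phone state), and the phone models would be composed with the pronunciation and language models during decoding.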
Related Paper List
Speaker Verification Using Adapted Gaussian Mixture Models (2000) [pdf]
Remaining issue: speaker and channel information are bound together in an unknown way in the current spectral-based features, and the performance of these systems degrades when the microphone or acoustic environment changes between training data and recognition data.
Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM (2007) [pdf]
The speaker identification results by GMM showed that the proposed position-dependent CMN achieved a relative error reduction rate of 64.0% over no CMN and 30.2% over position-independent CMN. The average speaker recognition error rate reached 0.69%. (A loud speaker and a low-noise environment are required.)
Text-independent speaker recognition by combining speaker-specific GMM with speaker-adapted syllable-based HMM (2004) [pdf]
The method obtained 98.8% accuracy for text-independent speaker identification across three speaking-style modes (normal, fast, slow), using a short test utterance (about 4 seconds). (No noise conditions mentioned.)
Open source code references: