Practicum-PnG / practicum-png


Investigate voice verification algorithms based on Pattern matching/GMM-HMM #3

Open anup-ahuje opened 4 years ago

darenZheng commented 4 years ago

Brief Introduction

Before the Deep Learning (DL) era of speech recognition, the HMM and the GMM were the two must-learn technologies for speech recognition.

Relevant Concepts:

[Screenshot attachment: diagram of relevant concepts]

Steps (in speech recognition):

  1. The language model is about the likelihood of the word sequence.
  2. A pronunciation model converts words to phones, either via lookup tables or by using a corpus that is already transcribed with phonemes.
  3. The acoustic model is about modeling a sequence of feature vectors given a sequence of phones instead of words.
  4. The distribution of features for a phone can be modeled with a Gaussian Mixture Model (GMM), whose parameters are learned from training data.
  5. The transition between phones and the corresponding observable can be modeled with the Hidden Markov Model (HMM).
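Steps 4 and 5 can be sketched in a few lines of NumPy: a toy GMM per phone state provides per-frame emission log-likelihoods, and Viterbi decoding over an HMM's transitions recovers the most likely phone sequence. All numbers (means, transition probabilities, observations) are invented for illustration, not taken from any real acoustic model.

```python
import numpy as np

# Toy acoustic model: 2 phone states, each a 2-component GMM over 2-dim
# features with shared unit variance (all values invented for illustration).
means = np.array([[[0., 0.], [1., 1.]],    # state 0 mixture means
                  [[4., 4.], [5., 5.]]])   # state 1 mixture means
weights = np.array([[0.5, 0.5], [0.5, 0.5]])
var = 1.0

def log_gmm(x, s):
    """Log-likelihood of frame x under state s's GMM (emission score)."""
    d = x - means[s]                                      # (components, dims)
    comp = -0.5 * np.sum(d * d, axis=1) / var - np.log(2 * np.pi * var)
    return np.log(np.sum(weights[s] * np.exp(comp)))

# HMM transition and initial log-probabilities between the two phone states.
logA = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
logpi = np.log(np.array([0.5, 0.5]))

# Six observed frames: three near state 0, three near state 1.
obs = np.array([[0., 0.], [1., 0.], [.5, .5], [4., 4.], [5., 4.], [4.5, 4.5]])

# Viterbi decoding: most likely hidden state (phone) sequence.
T, S = len(obs), 2
delta = np.full((T, S), -np.inf)
back = np.zeros((T, S), dtype=int)
for s in range(S):
    delta[0, s] = logpi[s] + log_gmm(obs[0], s)
for t in range(1, T):
    for s in range(S):
        scores = delta[t - 1] + logA[:, s]
        back[t, s] = int(np.argmax(scores))
        delta[t, s] = scores[back[t, s]] + log_gmm(obs[t], s)
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print(path)  # → [0, 0, 0, 1, 1, 1]
```

In a real recognizer the same structure holds, only larger: MFCC features replace the 2-dim toys, each phone has several HMM states, and the language/pronunciation models constrain which state sequences are searched.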

Related Paper List

  1. Speaker Verification Using Adapted Gaussian Mixture Models (2000) [pdf] Remaining issue: speaker and channel information are bound together in an unknown way in the current spectral-based features, and the performance of these systems degrades when the microphone or acoustic environment changes between training data and recognition data.
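The core idea of the adapted-GMM approach is to MAP-adapt a universal background model (UBM) toward a speaker's enrollment data. A minimal mean-only sketch of that adaptation, with an invented 2-component UBM, unit variances, and synthetic enrollment frames (the relevance factor r = 16 is a common choice, but all other numbers are illustrative assumptions):

```python
import numpy as np

# Invented UBM: 2 components in 2 dims, equal weights, unit variance.
ubm_means = np.array([[0.0, 0.0], [3.0, 3.0]])
ubm_weights = np.array([0.5, 0.5])
var = 1.0
r = 16.0  # relevance factor controlling how far means move toward the data

rng = np.random.default_rng(1)
# Synthetic enrollment frames: this "speaker" shifts the first component.
frames = np.vstack([rng.normal([1.0, 1.0], 0.3, (100, 2)),
                    rng.normal([3.0, 3.0], 0.3, (100, 2))])

# E-step: posterior responsibility of each UBM component for each frame.
d = frames[:, None, :] - ubm_means[None, :, :]
logp = -0.5 * np.sum(d * d, axis=2) / var + np.log(ubm_weights)
p = np.exp(logp - logp.max(axis=1, keepdims=True))
gamma = p / p.sum(axis=1, keepdims=True)          # shape (N, 2)

# Sufficient statistics, then the MAP interpolation of the means:
# adapted = alpha * (data mean) + (1 - alpha) * (UBM mean).
n = gamma.sum(axis=0)                             # soft counts per component
ex = gamma.T @ frames / n[:, None]                # data-driven component means
alpha = (n / (n + r))[:, None]
adapted_means = alpha * ex + (1 - alpha) * ubm_means
print(adapted_means)
```

The first component's mean moves from the UBM's (0, 0) most of the way toward the speaker's (1, 1), while the second, which matched the UBM already, barely changes; verification then scores test frames against the adapted model versus the UBM.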

  2. Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM (2007) [pdf] In the GMM-based speaker identification results, the proposed position-dependent CMN achieved relative error reduction rates of 64.0% over no CMN and 30.2% over position-independent CMN. The average speaker recognition error rate reached 0.69%. (A loud speaker and a low-noise environment are required.)
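The CMN (cepstral mean normalization) underlying that paper is simple: subtracting the per-utterance mean of each cepstral coefficient cancels any stationary channel offset, since a convolutive channel becomes additive in the cepstral domain. The paper's position-dependent variant estimates that mean per speaker position; the sketch below shows only plain per-utterance CMN on synthetic "cepstral" frames with an invented channel offset.

```python
import numpy as np

# Synthetic 3-dim "cepstral" frames plus a constant channel offset
# (the offset values are invented for illustration).
rng = np.random.default_rng(2)
channel_offset = np.array([2.0, -1.0, 0.5])
clean = rng.normal(size=(200, 3))
observed = clean + channel_offset        # stationary channel adds a bias

# CMN: subtract the per-utterance mean of each coefficient.
normalized = observed - observed.mean(axis=0)

# The constant channel offset is removed (up to the finite-sample mean).
print(np.abs(normalized.mean(axis=0)).max())
```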

  3. Text-independent speaker recognition by combining speaker-specific GMM with speaker-adapted syllable-based HMM (2004) [pdf] The method obtained 98.8% accuracy for text-independent speaker identification across three speaking-style modes (normal, fast, slow) using a short test utterance (about 4 seconds). (Noise conditions not reported.)

Open source code references: