Practicum-PnG / practicum-png


Investigate voice verification algorithms based on ML/DL #4

Open anup-ahuje opened 4 years ago

jingjingeatmore commented 4 years ago

Some good paper references can be found here: https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers in the Speaker Verification section.

A Brief Summary:

  1. ML/DL-based algorithms are most frequently used in text-independent scenarios.
  2. EER (Equal Error Rate) is used to evaluate accuracy.
  3. Not tested in sufficiently noisy environments (not as high as 90 dB).
  4. “DNN based system is more robust to additive noise”
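Since EER is the evaluation metric used throughout these papers, here is a minimal sketch of how it can be computed from a list of verification scores. The threshold sweep and the midpoint convention `(FAR + FRR) / 2` are common choices, not taken from any specific paper above:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false accept rate (FAR)
    equals the false reject rate (FRR).

    scores: higher = more likely the same speaker.
    labels: 1 = genuine trial, 0 = impostor trial.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    eer, best_gap = 1.0, np.inf
    # Sweep every distinct score as a candidate accept threshold.
    for t in np.unique(scores):
        accepts = scores >= t
        far = np.mean(accepts[labels == 0])   # impostors wrongly accepted
        frr = np.mean(~accepts[labels == 1])  # genuine speakers wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the closest FAR/FRR crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A perfectly separable score set gives an EER of 0; fully overlapping genuine/impostor scores push it toward 0.5.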

Relevant Concept:

  1. Speaker Verification (SV) is verifying the claimed identity of a speaker using their voice characteristics, as captured by a recording device such as a microphone.
  2. I-Vector and D-Vector (quoted from Quora; an informal way to quickly explain these concepts):
     - I-vector: a feature that represents the idiosyncratic characteristics of the frame-level features' distributive pattern. I-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not extracted when computing the i-vector). It is extracted in a manner similar to the eigenvoice adaptation scheme or the JFA technique, but per sentence (or input speech sample). Origin: https://groups.csail.mit.edu/sls/publications/2011/Dehak_IEEE_May2011.pdf
     - D-vector: extracted using a DNN. To extract a d-vector, a DNN is trained to take stacked filterbank features (similar to the DNN acoustic model used in ASR) and produce a one-hot speaker label (or speaker probabilities) at the output. The d-vector is the averaged activation of the last hidden layer of this DNN. So, unlike the i-vector framework, this makes no assumption about the feature's distribution (the i-vector framework assumes the i-vector, i.e. the latent variable, has a Gaussian distribution). Origin: https://ieeexplore-ieee-org.proxy.library.cmu.edu/document/6854363
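The d-vector idea above (average the last hidden layer's activations over all frames of an utterance) can be sketched with a toy network. The random weights, layer sizes, and the final length-normalization step are illustrative assumptions standing in for a DNN actually trained on speaker classification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained speaker-classification DNN:
# two ReLU hidden layers (weights would normally come from training).
W1, b1 = rng.standard_normal((40, 64)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((64, 32)), rng.standard_normal(32)

def last_hidden_activations(frame):
    """Forward one stacked-filterbank frame up to the last hidden layer."""
    h1 = np.maximum(frame @ W1 + b1, 0)
    return np.maximum(h1 @ W2 + b2, 0)

def d_vector(frames):
    """Average the last hidden layer's activations over all frames,
    then length-normalize (a common, but assumed, convention)."""
    acts = np.stack([last_hidden_activations(f) for f in frames])
    v = acts.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-9)

utterance = rng.standard_normal((100, 40))  # 100 frames of 40-dim features
dvec = d_vector(utterance)                  # one fixed-size embedding per utterance
```

The key property is that a variable-length utterance maps to one fixed-size embedding, which can then be compared across speakers.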

Steps:

  1. Development: a background model is created for speaker representation.
  2. Enrollment: speaker models for new users are generated using the background model.
  3. Evaluation: the claimed identity of a test utterance is confirmed or rejected by comparing it against the previously generated speaker models.
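The enrollment and evaluation steps above can be sketched as follows, assuming utterance embeddings (e.g. i-vectors or d-vectors) are already available. Cosine scoring and the threshold value are illustrative assumptions; real systems often use PLDA or other trained scoring backends:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def enroll(embeddings):
    """Enrollment: average a new user's utterance embeddings
    into a single speaker model, then length-normalize."""
    m = np.mean(embeddings, axis=0)
    return m / (np.linalg.norm(m) + 1e-9)

def verify(speaker_model, test_embedding, threshold=0.7):
    """Evaluation: accept the claimed identity if the test utterance's
    embedding is similar enough to the enrolled model.
    (threshold=0.7 is a hypothetical value; it would be tuned, e.g. at the EER point.)"""
    return cosine(speaker_model, test_embedding) >= threshold

# Usage: enroll from two utterances, then verify a genuine and an impostor trial.
model = enroll(np.array([[1.0, 0.0], [1.0, 0.0]]))
genuine = verify(model, np.array([1.0, 0.0]))   # same direction -> accept
impostor = verify(model, np.array([0.0, 1.0]))  # orthogonal -> reject
```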

Recent Text-Dependent Work:

  1. Deep neural networks for small footprint text-dependent speaker verification [IEEE]: during speaker enrollment, the trained DNN is used to extract speaker-specific features from the last hidden layer.
  2. Exploring Sequential Characteristics in Speaker Bottleneck Feature for Text-Dependent Speaker Verification [IEEE]: speaker supervector; noise not introduced; best EER of 1.627%.
  3. HMM-Based Phrase-Independent i-Vector Extractor for Text-Dependent Speaker Verification [IEEE]: hidden Markov model (HMM) based extension of the i-vector approach; noise not introduced; pretty good EER.