Practicum-PnG / practicum-png


Investigate voice verification algorithms based on ML/DL #4

Open anup-ahuje opened 4 years ago

jingjingeatmore commented 4 years ago

Some good paper references can be found here: https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers in the Speaker Verification section.

A Brief Summary:

  1. ML/DL-based algorithms are most frequently used in text-independent scenarios.
  2. EER (Equal Error Rate) is used to evaluate accuracy.
  3. Not tested in sufficiently noisy environments (not as high as 90 dB).
  4. “DNN based system is more robust to additive noise”
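Since EER is the evaluation metric used throughout these papers, here is a minimal sketch of how it can be computed from a list of verification scores. The threshold sweep and the midpoint convention `(FAR + FRR) / 2` are common choices, not taken from any specific paper above:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false accept rate (FAR)
    equals the false reject rate (FRR).

    scores: higher = more likely the same speaker.
    labels: 1 = genuine trial, 0 = impostor trial.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    eer, best_gap = 1.0, np.inf
    # Sweep every distinct score as a candidate accept threshold.
    for t in np.unique(scores):
        accepts = scores >= t
        far = np.mean(accepts[labels == 0])   # impostors wrongly accepted
        frr = np.mean(~accepts[labels == 1])  # genuine speakers wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the closest FAR/FRR crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A perfectly separable score set gives an EER of 0; fully overlapping genuine/impostor scores push it toward 0.5.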

Relevant Concept:

  1. Speaker Verification (SV) is verifying the claimed identity of a speaker using their voice characteristics, as captured by a recording device such as a microphone.
  2. I-Vector and D-Vector (quoted from Quora; an informal way to quickly explain these concepts):
     - I-vector: a feature that represents the idiosyncratic characteristics of the frame-level features' distributive pattern. I-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not extracted when computing the i-vector). It is extracted in a manner similar to the eigenvoice adaptation scheme or the JFA technique, but per sentence (or input speech sample). Origin: https://groups.csail.mit.edu/sls/publications/2011/Dehak_IEEE_May2011.pdf
     - D-vector: extracted using a DNN. To extract a d-vector, a DNN is trained to take stacked filterbank features (similar to the DNN acoustic model used in ASR) and produce a one-hot speaker label (or speaker probabilities) at the output. The d-vector is the averaged activation of the last hidden layer of this DNN. So, unlike the i-vector framework, this makes no assumption about the feature's distribution (the i-vector framework assumes the i-vector, i.e. the latent variable, has a Gaussian distribution). Origin: https://ieeexplore-ieee-org.proxy.library.cmu.edu/document/6854363
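The d-vector idea above (average the last hidden layer's activations over all frames of an utterance) can be sketched with a toy network. The random weights, layer sizes, and the final length-normalization step are illustrative assumptions standing in for a DNN actually trained on speaker classification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained speaker-classification DNN:
# two ReLU hidden layers (weights would normally come from training).
W1, b1 = rng.standard_normal((40, 64)), rng.standard_normal(64)
W2, b2 = rng.standard_normal((64, 32)), rng.standard_normal(32)

def last_hidden_activations(frame):
    """Forward one stacked-filterbank frame up to the last hidden layer."""
    h1 = np.maximum(frame @ W1 + b1, 0)
    return np.maximum(h1 @ W2 + b2, 0)

def d_vector(frames):
    """Average the last hidden layer's activations over all frames,
    then length-normalize (a common, but assumed, convention)."""
    acts = np.stack([last_hidden_activations(f) for f in frames])
    v = acts.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-9)

utterance = rng.standard_normal((100, 40))  # 100 frames of 40-dim features
dvec = d_vector(utterance)                  # one fixed-size embedding per utterance
```

The key property is that a variable-length utterance maps to one fixed-size embedding, which can then be compared across speakers.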

Steps:

  1. Development: a background model is created for speaker representation.
  2. Enrollment: speaker models for new users are generated using the background model.
  3. Evaluation: the claimed identity of a test utterance is confirmed or rejected by comparing it against the previously generated speaker models.
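The enrollment and evaluation steps above can be sketched as follows, assuming utterance embeddings (e.g. i-vectors or d-vectors) are already available. Cosine scoring and the threshold value are illustrative assumptions; real systems often use PLDA or other trained scoring backends:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def enroll(embeddings):
    """Enrollment: average a new user's utterance embeddings
    into a single speaker model, then length-normalize."""
    m = np.mean(embeddings, axis=0)
    return m / (np.linalg.norm(m) + 1e-9)

def verify(speaker_model, test_embedding, threshold=0.7):
    """Evaluation: accept the claimed identity if the test utterance's
    embedding is similar enough to the enrolled model.
    (threshold=0.7 is a hypothetical value; it would be tuned, e.g. at the EER point.)"""
    return cosine(speaker_model, test_embedding) >= threshold

# Usage: enroll from two utterances, then verify a genuine and an impostor trial.
model = enroll(np.array([[1.0, 0.0], [1.0, 0.0]]))
genuine = verify(model, np.array([1.0, 0.0]))   # same direction -> accept
impostor = verify(model, np.array([0.0, 1.0]))  # orthogonal -> reject
```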

Recent Text-Dependent Work:

  1. Deep neural networks for small footprint text-dependent speaker verification [IEEE]: during speaker enrollment, the trained DNN is used to extract speaker-specific features from the last hidden layer.
  2. Exploring Sequential Characteristics in Speaker Bottleneck Feature for Text-Dependent Speaker Verification [IEEE]: speaker supervector; noise not introduced; best EER of 1.627%.
  3. HMM-Based Phrase-Independent i-Vector Extractor for Text-Dependent Speaker Verification [IEEE]: hidden Markov model (HMM) based extension of the i-vector approach; noise not introduced; pretty good EER.