
Speaker Recognition using SideKit

This repo contains my Speaker Recognition/Verification project using SideKit.

Speaker recognition is the identification of a person from an audio file. It is used to answer the question "Who is speaking?" Speaker verification (also called speaker authentication) is similar to speaker recognition, but instead of returning the identity of the speaker, it returns whether the speaker (who claims to be a certain person) is telling the truth or not. Speaker verification is considered to be a little easier than speaker recognition.

Recognizing the speaker can simplify the task of translating speech in systems that have been trained on specific voices or it can be used to authenticate or verify the identity of a speaker as part of a security process. Speaker recognition has a history dating back some four decades and uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy and learned behavioral patterns.

SideKit

SIDEKIT is an open source package for Speaker and Language recognition. The aim of SIDEKIT is to provide an educational and efficient toolkit for speaker/language recognition including the whole chain of treatment that goes from the audio data to the analysis of the system performance.

Authors: Anthony Larcher & Kong Aik Lee & Sylvain Meignier. Version: 1.3.1 of 2019/01/22. You can check the official documentation, although I don't recommend it, from here. Also, here is the API documentation.

To run SIDEKIT on your machine, you need to:

IMPORTANT NOTE:

There is no need to install SIDEKIT, as the library isn't stable and requires some maneuvering. So I cloned the project from GitLab using git clone https://git-lium.univ-lemans.fr/Larcher/sidekit.git and did some editing. So, you just need to clone my project and you are ready to go!

Download Dataset

This project is just a proof-of-concept, so it was built using the merged version of a small open-source dataset called the "Arabic Corpus of Isolated Words", made by the University of Stirling located in the Central Belt of Scotland. This dataset can be downloaded from here.

This dataset is a voice-recorded dataset of 50 native Arabic speakers saying 20 words about 10 times each. It has been recorded with a 44100 Hz sampling rate and 16-bit resolution. This dataset can be used for tasks like Speaker Recognition, Speaker Verification, Voice Biometrics, etc.

This dataset (1GB) is divided into:

After downloading and extracting the dataset, you will find about 50 folders named "S+speakerId", like S01, S02, ... S50. Each one of these folders should contain around 20 audio files for its speaker, where each audio file contains the speaker saying 10 words in a single WAV file. This is repeated for 10 times/sessions. These words are:

first_wav_words = {
        "01": "صِفْرْ",        # zero
        "02": "وَاحِدْ",       # one
        "03": "إِثنَانِْ",      # two
        "04": "ثَلَاثَةْ",      # three
        "05": "أَربَعَةْ",      # four
        "06": "خَمْسَةْ",       # five
        "07": "سِتَّةْ",        # six
        "08": "سَبْعَةْ",       # seven
        "09": "ثَمَانِيَةْ",     # eight
        "10": "تِسْعَةْ"        # nine
}

second_wav_words = {
        "01": "التَّنْشِيطْ",    # activation
        "02": "التَّحْوِيلْ",    # transfer
        "03": "الرَّصِيدْ",      # balance
        "04": "التَّسْدِيدْ",    # payment
        "05": "نَعَمْ",          # yes
        "06": "لَا",            # no
        "07": "التَّمْوِيلْ",    # funding
        "08": "الْبَيَانَاتْ",    # data
        "09": "الْحِسَابْ",      # account
        "10": "إِنْهَاءْ"        # end
}
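
To get a feel for the layout, here is a small sketch that counts the recordings available per speaker. It assumes the S01 ... S50 folder structure described above and a hypothetical extraction path; it is illustrative only, not part of the project's scripts.

from pathlib import Path

# Count the WAV recordings per speaker, assuming the extracted dataset keeps
# the S01 ... S50 folder layout described above.
dataset_dir = Path("arabic_corpus_of_isolated_words")   # hypothetical extraction path
for speaker_dir in sorted(dataset_dir.glob("S*")):
    wav_files = list(speaker_dir.glob("*.wav"))
    print(f"{speaker_dir.name}: {len(wav_files)} recordings")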

How it Works

The SideKit pipeline consists of six main steps, as shown in the following image:

As we can see, these steps are:

1. Preprocessing
2. Structure
3. Feature Extraction
4. Choosing Model
5. Training
6. Evaluation

All the configuration options for all the previous steps can be found in a YAML file called conf.yaml. We will discuss most of these configurations, each in its associated section.
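
For illustration, here is a minimal sketch of how such a configuration file can be read from Python with PyYAML. Only outpath is a key explicitly mentioned in this README; treat the access pattern as an assumption rather than the project's actual code.

import yaml

# Load the pipeline configuration (a minimal sketch; only `outpath` is a key
# explicitly mentioned in this README).
with open("conf.yaml", "r", encoding="utf-8") as f:
    conf = yaml.safe_load(f)

outpath = conf["outpath"]   # base directory for the audio/, task/, and feat/ outputs
print("Pipeline outputs will be written under:", outpath)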

Now, let's talk about each one of these processes in more detail:

1. Preprocessing

The file responsible for data pre-processing is data_init.py, in which I split the whole data into two groups (one for training -enroll- and the other for testing). Then I do some preprocessing on the two sets to match the use case I'm building this model for, like:

In the configuration file conf.yaml, you can modify only these:

The output of this step can be found in the audio directory inside the outpath directory defined as a YAML variable in the configuration file.
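
As a rough illustration of the enroll/test split (not the exact logic of data_init.py; the 8/2 split, directory names, and paths below are assumptions), something like the following could be used:

import shutil
from pathlib import Path

def split_speaker_data(data_dir, out_dir, enroll_count=8):
    """Sketch: copy each speaker's first `enroll_count` recordings to enroll/
    and the rest to test/. The ratio and directory names are illustrative
    assumptions, not necessarily what data_init.py actually does."""
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    for speaker_dir in sorted(data_dir.glob("S*")):          # S01 ... S50
        wavs = sorted(speaker_dir.glob("*.wav"))
        for i, wav in enumerate(wavs):
            subset = "enroll" if i < enroll_count else "test"
            target = out_dir / "audio" / subset / speaker_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy(wav, target / wav.name)

split_speaker_data("arabic_corpus_of_isolated_words", "outpath_dir")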

2. Structure

This step is done in the data_init.py script as well. By structuring, I mean creating the index files and idmap files for SideKit to use. Basically, we need to create at least three files:

The output of this step can be found in the task directory inside the outpath directory defined as a YAML variable in the configuration file.
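
For reference, this is roughly what building an enrollment idmap looks like with SideKit's IdMap class. This is a minimal sketch; the speaker labels, segment naming scheme, and output file name are assumptions, not necessarily those used in data_init.py.

import numpy as np
import sidekit

# Minimal sketch of creating an enrollment IdMap: leftids hold the model
# (speaker) names, rightids hold the corresponding feature/segment names.
models   = np.array(["S01", "S01", "S02"])                    # hypothetical speaker labels
segments = np.array(["S01/sess1", "S01/sess2", "S02/sess1"])  # hypothetical segment names

enroll_idmap = sidekit.IdMap()
enroll_idmap.leftids  = models
enroll_idmap.rightids = segments
enroll_idmap.start    = np.empty(models.shape[0], dtype="|O")  # no start/stop -> use whole file
enroll_idmap.stop     = np.empty(models.shape[0], dtype="|O")

assert enroll_idmap.validate()                   # sanity-check the mapping
enroll_idmap.write("task/enroll_idmap.h5")       # saved as HDF5 for the later steps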

3. Feature Extraction

The file responsible for feature extraction is extract_features.py, in which I extract features from the preprocessed audio files and save them into a new folder called feat inside the directory represented by the outpath YAML variable.
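
The extraction itself relies on SideKit's FeaturesExtractor. The snippet below is a minimal sketch with hand-picked parameter values; the filter-bank size, cepstral count, VAD method, frequency bounds, and paths are illustrative assumptions, while the actual values live in conf.yaml.

import sidekit

# Minimal sketch of a SideKit feature extractor; every value below is an
# illustrative assumption, not necessarily what conf.yaml specifies.
extractor = sidekit.FeaturesExtractor(
    audio_filename_structure="outpath_dir/audio/enroll/{}.wav",
    feature_filename_structure="outpath_dir/feat/{}.h5",
    sampling_frequency=44100,
    lower_frequency=200,
    higher_frequency=3800,
    filter_bank="log",
    filter_bank_size=24,
    window_size=0.025,
    shift=0.01,
    ceps_number=20,
    vad="snr",
    snr=40,
    pre_emphasis=0.97,
    save_param=["vad", "energy", "cep", "fb"],
    keep_all_features=True,
)

# Extract and save the features of one recording (channel 0 of the WAV file).
extractor.save("S01/sess1", channel=0)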

This process uses the following YAML variables inside conf.yaml:

There is also a method called review_member_variables that resets these member variables back to None based on the features used in the configuration file.

The output of this step can be found in the feat directory inside the outpath directory defined as a YAML variable in the configuration file.

You can download the features used in my model from here (32MB). After downloading, you should extract them in the directory defined by the outpath YAML variable.

4. Choosing Model

In SideKit, there are different models that we can train. I haven't been able to implement all of them, but the following are ready:

- UBM (GMM-UBM), trained by ubm.py
- i-vector, trained by i-vector.py

5. Train

Now, we have everything ready to train our chosen model. We have preprocessed the input data, split it into train (enroll) and test sets, extracted the features, and chosen the preferred model and its configuration. Each model has its own training script: if you chose UBM, run the ubm.py file; if you chose i-vector, run i-vector.py.
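
To give an idea of what GMM-UBM training involves, here is a minimal sketch built on SideKit's FeaturesServer and Mixture classes. The number of Gaussians, thread count, training list, and file names are illustrative assumptions, not the exact contents of ubm.py.

import sidekit

# Serve the features saved by the extraction step (paths and dataset_list are
# assumptions, not the project's exact setup).
features_server = sidekit.FeaturesServer(
    feature_filename_structure="outpath_dir/feat/{}.h5",
    dataset_list=["energy", "cep"],
    feat_norm="cmvn",
    delta=True,
    double_delta=True,
    keep_all_features=False,
)

# Train a GMM-UBM with EM, splitting the mixture until it reaches 512
# components (the component count and training list are hypothetical).
ubm = sidekit.Mixture()
ubm_list = ["S01/sess1", "S01/sess2", "S02/sess1"]   # hypothetical training segments
ubm.EM_split(
    features_server=features_server,
    feature_list=ubm_list,
    distrib_nb=512,
    num_thread=4,
    save_partial=False,
)
ubm.write("outpath_dir/ubm/ubm_512.h5")              # persist the trained UBM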

6. Evaluate

By evaluating the model, I mean getting the accuracy over the test set and drawing the DET graph. This step is done in the model's script.
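
For the DET graph, SideKit ships the DetPlot class from its bosaris module. The sketch below assumes a saved score file and the trial key produced in the structure step; both file names, the plot title, and the operating-point settings are hypothetical.

import sidekit

# Minimal sketch of drawing a DET curve from a saved score file and the trial
# key produced in the structure step (both file names are hypothetical).
scores = sidekit.Scores("task/test_scores.h5")
key    = sidekit.Key("task/test_trials.h5")

det_plot = sidekit.DetPlot(window_style="sre10", plot_title="GMM-UBM on the Arabic Corpus of Isolated Words")
det_plot.set_system_from_scores(scores, key, sys_name="GMM-UBM")
det_plot.create_figure()
det_plot.plot_rocch_det(0)                 # DET curve of system 0
det_plot.plot_DR30_both(idx=0)             # Doddington's rule-of-30 markers
det_plot.plot_mindcf_point(0.01, idx=0)    # minimum-DCF operating point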

TO BE CONTINUED :)