HarryVolek / PyTorch_Speaker_Verification

PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al.
BSD 3-Clause "New" or "Revised" License

Workflow of the project #38

Closed ahsanmemon closed 5 years ago

ahsanmemon commented 5 years ago

Can you describe the workflow of the project? For example:

Step 1: run the preprocess script
Step 2: run the dvector_create script
Step 3: ...and maybe you can take it from here.

bitnom commented 5 years ago

This would be very helpful. I am not understanding this: "The TIMIT .WAV files must be converted to the standard format (RIFF)." I thought RIFF was the WAV header format?

ahsanmemon commented 5 years ago

> This would be very helpful. I am not understanding this "The TIMIT .WAV files must be converted to the standard format (RIFF)." I thought RIFF were wav headers?

Can you point out where it asks to convert the .wav files to RIFF format? I think we can just feed data into the create_dvector.py file and get outputs.

HarryVolek commented 5 years ago

I provided the preprocess script to convert the TIMIT dataset into .npy files suitable for training the neural network, but it should be fairly easy to modify the scripts to suit another dataset if you need.

The TIMIT dataset provides the .WAV files in the "NIST SPHERE format", which does not matter to the neural network, but does matter to the VAD I used in the dvector_create script. Your download of the TIMIT dataset should also contain binaries for converting from the "NIST SPHERE format" to the more standard RIFF.
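For reference, the bundled conversion binary is usually sph2pipe (typically invoked as `sph2pipe -f wav in.WAV out.wav`). For TIMIT's uncompressed files the format is simple enough to convert directly: a NIST SPHERE file is a 1024-byte ASCII header followed by raw PCM. Here is a rough sketch, assuming uncompressed little-endian 16-bit PCM (the real sph2pipe also handles shorten-compressed SPHERE data, which this does not):

```python
import wave

def sphere_to_riff(sph_path, wav_path):
    # NIST SPHERE header: line 1 is "NIST_1A", line 2 is the header
    # size in bytes, then "name -type value" fields until "end_head".
    with open(sph_path, "rb") as f:
        if not f.read(8).startswith(b"NIST_1A"):
            raise ValueError("not a NIST SPHERE file")
        header_size = int(f.read(8).strip())
        f.seek(0)
        header = f.read(header_size).decode("ascii", errors="replace")
        pcm = f.read()  # assumption: uncompressed little-endian PCM follows

    fields = {}
    for line in header.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1].startswith("-"):
            fields[parts[0]] = parts[2]

    # RIFF/WAVE output via the stdlib wave module (writes little-endian PCM)
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(int(fields.get("channel_count", 1)))
        w.setsampwidth(int(fields.get("sample_n_bytes", 2)))
        w.setframerate(int(fields.get("sample_rate", 16000)))
        w.writeframes(pcm)
```

After conversion the files open with any standard WAV reader, so the VAD in dvector_create sees ordinary RIFF input.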

The neural network in the repo can be used standalone for speaker verification, but people were interested in using it as an input for https://github.com/google/uis-rnn for speaker diarization, so I provided the dvector_create.py script to make the output of this repo compatible as an input for theirs.

With this information in mind, if your final goal was to train google's UIS-RNN, the workflow would be: Download a dataset -> preprocess -> train this NN -> dvector_create -> train their NN.
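Put into commands, and assuming the script names currently in the repo (data_preprocess.py, train_speech_embedder.py, dvector_create.py) with paths and hyperparameters set in config/config.yaml, the pipeline would look roughly like:

```shell
# 1. Convert the TIMIT audio into .npy feature files for training
python data_preprocess.py

# 2. Train the speaker embedding network
#    (set training mode in config/config.yaml first)
python train_speech_embedder.py

# 3. Generate d-vectors formatted as inputs for UIS-RNN
python dvector_create.py

# 4. Train Google's UIS-RNN on the generated d-vectors
#    (done in the separate https://github.com/google/uis-rnn repo)
```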

nidhal1231 commented 5 years ago

@HarryVolek
Thank you for this work. I have some questions and would be grateful if you could answer them (this is my graduation project, and it is extremely important to me).

ahsanmemon commented 5 years ago

> I provided the preprocess script to convert the TIMIT dataset into .npy files suitable for training the neural network, but it should be fairly easy to modify the scripts to be suitable to another dataset if you need.
>
> The TIMIT dataset provides the .WAV files in the "NIST SPHERE format", which does not matter to the neural network, but does matter to the VAD I used in the dvector_create script. Your download of the TIMIT dataset should also contain binaries for converting from the "NIST SPHERE format" to the more standard RIFF.
>
> The neural network in the repo can be used standalone for speaker verification, but people were interested in using it as an input for https://github.com/google/uis-rnn for speaker diarization, so I provided the dvector_create.py script to make the output of this repo compatible as an input for theirs.
>
> With this information in mind, if your final goal was to train google's UIS-RNN, the workflow would be: Download a dataset -> preprocess -> train this NN -> dvector_create -> train their NN.

Great. I am done up to the "train this NN" part, on a personal dataset of 10 speakers. Down to what loss value do you think I should train the embedding network?

I am getting the following losses: [screenshot of training loss values]

HarryVolek commented 5 years ago

Hi @ahsanmemon

The absolute value of a loss function at a given point is fairly arbitrary and dependent on the function. The trend is really what matters.

With that being said, from the image you are showing me I don't think the network is trained, as the loss is bouncing around significantly.

However, the best way to know for sure is to switch to testing mode and see the accuracy for yourself.
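Speaker verification accuracy is usually reported as the equal error rate (EER): the operating point where the false-accept and false-reject rates are equal, with lower being better. As a sanity check independent of the repo's own test mode, here is a minimal pure-Python sketch (the function name and brute-force threshold sweep are illustrative, not the repo's implementation):

```python
def compute_eer(genuine, impostor):
    """Equal error rate: the threshold where the false-accept rate
    (impostor pairs scored at or above threshold) equals the
    false-reject rate (genuine pairs scored below it).
    Brute-force sweep over observed scores; fine for small lists."""
    thresholds = sorted(genuine + impostor)
    eer, best_gap = 1.0, float("inf")
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Feed it cosine similarities between embedding pairs from the same speaker (genuine) and different speakers (impostor); a well-trained network should separate the two score distributions, driving the EER toward zero.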

chrisspen commented 4 years ago

If we're all training with the same TIMIT data, shouldn't there be a standard Tloss we should all get at which point the network is trained?

I don't have a GPU, and the CPU version of torch is painfully slow. I see the default setting is to train 950 epochs. However, after 3 hours, it's only been able to complete 15 epochs on my system. At this rate, it's going to take a week just to train on the small TIMIT dataset. Is that normal? Should I terminate the training before that? There doesn't appear to be any break point in the code to end the training when the loss reaches a threshold, which is unusual in NN systems. What should be a "good" Tloss value for the TIMIT data?