astorfi / 3D-convolutional-speaker-recognition

:speaker: Deep Learning & 3D Convolutional Neural Networks for Speaker Verification
Apache License 2.0

Default training not converging #3

Closed skanderm closed 6 years ago

skanderm commented 6 years ago

Running just the train_softmax.py command in the example run.sh script with the sample data doesn't seem to converge, even at 50 epochs.

Command:

python -u ./code/1-development/train_softmax.py --num_epochs=50 --batch_size=3 --development_dataset_path=data/development_sample_dataset_speaker.hdf5 --train_dir=results/TRAIN_CNN_3D/train_logs

Output:

[screenshot: training console output]

Loss:

[screenshot: loss curve]

Learning rate:

[screenshot: learning-rate curve]

astorfi commented 6 years ago

@skanderm The default training is just a running example on a randomly generated dataset, meant to demonstrate the training process and how all the parts connect together! You should try your own dataset in the correct format; the details are available in the associated paper!

skanderm commented 6 years ago

Thanks Sina, that makes sense. I'd hoped the sample data would provide a proportional/converging example, but it's good to know it's not supposed to.

Given the dimensionality of the sample data, represented by (n*x, 80, 40, ζ), with n being the number of speakers and x the number of cubes per speaker:

  1. What is n for development, enrollment, and evaluation? The paper mentions 1083 speakers in the data set. Is n 1083 for each phase? (Sample data: n is 4)
  2. What is x - number of cubes per speaker - for development, enrollment, and evaluation? (In sample data, x is 18-36 for dev and enrollment, 4 for evaluation)
  3. About how many epochs did it take to converge? You mentioned a minimum of 50 in another issue.
  4. The sample development data sets ζ to 20 (and the paper mentions going up to 40). Enrollment and evaluation in the sample data seems to set ζ = 1. Is that correct/representative?
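As a concrete illustration of the layout being asked about, here is a minimal sketch that builds and inspects an HDF5 file with the (n*x, 80, 40, ζ) shape of the sample development data (n = 4, ζ = 20). The key names `utterance_train` and `label_train` are hypothetical; the real sample file may use different keys, so inspecting every key's shape, as below, is the reliable way to recover n*x and ζ.

```python
import h5py
import numpy as np

# Build a tiny synthetic file mimicking the sample layout (n*x, 80, 40, zeta):
# n = 4 speakers, x = 2 cubes per speaker, zeta = 20.
# NOTE: the key names below are hypothetical, not the repo's actual keys.
n, x, zeta = 4, 2, 20
with h5py.File("sample.hdf5", "w") as f:
    f.create_dataset(
        "utterance_train",
        data=np.random.rand(n * x, 80, 40, zeta).astype(np.float32),
    )
    # One integer speaker label per cube.
    f.create_dataset("label_train", data=np.repeat(np.arange(n), x))

# Inspect every dataset's shape: the first dimension is the sample (cube)
# index n*x, then the 80x40 feature map, then the zeta stacked frames.
with h5py.File("sample.hdf5", "r") as f:
    for key in f:
        print(key, f[key].shape)
```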

Thank you!

astorfi commented 6 years ago

First of all, you said, "Given the dimensionality of the sample data, represented by (n*x, 80, 40, ζ), with n being the number of speakers and x being cubes per speaker." My question is: how did you come up with n*x? Did I mention it anywhere like that? The first dimension is simply the sample index, and each sample is a cube. That said, your understanding is correct, and n*x can be (though not always) the number of samples.

  1. n has nothing to do with 1083 or the number of speakers. It is the number of generated samples, though it can be related to the number of speakers. In the dataset there are 1083 speakers, but only around 600 of them are used in the experiments! (Please refer to the paper for further details.)

  2. Same as above.

  3. It is actually dataset-dependent. For us, at least 50 epochs.

  4. In enrollment, ζ is 20 as well, which is the main idea of the paper: creating a bridge between development and enrollment. For evaluation, ζ = 1, but we create copies to get ζ = 20 for the same reasoning. (Please refer to the paper for further details.)
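The copying step described in point 4 can be sketched as follows. This is an illustrative reconstruction, not the repo's actual code: a single-frame evaluation cube (ζ = 1) is repeated along the last axis so its shape matches the ζ = 20 cubes used in development and enrollment.

```python
import numpy as np

# An evaluation sample holds a single stacked frame (zeta = 1). To feed it
# through a network trained on zeta = 20 cubes, copy that frame 20 times
# along the last axis. The 80x40 feature-map shape follows the sample data.
eval_cube = np.random.rand(80, 40, 1).astype(np.float32)  # shape (80, 40, 1)
stacked = np.tile(eval_cube, (1, 1, 20))                  # shape (80, 40, 20)

print(stacked.shape)  # -> (80, 40, 20)
```

Every slice along the last axis of `stacked` is identical, since the copies carry no new information; they only satisfy the network's expected input shape.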

skanderm commented 6 years ago

Thanks @astorfi. No, you didn't mention n*x anywhere in the paper. I figured n*x was a good summary of what the sample data contained.

I'm not seeing 600 speakers mentioned in the paper, just the 1083 speakers:

The dataset that has been used for our experiments is the WVU-Multimodal 2013 dataset [19]. The audio part of WVU-Multimodal dataset consists of up to 4 sessions of interviews for each of the 1083 different speakers. The WVU-Multimodal dataset includes different modalities of data collected over a period from 2013 to 2015. The audio part of data consists of both scripted and unscripted voice samples. For the scripted samples, the participants read a fixed sample of text. For the unscripted samples, the participants answer interview questions that require conversational responses. We only use the scripted audio samples, as only the voice of the subject of interest is present in the sample.

I'm trying to get a sense of the volume and breakdown of data to achieve 86%+ AUC. Roughly how many samples did you require per speaker per phase (dev, enrollment, eval) to achieve those results? I have roughly 100 speakers in my data set and I'm trying to figure out how to split them up.

Thanks again!

astorfi commented 6 years ago

Section VI

The output of the last layer (FC5) will be fed to the softmax layer which has the cardinality of N = 511, where N is the number of speakers in the development phase. For the enrollment and evaluation stages, 100 subjects have been used and the speaker utterances are split into two equal parts for two aforementioned phases.

However, I accept it could be explained in a better way in the paper.

About the number of samples per speaker: in my experiments, I had more than 50 per speaker. Perhaps if you take 50 for development and 50 for enrollment/evaluation, it can demonstrate a fair experiment. However, your background model may not be that strong for generalization. In any case, you can give it a shot!
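The split suggested above (and the paper's "speaker utterances are split into two equal parts" for enrollment and evaluation) could be sketched like this. The function name and dict-of-indices representation are assumptions for illustration, not the repo's actual API.

```python
import random

def split_speaker_utterances(utterances_by_speaker, seed=0):
    """Split each speaker's utterances into two equal halves:
    one half for enrollment, the other for evaluation.

    utterances_by_speaker: dict mapping speaker_id -> list of sample indices.
    Returns (enrollment, evaluation) dicts with the same keys.
    """
    rng = random.Random(seed)
    enrollment, evaluation = {}, {}
    for spk, samples in utterances_by_speaker.items():
        samples = list(samples)
        rng.shuffle(samples)          # shuffle before splitting
        half = len(samples) // 2
        enrollment[spk] = samples[:half]
        evaluation[spk] = samples[half:]
    return enrollment, evaluation

# e.g. ~100 speakers with 100 utterances each -> 50 enrollment + 50 evaluation
data = {spk: list(range(100)) for spk in range(100)}
enroll, evalu = split_speaker_utterances(data)
print(len(enroll[0]), len(evalu[0]))  # -> 50 50
```

Splitting per speaker (rather than globally) keeps every enrollment speaker present in evaluation, which is what a verification trial setup needs.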

skanderm commented 6 years ago

Oops, I might have had an old version of the paper:

[screenshot: excerpt from an older version of the paper]

I'll have a look at the newest version and re-open if there's an issue. Thanks for clarifying!