WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Training accuracy oscillates between 51-52% early while loss decreases slowly. #35

Closed · alamnasim closed this 4 years ago

alamnasim commented 5 years ago

I have 4.5k speakers and 88k utterances (150k pairs of text in total), each between 4 and 10 seconds long, in mixed Hindi-English (98% Hindi).

I tried running VGG-Speaker-Recognition on my dataset, and it gives the following result:

```
Epoch 1/50
Learning rate for epoch 1 is 0.0001.
1130/1130 [==============================] - 11718s 10s/step - loss: 1.6074 - acc: 0.5193
Epoch 2/50
Learning rate for epoch 2 is 0.0001.
1130/1130 [==============================] - 11616s 10s/step - loss: 1.2483 - acc: 0.5243
Epoch 3/50
Learning rate for epoch 3 is 0.0001.
1130/1130 [==============================] - 11591s 10s/step - loss: 1.2341 - acc: 0.5260
Epoch 4/50
Learning rate for epoch 4 is 0.0001.
1130/1130 [==============================] - 11512s 10s/step - loss: 1.2289 - acc: 0.5239
Epoch 5/50
Learning rate for epoch 5 is 0.0001.
1130/1130 [==============================] - 11470s 10s/step - loss: 1.2255 - acc: 0.5281
Epoch 6/50
Learning rate for epoch 6 is 0.0001.
1130/1130 [==============================] - 11548s 10s/step - loss: 1.2246 - acc: 0.5264
Epoch 7/50
Learning rate for epoch 7 is 0.0001.
1130/1130 [==============================] - 11550s 10s/step - loss: 1.2228 - acc: 0.5278
Epoch 8/50
Learning rate for epoch 8 is 0.0001.
1130/1130 [==============================] - 11602s 10s/step - loss: 1.2223 - acc: 0.5273
Epoch 9/50
Learning rate for epoch 9 is 0.0001.
1130/1130 [==============================] - 11620s 10s/step - loss: 1.2211 - acc: 0.5292
Epoch 10/50
Learning rate for epoch 10 is 0.0001.
1130/1130 [==============================] - 11581s 10s/step - loss: 1.2206 - acc: 0.5284
Epoch 11/50
Learning rate for epoch 11 is 0.0001.
1130/1130 [==============================] - 11544s 10s/step - loss: 1.2203 - acc: 0.5272
Epoch 12/50
Learning rate for epoch 12 is 0.0001.
1130/1130 [==============================] - 11467s 10s/step - loss: 1.2196 - acc: 0.5286
Epoch 13/50
Learning rate for epoch 13 is 0.0001.
1130/1130 [==============================] - 11399s 10s/step - loss: 1.2191 - acc: 0.5294
```

As shown above, the training accuracy oscillates between 51-52% and the model seems to have plateaued there. Earlier I trained on 500 speakers (a subset of the 4.5k), and I got the same result then as well.

What could be the reason for this result? Please help @WeidiXie.

i7p9h9 commented 5 years ago

I'm not the author, but I trained this system successfully. Did you try using warmup? Perhaps you should decrease the learning rate after a few epochs without improvement in the loss; in the original paper the authors decrease the LR by a factor of 10 every 32 epochs. Moreover, I think your training time is too long for an epoch of this size; are you training on a GPU? Before training, you should convert the audio files to wav, otherwise feature extraction will be your bottleneck. One more thing: I noticed that SGD works better than Adam on this task. A sketch of both suggestions is below.
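This is not from the repository itself, just a minimal sketch of the two suggestions above using the stock Keras API the repo is built on; the base rate, warmup length, and callback wiring are illustrative values, not the project's actual defaults:

```python
# Sketch: linear warmup followed by the paper's step decay
# (divide the learning rate by 10 every 32 epochs), plus SGD
# with momentum instead of Adam. Values are illustrative.
from keras.callbacks import LearningRateScheduler
from keras.optimizers import SGD

BASE_LR = 0.001
WARMUP_EPOCHS = 5       # illustrative warmup length
DECAY_EVERY = 32        # from the paper: LR / 10 every 32 epochs

def step_decay_with_warmup(epoch):
    if epoch < WARMUP_EPOCHS:
        # ramp up linearly from a fraction of the base rate
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    return BASE_LR * (0.1 ** ((epoch - WARMUP_EPOCHS) // DECAY_EVERY))

lr_callback = LearningRateScheduler(step_decay_with_warmup, verbose=1)
optimizer = SGD(lr=BASE_LR, momentum=0.9, nesterov=True)

# model.compile(optimizer=optimizer,
#               loss='categorical_crossentropy', metrics=['acc'])
# model.fit(..., callbacks=[lr_callback])
```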

alamnasim commented 5 years ago

Thanks for the reply. I used the default warmup_ratio of 0. I tried it on a GPU as well, but it did not use the GPU at all and the training time was the same. All my data files are already in wav format. OK, I will try SGD as well.
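For what it's worth, here is a quick generic check (not from this repo) of whether the TensorFlow backend actually sees a GPU, assuming a TF 1.x-era tensorflow-gpu install; an empty list means Keras is silently running on CPU, which would explain the unchanged training time:

```python
# List the devices TensorFlow can see; with TF 2.x you could use
# tf.config.list_physical_devices('GPU') instead.
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
print([d.name for d in devices if d.device_type == 'GPU'])
# Expect something like ['/device:GPU:0']; an empty list usually
# means the CUDA toolkit or the tensorflow-gpu package is missing.
```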