WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild
362 stars 98 forks

about loss and acc #61

Closed hermanseu closed 4 years ago

hermanseu commented 4 years ago

Hi WeidiXie, thanks for your paper and code.

I have 15,899 speakers and 317,980 utterances (20 utterances per speaker). When I train a model on this data, the accuracy decreases instead of increasing. The batch_size is 8; other params are the defaults. I have checked the data, so that is not the problem; something must be going wrong elsewhere. After 30 epochs, the loss and accuracy are almost the same as at the start. Can you give me some advice to solve the problem? (screenshot: vgg_acc)

hermanseu commented 4 years ago

One more question:

In the readme, the suggested batch_size is 160. I have two GPU cards with 8 GB memory each: with batch_size=16 about half of the GPU memory is used, and with batch_size=32 the GPU runs out of memory. Is my GPU memory too small?

WeidiXie commented 4 years ago

Can you do a learning-rate warmup? For example, use lr=1e-4 at the beginning, then raise it back to 1e-3.
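A warmup like that can be sketched as a function for a Keras `LearningRateScheduler` callback (the epoch counts and rates here are illustrative; the repo exposes a similar `--warmup_ratio` flag):

```python
def warmup_schedule(epoch, base_lr=1e-3, warmup_lr=1e-4, warmup_epochs=5):
    """Linearly ramp the learning rate from warmup_lr up to base_lr.

    Illustrative values only; tune warmup_epochs/base_lr for your run.
    """
    if epoch < warmup_epochs:
        # Linear interpolation between the warmup rate and the target rate.
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    return base_lr

# Usage (sketch): keras.callbacks.LearningRateScheduler(warmup_schedule)
```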

hermanseu commented 4 years ago

Yes, I use learning-rate warmup. Within one epoch, the accuracy decreases continually. Command: python3 main.py --net resnet34s --batch_size 8 --gpu 1 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 64 --multiprocess 8 --loss softmax (screenshot: vgg_acc_2)

I tried switching the environment to Python 2.7, Keras 2.2.4, Tensorflow 1.8.0; the output is the same as with Python 3.

WeidiXie commented 4 years ago

Hm, not sure. It might be because of some library updates, but when I released the code it definitely worked. Check some solved issues, for example:

https://github.com/WeidiXie/VGG-Speaker-Recognition/issues/10#issue-420380352

Anyway, I would then debug from two perspectives:

  1. Pick a small subset and see if you can overfit it.
  2. Remove the VLAD layers and simply do average pooling at the end; if you still can't train, there might be something wrong with the data loader.
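The second check can be sketched without Keras at all: average pooling just collapses the frame axis of the frame-level features into one utterance-level vector (shapes here are illustrative; the repo's real tensors differ):

```python
import numpy as np

def average_pool(frame_features):
    """Collapse (num_frames, feat_dim) frame features to one (feat_dim,) vector.

    Stand-in for the VLAD/GhostVLAD aggregation while debugging: if training
    still fails with this trivial aggregator, suspect the data loader.
    """
    return frame_features.mean(axis=0)
```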

hermanseu commented 4 years ago
  1. I have read all the issues, but I have not found helpful info about my question.

  2. When trying to use the avg aggregation_mode, I get an assertion error from categorical_crossentropy: the target label dim and the predicted label dim do not match, because the output dim of the reshape operation is 3, not 2. It may be a bug, or my versions may not match. After I modified the output dim to 2, the accuracy still decreases. (screenshot: vgg_acc_3)

  3. I picked a small subset of 50 speakers (1,000 utterances) with GhostVLAD; the accuracy is normal, 0.994 after 64 epochs, so I guess it overfitted. I have 15,899 speakers but only 20 utterances per speaker. Maybe the number of utterances per speaker is too small to train the model?
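A minimal sketch of the shape fix described in point 2: if the model's final reshape emits rank-3 predictions, e.g. (batch, 1, num_classes), while the one-hot targets are (batch, num_classes), squeezing the singleton axis aligns them for categorical_crossentropy (shapes are illustrative; in Keras this would be a `Reshape` layer or a squeeze on the model output):

```python
import numpy as np

def squeeze_predictions(pred):
    """Collapse a (batch, 1, num_classes) array to (batch, num_classes).

    Hypothetical helper illustrating the dim-3 vs dim-2 mismatch above;
    not part of the repo.
    """
    assert pred.ndim == 3 and pred.shape[1] == 1, "expected a singleton middle axis"
    return pred.reshape(pred.shape[0], pred.shape[2])
```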

WeidiXie commented 4 years ago

OK, cool, so that means there is nothing wrong with the code; the rest is about your training schedule. Maybe do a curriculum: start from a small number of speakers and gradually add more. This is beyond my responsibility, so I'll close this issue now.
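The curriculum idea could be sketched as a hypothetical helper (not part of the repo) that yields successively larger speaker subsets to train on in stages:

```python
def curriculum_stages(all_speakers, start=1000, growth=2):
    """Yield successively larger speaker subsets, ending with the full set.

    start/growth are illustrative defaults; each stage doubles the pool
    until all speakers are included.
    """
    n = start
    while n < len(all_speakers):
        yield all_speakers[:n]
        n *= growth
    yield all_speakers
```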