WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Why doesn't the wav preprocessing directly use librosa.feature.melspectrogram? What's the difference? #14

Closed mmxuan18 closed 5 years ago

WeidiXie commented 5 years ago

For this part, I simply reused code from other people.

mmxuan18 commented 5 years ago

[screenshot] For this part, the librosa source-code example for magphase does not apply the transpose (.T) operation. Does that have any influence? [screenshot]

The loss and accuracy hardly change; maybe something is wrong. [screenshot]

WeidiXie commented 5 years ago

This is not wrong; please use learning-rate warmup while training.
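A minimal sketch of linear learning-rate warmup, assuming a simple ramp over the first fraction of training steps (the repo's exact schedule may differ; base_lr and warmup_ratio here mirror the --lr and --warmup_ratio flags mentioned in this thread):

```python
def warmup_lr(step, total_steps, base_lr=0.001, warmup_ratio=0.1):
    """Linearly ramp the LR from ~0 to base_lr over the warmup phase,
    then hold it at base_lr (a decay schedule could follow instead)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# e.g. with 1000 total steps, the first 100 steps ramp up:
print(warmup_lr(0, 1000))    # small initial LR
print(warmup_lr(99, 1000))   # reaches base_lr at the end of warmup
print(warmup_lr(500, 1000))  # stays at base_lr afterwards
```

Warmup avoids large, destabilizing updates in the first epochs when the weights are still random, which is often why the loss appears stuck without it.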

mmxuan18 commented 5 years ago

The parameters were set like this: --net resnet34s --batch_size 160 --gpu 0,1 --lr 0.001 --warmup_ratio 0.1 --optimizer adam --epochs 128 --loss softmax

WeidiXie commented 5 years ago

Then it should easily reach an EER around 3.4-3.5, and with longer training it should be around 3.2-3.3: https://github.com/WeidiXie/VGG-Speaker-Recognition/issues/10#issue-420380352