WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Questions about the VLAD pooling implementation compared to the NetVLAD paper #36

Closed mmxuan18 closed 5 years ago

mmxuan18 commented 5 years ago

(image) This is from the NetVLAD author's presentation; as shown by the yellow circle, the feature map x is shared by both branches. Your code differs in a few ways:

1. From the feature map there are two projections, x --> x_fc and x --> x_k_center, and both are passed to the VLAD pooling layer, which performs the softmax and normalization. Compared to NetVLAD, the x --> fc branch looks unnecessary.
2. Why is the max subtracted before computing the softmax? This does not seem very common.
3. NetVLAD first applies intra-normalization and then L2 normalization (one paper reports that this improves accuracy), but here there is only a single L2 normalization.

What benefit do these differences bring? (A rough sketch of the structure I mean is below.)
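For concreteness, here is a minimal NumPy sketch of the aggregation being discussed; the names (`x`, `assign_weights`, `centers`) and shapes are illustrative and not taken from the repository code:

```python
import numpy as np

def vlad_pool(x, assign_weights, centers):
    """Aggregate frame-level features x of shape (T, D) into a VLAD vector.

    assign_weights: (D, K) projection giving cluster-assignment logits
                    (the x --> x_fc branch described above).
    centers:        (K, D) learnable cluster centers
                    (the x --> x_k_center branch).
    """
    logits = x @ assign_weights                       # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract max before softmax
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)              # soft cluster assignments, (T, K)

    residuals = x[:, None, :] - centers[None, :, :]   # (T, K, D)
    vlad = (a[:, :, None] * residuals).sum(axis=0)    # (K, D)

    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)      # single final L2 normalization
```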

I trained both NetVLAD and this VLAD pooling on my dataset with the same optimization parameters and used Grad-CAM to compare the feature-map activations. Sometimes the VLAD pooling activates on the background noise, while NetVLAD does not:

Original code: (image)
Original code with self-attention added after the feature map and before VLAD pooling: (image)
Self-attention after the feature map and before NetVLAD: (image)

WeidiXie commented 5 years ago

VLAD is a general technique for quantisation; I simply used the parts that I think best fit my application. You can use the original NetVLAD implementation if you prefer.

  1. I don't think this matters, as the model can learn to use the same weights if that is really required for performance.

  2. Please check the softmax implementation in any toolbox's source code; subtracting the max is essential for numerical stability.
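For example, a generic numerically stable softmax (a sketch, not the repository's code) looks like this:

```python
import numpy as np

def stable_softmax(logits):
    # exp() overflows for large inputs; subtracting the row maximum leaves the
    # result unchanged because softmax is invariant to a constant shift.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

print(stable_softmax(np.array([[1000.0, 1001.0, 1002.0]])))  # no overflow, rows sum to 1
```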

  3. Too many L2 normalizations make training very difficult: each one maps the features onto a hypersphere, and the gradients likewise collapse into a small space. It does add more regularisation and makes the model more robust, but it slows training down.
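As an illustration of the two normalization schemes being compared (a self-contained sketch with made-up shapes, not the repository's code):

```python
import numpy as np

rng = np.random.default_rng(0)
vlad = rng.standard_normal((8, 512))   # hypothetical (clusters, dim) VLAD matrix

# Single global L2 normalization (as described in the question).
single = vlad.flatten()
single = single / (np.linalg.norm(single) + 1e-12)

# NetVLAD-style: intra-normalize each cluster's residual sum, then global L2.
intra = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
double = intra.flatten()
double = double / (np.linalg.norm(double) + 1e-12)

# Both descriptors end up on the unit hypersphere; the intra-normalized variant
# equalizes each cluster's contribution, adding regularisation but constraining
# the feature space further.
print(np.linalg.norm(single), np.linalg.norm(double))
```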