Closed: seungwonpark closed this issue 5 years ago
Hi, @seungwonpark
During training, it's necessary: the output of this layer is passed to the classifier for identity classification, and you need the non-linearity; otherwise, if it's linear, it collapses into a single linear matrix multiplication.
And if you use this ReLU during training, it means you are not tuning the parameters that produce negative values, because their gradient is 0. Therefore, the absolute values of these negative neurons won't encode any information at all.
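Roughly, the head in question looks like this (a minimal Keras sketch for illustration, not the exact code in model.py; the layer names and sizes here are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_speakers = 1251  # illustrative; the classifier size depends on the training set

pooled = layers.Input(shape=(2048,), name="pooled_features")     # aggregated frame-level features
embedding = layers.Dense(512, name="bottleneck")(pooled)          # 512-d embedding (pre-activation)
embedding = layers.ReLU(name="bottleneck_relu")(embedding)        # gradient is 0 wherever the pre-activation is negative
logits = layers.Dense(n_speakers, activation="softmax",
                      name="speaker_classifier")(embedding)       # identity classification during training

model = tf.keras.Model(pooled, logits)
```

Without the ReLU, `Dense(512)` followed by `Dense(n_speakers)` would compose into one linear mapping before the softmax.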
During testing, it's actually not necessary. If you get rid of this ReLU, the similarity will be in the range [-1, 1], but from my previous experience in Face Recognition, it won't affect the EER (or, equivalently, the ROC in Face Recognition) very much, because for these metrics you only care about the score boundary between positives and negatives, meaning you don't care about the easy negatives and easy positives. The range difference will only affect the easy negatives and positives, pushing them towards -1 or 1.
But you can try it; honestly, I didn't try removing the ReLU from this code during testing.
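For instance, the effect on cosine similarity looks like this (my own quick sketch, not code from the repository):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(512), rng.standard_normal(512)   # stand-ins for pre-ReLU embeddings

# Keeping the ReLU makes both embeddings non-negative, so similarity lands in [0, 1];
# dropping it allows the full [-1, 1] range. The positive/negative score boundary,
# which is what EER/ROC measures, is largely unaffected.
print("without ReLU:", cosine(x, y))
print("with ReLU   :", cosine(np.maximum(x, 0.0), np.maximum(y, 0.0)))
```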
Best, Weidi
Thanks for your reply! Now I understand that it's necessary for training.
> But you can try it; honestly, I didn't try removing the ReLU from this code during testing.
I've just tried it and got an EER of 3.30% on VoxCeleb1-Test. Since the original was 3.22%, it looks like the last ReLU isn't affecting the EER. Moreover, most of the negative components of the embedding vector are close to 0, indicating that the presence of that ReLU during evaluation isn't hurting the embedding much:
```
[[-1.12397848e-02 1.94295887e-02 4.09846008e-03 -3.91268915e-33
2.68148761e-02 -4.01529071e-33 8.63662641e-03 9.84642748e-03
-5.12239663e-33 1.15609013e-01 4.07438426e-31 3.68193574e-02
6.41137436e-02 -6.15314394e-03 5.28822616e-02 3.91481742e-02
-4.16713005e-33 3.22886631e-02 3.15483985e-03 5.89091564e-03
1.68156493e-02 4.49916584e-32 3.11672390e-02 2.95981187e-02
7.33051971e-02 -5.09047498e-33 3.45449373e-02 6.27251342e-02
5.33877313e-03 3.36697400e-02 3.82690169e-02 -2.37590764e-02
4.71225530e-02 1.41386166e-02 3.46583165e-02 -3.63806098e-33
6.70032650e-02 1.84932332e-02 -3.78111643e-33 8.17786777e-32
4.48756069e-02 6.91843927e-02 -4.65606079e-02 2.27666665e-02
1.32091362e-02 7.15097710e-02 2.48701498e-02 -1.92180276e-02
1.86939258e-02 3.24079171e-02 -2.73935352e-32 7.53543153e-02
4.35039513e-02 1.04446925e-01 3.58447544e-02 2.91956440e-02
-5.04244171e-33 1.49013519e-01 1.35290360e-31 -9.76120494e-03
... (truncated)
```
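In case it helps, here is a rough sketch of how an EER number like the ones above can be computed from verification scores (toy scores and labels, not the VoxCeleb trial list):

```python
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([1, 1, 0, 0, 1, 0])               # 1 = same speaker, 0 = different speaker (toy data)
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.5])   # cosine similarities for each trial

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1.0 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]          # operating point where FPR is (approximately) equal to FNR
print(f"EER: {eer:.2%}")
```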
Now my question is resolved. Thank you!
Hi, thanks again for open-sourcing this Speaker Recognition system and for kindly replying to every issue.
I have a question about the model shown here. When evaluating the final 512-dimensional output (the embedding vector), a ReLU activation is applied at the very end, as shown in https://github.com/WeidiXie/VGG-Speaker-Recognition/blob/master/src/model.py#L139-L147. Hence, the output looks like the embedding values I posted above.
Here, we can observe that some values are exactly 0. In my opinion, the last ReLU layer is eliminating some information by erasing all negative values. Moreover, it limits the region of the hypersphere where embeddings can exist by a factor of 1/2^512. So, my question is: was the last ReLU layer necessary? I strongly believe that it was (since this model is currently SotA on Speaker Recognition in the wild!), but I couldn't figure out why the last ReLU layer is needed. I would like to kindly ask you about that. Thanks in advance.
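For reference, one way to test this empirically is to read the embedding out before the last ReLU, e.g. with a sub-model like the hypothetical sketch below (the model path and layer name are placeholders, not the actual names in model.py):

```python
import tensorflow as tf

trained = tf.keras.models.load_model("vggvox_resnet.h5")   # placeholder path to a trained model

# Sub-model that returns the Dense(512) output before the final ReLU.
# "bottleneck" is an assumed layer name, used here only for illustration.
pre_relu = tf.keras.Model(
    inputs=trained.input,
    outputs=trained.get_layer("bottleneck").output,
)

# pre_relu.predict(spectrogram_batch) then yields embeddings that can contain
# negative components, so cosine similarity can fall anywhere in [-1, 1].
```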