Closed mycrazycracy closed 5 years ago
Hi, @mycrazycracy
Thanks for your interest, to answer your questions,
Actually, our work is heavily inspired by previous work on template-based face recognition:
[1] W. Xie, A. Zisserman, "Multicolumn Networks for Face Recognition", In BMVC, 2018.
[2] W. Xie, L. Shen, and A. Zisserman, "Comparator Networks", In ECCV, 2018.
[3] Y. Zhong, R. Arandjelović, and A. Zisserman, "GhostVLAD for Set-based Face Recognition", In ACCV, 2018.
Meanwhile, CNNs are usually so powerful that, in order to generalise well, in my experience I always have to make the training very hard by attacking the model, especially for the open-set verification task.
Best, Weidi
Hi Weidi,
Thanks for the answer.
For the second question, what do you think about using statistics pooling rather than average pooling? If second (or higher) order statistics are used, the scenario (triangle embeddings) you describe will not happen, since the deviation is taken into account. I found that statistics pooling performs well and achieves results comparable to attentive pooling.
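To make the comparison concrete, here is a minimal numpy sketch of the two pooling schemes being discussed (the function names are mine, not from any repo). It shows why concatenating the standard deviation distinguishes two frame sets that average pooling maps to the same embedding:

```python
import numpy as np

def average_pooling(frames):
    """Plain temporal average pooling: [T, D] -> [D]."""
    return frames.mean(axis=0)

def statistics_pooling(frames, eps=1e-6):
    """Statistics pooling: concatenate the mean and the standard
    deviation over time, [T, D] -> [2*D]."""
    mu = frames.mean(axis=0)        # first-order statistics, [D]
    sigma = frames.std(axis=0)      # second-order statistics, [D]
    return np.concatenate([mu, sigma + eps])

# Two frame sets can share the same mean (the "triangle embeddings"
# scenario), but their deviations still tell them apart.
a = np.array([[1.0, 0.0], [-1.0, 0.0]])   # mean (0, 0), spread along x
b = np.array([[0.0, 2.0], [0.0, -2.0]])   # mean (0, 0), spread along y
```

Here `average_pooling(a)` and `average_pooling(b)` are identical, while `statistics_pooling` separates them.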
I see, I've never tried the statistics pooling thing.
When I worked on face recognition before, I thought about bilinear pooling, but it takes too much computation, so I decided to use the "mean" as an anchor, so that every sample only needs to be compared with this anchor, as mentioned in my previous reply (the multicolumn network). This is an approximation of the higher order statistics, and it works pretty well for template-based face recognition, but I haven't tried anything like this on speaker recognition.
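A rough numpy sketch of the anchor idea as described here (this is my own simplification, not the multicolumn network itself, which predicts quality and content weights with small sub-networks): each sample is scored against the set mean, giving N comparisons instead of the N^2 pairwise comparisons a bilinear scheme would need, and the scores re-weight the aggregation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def anchor_weighted_pooling(samples):
    """samples: [N, D]. Use the set mean as an anchor; compare every
    sample only with the anchor (cosine similarity), then aggregate
    with the resulting soft weights. Returns [D]."""
    anchor = samples.mean(axis=0)                       # [D]
    norms = np.linalg.norm(samples, axis=1) * np.linalg.norm(anchor)
    scores = samples @ anchor / (norms + 1e-8)          # [N] similarities
    weights = softmax(scores)                           # [N], sum to 1
    return (weights[:, None] * samples).sum(axis=0)     # weighted mean, [D]
```

Samples far from the anchor get down-weighted, which is the sense in which this approximates using higher order statistics of the set.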
Best, Weidi
Thanks!
I just read the multicolumn network paper. It is interesting to use the quality and the content as the weights (similar to attention weights).
By "an approximation of the higher order statistics", do you mean that each sample's relationship with the mean reflects something like the higher order statistics?
BTW, is there any implementation of the multicolumn network? Thanks.
Yes, that's what I meant by approximating higher order statistics.
It's not publicly available yet, but I can send you the slightly messy version if you need.
Best, Weidi
I see. I may do some research on pooling, so I will include that as a reference. It should be easy to implement the network in TF or PyTorch. Thank you for your replies :-)
Cool.
Hi,
This paper and the idea is pretty interesting! May I ask two questions about the details please?
I found the idea of LDE (learnable dictionary encoding, cited in the paper as [Cai et al.]) very similar to NetVLAD (if not the same). I'm wondering what your opinion is on the difference between LDE and the NetVLAD used in this paper?
After going through the code, I found that the forward propagation of the VLAD and average pooling branches seems different. For average pooling, the output of resnet_2D_v1/v2 is used directly, which makes the shape [batch, 7, 16, D] -> [batch, 84, D] (after pooling, no additional layer).
For VLAD, the output is processed by an additional Conv2D layer, making the shapes: [batch, 7, 16, D] -> [batch, 1, 16, D] (feat, via Conv2D) / [batch, 1, 16, n_clusters] (cluster_score) -> [batch, D * n_clusters] (after VLAD).
The additional layer may lead to better performance. Maybe this is part of the reason why TAP performs poorly in the paper?
Last, the performance comparison in the paper is really useful. Good work :-)