WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Difference between average and VLAD pooling #18

Closed: mycrazycracy closed this 5 years ago

mycrazycracy commented 5 years ago

Hi,

This paper and its idea are pretty interesting! May I ask two questions about the details, please?

  1. I found the idea of LDE (learnable dictionary encoding, cited in the paper as [Cai et al.]) very similar to NetVLAD (if not the same). I'm wondering what your opinion is on the difference between LDE and the NetVLAD used in this paper?

  2. After going through the code, I found that the forward propagation of VLAD and average pooling differs. For average pooling, the output of resnet_2D_v1/v2 is used directly, which makes the shape [batch, 7, 16, D] -> [batch, 84, D] (after pooling, with no additional layer).

For VLAD, the output is processed by an additional Conv2D layer, making the shapes: [batch, 7, 16, D] -> [batch, 1, 16, D] (feat, via Conv2D) / [batch, 1, 16, n_clusters] (cluster_score) -> [batch, D * n_clusters] (after VLAD).
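To make those shapes concrete, here is a minimal NumPy sketch of the two paths (D, n_clusters, and the pooling stand-ins are placeholders, not the repo's actual layers):

```python
import numpy as np

batch, D, n_clusters = 4, 512, 8
feat_map = np.random.randn(batch, 7, 16, D)   # output of resnet_2D_v1/v2

# Average-pooling path: flatten the 7x16 spatial grid, then pool over it.
tap_in = feat_map.reshape(batch, 7 * 16, D)   # [batch, 84, D]
tap_out = tap_in.mean(axis=1)                 # [batch, D]

# VLAD path: an extra Conv2D collapses the first spatial axis (faked here
# with a mean, just to show the shapes), then VLAD aggregates.
feat = feat_map.mean(axis=1, keepdims=True)   # [batch, 1, 16, D]
cluster_score = np.random.randn(batch, 1, 16, n_clusters)
# ... VLAD turns (feat, cluster_score) into [batch, D * n_clusters]
```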

The additional layer may lead to better performance. Maybe this is part of the reason why TAP performs poorly in the paper?

Lastly, the performance comparison in the paper is really useful. Good work :-)

WeidiXie commented 5 years ago

Hi, @mycrazycracy

Thanks for your interest. To answer your questions:

  1. I think you are right; the ideas are similar. Basically, speaker verification requires orderless pooling, and discriminative clustering is a good option for achieving that. Besides, we also realised that the network should be equipped with the capability to evaluate the quality of the data and to reject low-quality signals while doing verification; that's the idea of GhostVLAD (see the sketch after the references below).

Actually, our work is heavily inspired by previous work on template-based face recognition: [1] W. Xie and A. Zisserman, "Multicolumn Networks for Face Recognition", in BMVC, 2018. [2] W. Xie, L. Shen, and A. Zisserman, "Comparator Networks", in ECCV, 2018. [3] Y. Zhong, R. Arandjelović, and A. Zisserman, "GhostVLAD for Set-based Face Recognition", in ACCV, 2018.
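A minimal NumPy sketch of GhostVLAD-style pooling, to illustrate the "reject low-quality signals" idea; the cluster centres and assignment scores would normally be learned, and are random placeholders here:

```python
import numpy as np

def ghostvlad(x, centres, assign_logits, n_ghost):
    """x: [N, D] local descriptors; centres: [K + n_ghost, D];
    assign_logits: [N, K + n_ghost] soft-assignment scores."""
    # Soft-assign each descriptor to every (real + ghost) cluster.
    a = np.exp(assign_logits)
    a /= a.sum(axis=1, keepdims=True)              # [N, K + n_ghost]
    # Weighted residuals against every cluster centre.
    resid = x[:, None, :] - centres[None, :, :]    # [N, K + g, D]
    v = (a[:, :, None] * resid).sum(axis=0)        # [K + g, D]
    # Drop the ghost clusters: low-quality descriptors assigned mostly
    # to them end up contributing little to the kept clusters.
    v = v[:-n_ghost] if n_ghost > 0 else v
    # Intra-normalise per cluster, then L2-normalise the flat vector.
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)

# Usage: 84 descriptors of dim 512, 8 real clusters + 2 ghosts.
x = np.random.randn(84, 512)
centres = np.random.randn(10, 512)
logits = np.random.randn(84, 10)
print(ghostvlad(x, centres, logits, n_ghost=2).shape)  # (4096,)
```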

  2. Yes, the VLAD version does have more parameters. I see your point; that alone is not convincing enough, and I may re-train the TAP (temporal average pooling) network. But from our previous experience, TAP doesn't perform well on verification.
    My personal intuition is that TAP makes the training process too easy. Think of three different points in 2D space: they form a triangle, but if you always optimise the model based on the centre of the triangle, there is no constraint on the three points themselves, so they can be arranged in many ways while keeping their centre unmoved (see the tiny check below). (This is only my intuition; we don't really know what is happening in the high-dimensional space, so don't take it too seriously.)
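A tiny NumPy check of that intuition: a zero-mean perturbation moves the three "corner" points to a very different configuration, yet average pooling's output is unchanged.

```python
import numpy as np

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])      # a triangle
shift = np.array([[1.0, -1.0], [-2.0, 0.5], [1.0, 0.5]])
assert np.allclose(shift.mean(axis=0), 0)                  # zero-mean perturbation

moved = pts + shift                                        # very different triangle
print(pts.mean(axis=0), moved.mean(axis=0))                # identical centres
```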

Meanwhile, CNNs are usually so powerful that, in order to generalise well, my experience is that I always have to make the training very hard by attacking the model, especially for the open-set verification task.

Best, Weidi

mycrazycracy commented 5 years ago

Hi Weidi,

Thanks for the answer.

For the second question, what do you think about using statistics pooling rather than average pooling? If second- (or higher-) order statistics are used, the scenario (triangle embeddings) you describe will not happen, since the deviation is also considered. I found that statistics pooling performs well and achieves results comparable to attentive pooling.
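For reference, here is a minimal sketch of the statistics pooling I mean (mean plus standard deviation, as in x-vector style systems; the shapes are placeholders):

```python
import numpy as np

def statistics_pooling(frames, eps=1e-12):
    """frames: [T, D] frame-level features -> [2 * D] utterance vector."""
    mu = frames.mean(axis=0)
    # The std term captures the spread that the triangle example loses,
    # so zero-mean rearrangements of the frames now change the output.
    sigma = np.sqrt(frames.var(axis=0) + eps)
    return np.concatenate([mu, sigma])

print(statistics_pooling(np.random.randn(84, 512)).shape)  # (1024,)
```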

WeidiXie commented 5 years ago

I see. I've never tried statistics pooling.

When I worked on face recognition before, I thought about bilinear pooling, but it takes too much computation, so I decided to use the "mean" as an anchor: every sample only needs to be compared with this anchor, as in my previous reply (the Multicolumn Networks). This is an approximation of the higher-order statistics, and it works pretty well on template-based face recognition, but I haven't tried anything like it on speaker recognition.
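Roughly like this (a hedged sketch of the mean-as-anchor idea, not the exact Multicolumn Networks formulation; the cosine weighting is just illustrative):

```python
import numpy as np

def mean_anchor_pool(x):
    """x: [N, D] per-sample embeddings -> [D] aggregated embedding."""
    anchor = x.mean(axis=0)
    # Compare every sample with the anchor only (N comparisons), instead
    # of all N * (N - 1) / 2 pairs as bilinear/second-order pooling would.
    sim = x @ anchor / (np.linalg.norm(x, axis=1) * np.linalg.norm(anchor) + 1e-12)
    w = np.exp(sim) / np.exp(sim).sum()            # attention-like weights
    return (w[:, None] * x).sum(axis=0)

print(mean_anchor_pool(np.random.randn(10, 512)).shape)  # (512,)
```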

Best, Weidi

mycrazycracy commented 5 years ago

Thanks!

I just read the Multicolumn Networks paper. It is interesting to use the quality and the content as weights (similar to attention weights).

"an approximation of the higher order statistics" do you mean the relationship with the mean reflects the "something" as the higher order statistics?

mycrazycracy commented 5 years ago

BTW, is there any implementation of the Multicolumn Networks available? Thanks.

WeidiXie commented 5 years ago

Yes, that's what I meant by approximating higher order statistics.

It's not publicly available yet, but I can send you a slightly messy version if you need it.

Best, Weidi

mycrazycracy commented 5 years ago

I see. I may do some research on pooling, so I will include that as a reference. It should be easy to implement the network in TF or PyTorch. Thank you for your replies :-)

WeidiXie commented 5 years ago

Cool.