Yunfan-Li / Contrastive-Clustering

Code for the paper "Contrastive Clustering" (AAAI 2021)
MIT License

Cluster Assignment Entropy #4

Closed. LukasBommes closed this issue 3 years ago.

LukasBommes commented 3 years ago

Hey Yunfan,

first of all, this is really great work and a well-written paper! Thanks for providing the code.

I am trying to reimplement your method and am a bit confused about the way you penalize the cluster assignment matrix. In your code you do:

# sum the soft assignments over the batch to get per-cluster probabilities (first view)
p_i = c_i.sum(0).view(-1)
p_i /= p_i.sum()
# log(K) plus the negative entropy of the cluster distribution
ne_i = math.log(p_i.size(0)) + (p_i * torch.log(p_i)).sum()
# same for the second view
p_j = c_j.sum(0).view(-1)
p_j /= p_j.sum()
ne_j = math.log(p_j.size(0)) + (p_j * torch.log(p_j)).sum()
ne_loss = ne_i + ne_j

This gives a loss of 1.97 for the example cluster assignment matrix Y below, with 3 clusters and 2 samples plus their 2 augmented counterparts:

Y = torch.Tensor([[0.98, 0.01, 0.01],
                   [0.98, 0.01, 0.01],
                   [0.98, 0.01, 0.01],
                   [0.98, 0.01, 0.01]])

c_i = Y[:2]
c_j = Y[2:]

Your code seems to differ quite a lot from how it is written in the paper. Based on the paper, I would have done the following:

Y_one_norm = torch.linalg.norm(Y, ord=1)
c_i = c_i.sum(dim=0)/Y_one_norm
c_j = c_j.sum(dim=0)/Y_one_norm
ne_loss = (c_i*torch.log(c_i)+c_j*torch.log(c_j)).sum()

which gives a loss of -1.20.

Could you kindly let me know the intention behind implementing the loss the way you did, and why it seems to differ from the math in the paper?

Thanks!

Lukas

Yunfan-Li commented 3 years ago

Hi, Lukas,

Sorry for the mistake we made in the current arXiv version of the paper. I confused the definition of the L1-norm of a vector with that of a matrix. The L1-norm here should be replaced by the sum of all entries in the matrix, i.e., the batch size (each row of the assignment matrix sums to one). This normalization is necessary to compute the cluster assignment entropy over the entire batch. Besides, the summation should be computed on each cluster assignment matrix separately, namely on Y^a and Y^b rather than on Y.
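
Concretely, writing N for the batch size and K for the number of clusters (my notation here, not the paper's), the entropy of each view is computed as

P^a_k = (1/N) \sum_{n=1}^{N} Y^a_{nk},    H(Y^a) = -\sum_{k=1}^{K} P^a_k \log P^a_k,

and likewise for Y^b.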

I believe the current code should match the modified definition above. For the given example, your implementation should be modified accordingly to:

# normalize each view by the sum of its entries (equal to the batch size)
Y_i_sum = Y[:2].sum()
c_i = c_i.sum(dim=0)/Y_i_sum
Y_j_sum = Y[2:].sum()
c_j = c_j.sum(dim=0)/Y_j_sum
ne_loss = (c_i*torch.log(c_i)+c_j*torch.log(c_j)).sum()

and this should give the result -0.2238. In our implementation, in addition to the standard entropy, we add a constant term math.log(p_i.size(0)) (i.e., log K, where K is the number of clusters) to make sure the entropy loss is always non-negative. If you add this term to the loss, namely,

ne_loss += math.log(c_i.size(0)) + math.log(c_j.size(0))

it would give 1.9734, which is consistent with our implementation.
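
For completeness, here is a small self-contained check (just reusing the example Y from above; the variable names are mine) that reproduces both numbers:

import math
import torch

Y = torch.tensor([[0.98, 0.01, 0.01],
                  [0.98, 0.01, 0.01],
                  [0.98, 0.01, 0.01],
                  [0.98, 0.01, 0.01]])
c_i, c_j = Y[:2], Y[2:]

# entropy-only version (your modified snippet)
p_i = c_i.sum(0) / c_i.sum()
p_j = c_j.sum(0) / c_j.sum()
ne_loss = (p_i * torch.log(p_i)).sum() + (p_j * torch.log(p_j)).sum()
print(round(ne_loss.item(), 4))  # -0.2238

# with the constant log(K) added for each view (as in our code)
ne_loss = ne_loss + math.log(p_i.size(0)) + math.log(p_j.size(0))
print(round(ne_loss.item(), 4))  # 1.9734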

Thanks a lot for pointing out the mistake and hope this could resolve your confusion. Do let me know if you have any further questions.

Yunfan

LukasBommes commented 3 years ago

Hey Yunfan,

thanks a lot for the thorough explanation. Now, it makes sense to me.

Lukas

Hzzone commented 3 years ago

Hi, I have the following questions about your paper:

First, I really appreciate your great work, which gave me significant insight into online clustering with contrastive learning. However, your online clustering is somewhat similar to [SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, NeurIPS 2020], not only to the two papers mentioned in https://github.com/Yunfan-Li/Contrastive-Clustering/issues/2, although I think there are quite a few differences between the two papers.

Second, you use the entropy to avoid the trivial solution in Eq. 6, which is similar to SwAV, where it is used to partition the samples equally, an idea taken from [Self-labelling via simultaneous clustering and representation learning, ICLR 2020]. I have not read your paper carefully, but I think it would be appropriate to cite these two papers.

Third, the entropy in Eq. 6 should be H(x) = -\sum_{i=1}^{n} p(x_i)\log p(x_i). Please see Eq. 3 of SwAV; I agree that your paper is right. However, to avoid the trivial solution, the hyper-parameter of the entropy regularization should always be larger than zero. Specifically, the ε of SwAV is 0.05, while it is -1 in your code (I emphasize that your code is not the same as your paper).

Yunfan-Li commented 3 years ago

Thanks for your kind advice.

As you said, there are quite a few differences between our work and SwAV. SwAV also uses the idea of "label as representation"; however, it still performs contrastive learning only at the instance level, whereas our work simultaneously performs contrastive learning at both the instance and cluster levels. By the way, we have cited the second work (SL) in the camera-ready version.

For the entropy loss, I think my code does match the paper except that a constant term is added, as explained above (a minus sign is missing in the definition of entropy in the arXiv version, which may be what you are referring to). As for the hyper-parameter of the entropy regularization, we did not explicitly fine-tune it and found that a simple addition already gives promising results. The hyper-parameter is larger than zero in SwAV because they maximize the objective function in their Eq. 3, while we minimize the contrastive loss.
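
In other words, writing L_con for the contrastive part of the loss and p for the batch-level cluster probabilities (my shorthand), each view contributes

\log K + \sum_k p_k \log p_k = \log K - H(p)

to the loss we minimize, so the entropy H(p) is effectively being maximized; the apparent -1 coefficient thus plays the same role as a positive entropy weight in a maximization objective such as SwAV's Eq. 3.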

Hzzone commented 3 years ago

You are right, thanks for your explanation.