an easy explanation👍
Hi, thanks for the nice post! What if there is a repetitive class in a batch? For example, multiple cat images in a batch. In that case, encountering same-class images (e.g., the cat images in the batch) classified as a negative pair is inevitable. How does SimCLR treat this situation?
@yoon28 It's a limitation of SimCLR. I had the same thought and asked the original paper author here. He replied that they do nothing to handle this.
It might be because the original SimCLR is trained at a scale where batches are as large as 8K images, so this problem has less impact on performance.
New papers such as PCL have tried tackling it through clustering. Alternatively, other works such as BYOL have removed the need for negative pairs itself.
Excellent job explaining it in such a simple manner! It was a great read. As regards the random chance of positive pairs being negative in the training set batch, it depends on the number of classes. Imagenet has 1000 classes, so with a batch size of 8192, and assuming equal number of samples in the entire dataset for each class, the chances of this happening are very small.
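To make the back-of-the-envelope argument above concrete, here is a small NumPy sketch (the class count and batch size are taken from the comment; the uniform-class assumption is an idealization) estimating what fraction of an anchor's "negatives" actually share its class:

```python
import numpy as np

C = 1000   # ImageNet classes
B = 8192   # SimCLR's largest batch size

rng = np.random.default_rng(0)
labels = rng.integers(0, C, size=B)        # simulated class labels for one batch
anchor = labels[0]
same_class = int(np.sum(labels[1:] == anchor))  # same-class "false negatives"
frac = same_class / (B - 1)
print(f"{same_class} collisions, fraction roughly {frac:.4f}")
```

With balanced classes the expected fraction is about `(B / C) / B = 1 / C`, i.e. roughly 0.1% of the negatives per anchor, which supports the intuition that the effect on training is small even though some collisions are almost certain in every batch.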
Do you know why the original image is not used in the positive pair? For e.g., why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so you only do one augmentation)? I am guessing since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas having the original could lead to poorer generalization.
Another question - why do the authors of the paper throw away the function g(.)? Isn't it possible that the additional non-linear transformation helps with a better representation?
To add to the above, the authors write "Furthermore, even when nonlinear projection is used, the layer before the projection head, h, is still much better (>10%) than the layer after, z = g(h), which shows that the hidden layer before the projection head is a better representation than the layer after.", but they don't give reasons why.
Never mind - they actually do conjecture why this is so. Their reasoning is that " In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects."
I seek clarification with the following statement:
"We calculate the loss for the same pair a second time as well where the positions of the images are interchanged."
Isn't the loss using cosine similarity which is scalar and commutative? Why the need to check similarity for Image B with A when Image A with B would be the same value?
Maybe I am missing something.
@espkh4 The similarity in the numerator for Image A and Image B would be the same even after interchange of position. You have understood it correctly till that part.
But for the dissimilar (negative) images in the denominator, position matters. For A, the first loss term makes it similar to B and dissimilar to the other images. So for B, we also need to make it dissimilar to the other images and similar to A. The denominator of the second loss term does that; we interchange positions to achieve it.
I hope that clears the confusion. Feel free to comment if it still doesn't make sense.
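The asymmetry described above can be seen numerically. Below is a small NumPy sketch (with made-up embeddings) of the per-pair loss from the paper: the numerator sim(A, B) is symmetric, but l(A, B) and l(B, A) differ because each denominator sums the anchor's similarities to everything else:

```python
import numpy as np

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def nt_xent_pair(z, i, j, tau=0.5):
    """Loss l(i, j) for anchor i with positive j:
    -log( exp(sim(z_i, z_j)/tau) / sum_{k != i} exp(sim(z_i, z_k)/tau) )."""
    sims = np.array([cosine_sim(z[i], z[k]) for k in range(len(z))])
    logits = np.exp(sims / tau)
    denom = logits.sum() - logits[i]          # indicator 1[k != i] drops only k == i
    return -np.log(logits[j] / denom)

rng = np.random.default_rng(42)
z = rng.normal(size=(6, 4))                   # 6 toy embeddings (3 augmented pairs)

# The numerator term is symmetric...
assert np.isclose(cosine_sim(z[0], z[1]), cosine_sim(z[1], z[0]))
# ...but the two directional losses differ: l(0, 1) sums the similarities
# of z_0 to all other embeddings in its denominator, while l(1, 0) uses z_1.
print(nt_xent_pair(z, 0, 1), nt_xent_pair(z, 1, 0))
```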
Do you know why the original image is not used in the positive pair? For e.g., why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so you only do one augmentation)? I am guessing since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas having the original could lead to poorer generalization.
What you suggest is one possible reason.
@amitness Yes, perfect. Thank you for that explanation. That clears it.
Hi Amit, I am trying to implement SimCLR on the FashionMNIST dataset. My implementation is done, but I am having some issues with my analysis and a huge running time. Can we have a small online meeting? I really need your help. Could you please help me in any way?
@RezwanCode Can you add it to a private repo and invite me as a collaborator? I can have a look and give you feedback directly on the repo.
My github username is amitness
What is the real purpose of the temperature term in the loss function? Can you please help me understand it with an intuitive example? Also, I found this temperature term in the MoCo paper; do both mean the same thing?
I found the following comment on this blog post (https://towardsdatascience.com/contrasting-contrastive-loss-functions-3c13ca5f055e), but I don't think I really understood what it means.
"Chen et al. found that an appropriate temperature parameter can help the model learn from hard negatives. In addition, they showed that the optimal temperature differs on different batch sizes and number of training epochs."
Thanks
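One way to build intuition for the quoted claim is to look at how temperature reshapes the softmax over negatives. In this sketch (the similarity values are made up for illustration), lowering the temperature concentrates the softmax weight on the "hard" negative with the highest similarity, so it dominates the loss and its gradient:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical similarities of an anchor to 4 negatives:
# one "hard" negative (0.8) and three easier ones.
sims = np.array([0.8, 0.1, 0.0, -0.2])

for tau in (1.0, 0.5, 0.1):
    w = softmax(sims / tau)
    print(f"tau={tau}: weight on hard negative = {w[0]:.3f}")
```

At tau=1.0 the weights are fairly spread out, while at tau=0.1 almost all the weight sits on the hard negative, which matches the idea that an appropriate temperature helps the model learn from hard negatives.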
Great explanation!
Hi Amit, Nice explanation! A couple of questions regarding SSL in Computer Vision:
What happens if in a training batch, a majority of the images belong to the same class. Will an SSL algorithm not fail, because it will push images (embeddings) of the same class further apart?
Also, can you please explain what causes an SSL algorithm to push images of the same class closer to each other? Because, as far as training is concerned, only the original and its augmented versions are being pushed close to each other. What causes other images of the same class to end up closer to each other?
@vahuja4 Regarding your first question, I had the same curiosity. The author answered it here and says that they don't handle it explicitly. I'm guessing it's not a problem for such a large dataset, since it's very rare that many same-class images would end up in the same batch.
Awesome explanation bro! Thank you for your efforts.
Thank you so much for this clear explanation!
Thank you very much! It really benefited my understanding of contrastive learning.
You are the best of the best! Nice explanation of the paper; you made it easy.
It's a really great article. I have a question I want to ask you. As the paper shows, a large batch size yields good results due to the larger number of negative pairs. Am I right? That means we need a lot of memory to hold the batch. My question is: if I use the multi-GPU method in this example https://keras.io/guides/distributed_training/, will it still work as if we are training with a large batch size? Or does it only help train faster?
Thank you in advance
@MailSuesarn You're right! However, multi-GPU training brings a new problem if you use batch normalization directly. The paper shows that there could be information leakage if batch normalization is applied locally to the small per-GPU batches.
So, instead of that, they use a "global batch normalization" where the statistics are calculated across all the images in all the GPUs. See this video.
You can search around to see if there is an easy-to-use implementation of that in Keras.
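To see why local and global statistics disagree, here is a small NumPy sketch that simulates one batch sharded across 4 "GPUs" (the shard count and feature shape are made up). Normalizing each shard with its own mean and variance gives different outputs than normalizing with statistics over the full batch, which is the discrepancy that global batch normalization removes:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(loc=2.0, scale=3.0, size=(256, 8))   # full batch of features

def bn(x):
    # Normalize each feature column with the statistics of x itself.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)

# "Global" BN: statistics computed over the whole batch.
global_out = bn(batch)

# "Local" BN: batch split across 4 simulated GPUs, each shard normalized
# with its own statistics (the setup that leaks information in SimCLR).
shards = np.split(batch, 4)
local_out = np.concatenate([bn(s) for s in shards])

# The two outputs disagree; aggregating mean/variance across devices
# before normalizing makes local BN match the global result.
print(np.abs(global_out - local_out).max())
```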
How do we train such a network if we do not have multiple GPUs?
Thanks for your explanation. A question: when we have the similarity, why do we not use that directly as the loss function? Why do we feed the similarity into a softmax?
Thank you for the clear explanation! I have one question on the Noise Contrastive Estimation (NCE) loss, which has the indicator 1[k != i] in the denominator. According to the paper, 1[k != i] = 1 if k is not the ith image, meaning that only negative pairs are summed up in the denominator. Please correct me if I'm wrong.
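A small NumPy sketch of that denominator mask may help (the similarity values below are made up). Note that, per the loss in the paper, the indicator 1[k != i] drops only the k = i self-similarity term, so the positive pair's term also appears in the denominator alongside the negatives:

```python
import numpy as np

def nt_xent_denominator_terms(sim_row, i, tau=0.5):
    """Terms summed in the denominator for anchor i: every k except k == i."""
    mask = np.ones_like(sim_row, dtype=bool)
    mask[i] = False                       # indicator 1[k != i]
    return np.exp(sim_row[mask] / tau)

# Toy similarity row for anchor i = 0 against 2N = 4 embeddings:
# index 0 is sim(z_0, z_0) = 1, index 1 is the positive pair.
sim_row = np.array([1.0, 0.9, 0.2, -0.1])
terms = nt_xent_denominator_terms(sim_row, i=0)
print(len(terms))   # 3 terms: the positive (index 1) plus two negatives
```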
nice explanation!
Thank you for the easy explanation!
Very good and detailed article, but I have a problem. If my dataset only has 2 labels, dog and cat, is training with batch size = 256 possible? Suppose a batch has 128 cats and 128 dogs; when calculating the loss for one cat positive pair, does it still treat the remaining 127 cat images as negatives?
@HuyTrinh212 If you already know the class labels, SimCLR wouldn't be a relevant model to apply. It's supposed to be used when you want to learn representations across a diverse set of unlabeled images. For the cat/dog example, a simple supervised binary classification would work better.
However, your intuition on the issue you pointed out is correct. If a lot of same class images end up in the batch, it doesn't make sense to treat them as negatives. That's one of the drawbacks of the SimCLR approach. Some follow-up papers have used heuristics such as clustering images so that each batch has images from different clusters.
@amitness Very good explanation, thank you. I will read more about heuristics and clustering to try.
The Illustrated SimCLR Framework
A visual guide to the SimCLR framework for contrastive learning of visual representations.
http://amitness.com/2020/03/illustrated-simclr/