amitness / blog-comments

Repo to host the comments with https://utteranc.es/
https://amitness.com/

2020/03/illustrated-simclr/ #16

Closed utterances-bot closed 5 months ago

utterances-bot commented 4 years ago

The Illustrated SimCLR Framework

A visual guide to the SimCLR framework for contrastive learning of visual representations.

http://amitness.com/2020/03/illustrated-simclr/

February24-Lee commented 4 years ago

An easy explanation 👍

yoon28 commented 4 years ago

Hi, thanks for the nice post! What if a class repeats within a batch, for example multiple cat images? In that case, some same-class images (e.g., the cat images in the batch) will inevitably be treated as negative pairs. How does SimCLR handle this situation?

amitness commented 4 years ago

@yoon28 It's a limitation of SimCLR. I had the same thought and asked the original paper's author here. He replied that they do nothing to handle this.

It might be because SimCLR is trained at a scale where batches hold as many as 8K images, so this problem has less impact on performance.

Newer papers such as PCL have tried tackling it through clustering. Alternatively, other works such as BYOL have removed the need for negative pairs entirely.

maxmaxmir commented 4 years ago

Excellent job explaining it in such a simple manner! It was a great read. As for the chance of a positive pair's class also appearing among the negatives in a training batch, it depends on the number of classes. ImageNet has 1000 classes, so assuming an equal number of samples per class, any given negative pair has only about a 1-in-1000 chance of sharing a class; even with a batch size of 8192, the fraction of such false negatives stays small.

maxmaxmir commented 4 years ago

Do you know why the original image is not used in the positive pair? That is, why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so only one augmentation is done)? I am guessing that since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas keeping the original could lead to poorer generalization.

maxmaxmir commented 4 years ago

Another question: why do the authors of the paper throw away the function g(.)? Isn't it possible that the additional non-linear transformation helps produce a better representation?

maxmaxmir commented 4 years ago

To add to the above, the authors write "Furthermore, even when nonlinear projection is used, the layer before the projection head, h, is still much better (>10%) than the layer after, z = g(h), which shows that the hidden layer before the projection head is a better representation than the layer after.", but they don't give reasons why.

maxmaxmir commented 4 years ago

Never mind - they actually do conjecture why this is so. Their reasoning is that "In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects."
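In code terms, my understanding of the setup is something like the toy sketch below (the layer sizes are made up; the encoder is a stand-in for the ResNet f(.)):

```python
import torch

encoder = torch.nn.Linear(2048, 2048)  # stand-in for the ResNet encoder f(.)
projection = torch.nn.Sequential(      # the projection head g(.)
    torch.nn.Linear(2048, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 128),
)

x = torch.randn(4, 2048)  # pretend these are already-encoded image features
h = encoder(x)            # h: kept after pretraining, fed to downstream tasks
z = projection(h)         # z = g(h): used only in the contrastive loss, then discarded
```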

espkh4 commented 4 years ago

I seek clarification with the following statement:

"We calculate the loss for the same pair a second time as well where the positions of the images are interchanged."

Isn't the loss based on cosine similarity, which is scalar and commutative? Why the need to check the similarity of Image B with A when Image A with B would give the same value?

Maybe I am missing something.

amitness commented 4 years ago

@espkh4 The similarity in the numerator for Image A and Image B would indeed be the same after interchanging the positions. You have understood that part correctly.

But for the dissimilar (negative) images in the denominator, position matters. Through the first loss function, we made A similar to B and dissimilar to the other images. For B, we also need to make it dissimilar to the other images and similar to A, and the denominator of the second loss function does exactly that. Interchanging the positions achieves this.
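To make the asymmetry concrete, here is a minimal PyTorch sketch of the per-pair loss (my own illustration, not the official implementation; tau and the toy shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

def nt_xent_pair_loss(z, i, j, tau=0.5):
    # z: (2N, d) projections for all augmented images in the batch.
    z = F.normalize(z, dim=1)               # dot products become cosine similarities
    sim = z @ z.T / tau                     # (2N, 2N) scaled similarity matrix
    mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))  # drop k == anchor from the denominator
    # -log( exp(s_ij) / sum_{k != i} exp(s_ik) )
    return -F.log_softmax(sim[i], dim=0)[j]

z = torch.randn(6, 128)            # toy batch: 3 images -> 6 augmented views
print(nt_xent_pair_loss(z, 0, 1))  # anchor A, positive B
print(nt_xent_pair_loss(z, 1, 0))  # anchor B, positive A: same numerator,
                                   # but the denominator sums over row 1, not row 0
```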

I hope that clears the confusion. Feel free to comment if it still doesn't make sense.

amitness commented 4 years ago

> Do you know why the original image is not used in the positive pair? That is, why do we need x_i and x_j, both augmented from x, instead of using x and x_i (so only one augmentation is done)? I am guessing that since x_i and x_j are generated using stochastic augmentation, they'll be different every time, whereas keeping the original could lead to poorer generalization.

What you suggest is one possible reason.

espkh4 commented 4 years ago

@amitness Yes, perfect. Thank you for that explanation. That clears it.

RezwanCode commented 3 years ago

Hi Amit, I am trying to implement SimCLR on the FashionMNIST dataset. My implementation is done, but I am having some issues with my analysis and a huge running time. Can we have a small online meeting? I really need your help. Could you please help me in any way?

amitness commented 3 years ago

@RezwanCode Can you add it to a private repo and invite me as a collaborator? I can have a look and give you feedback directly on the repo.

My github username is amitness

Mushtaqml commented 3 years ago

What is the real purpose of the temperature term in the loss function? Could you help me understand it with an intuitive example? I also found this temperature term in the MoCo paper; do both mean the same thing?

I found the following comment on this blog post (https://towardsdatascience.com/contrasting-contrastive-loss-functions-3c13ca5f055e), but I don't think I really understood what it means.

"Chen et al. found that an appropriate temperature parameter can help the model learn from hard negatives. In addition, they showed that the optimal temperature differs on different batch sizes and number of training epochs."

Thanks

jimmykimmy68 commented 3 years ago

Great explanation!

vahuja4 commented 3 years ago

Hi Amit, Nice explanation! A couple of questions regarding SSL in Computer Vision:

  1. What happens if, in a training batch, a majority of the images belong to the same class? Will an SSL algorithm not fail, because it will push embeddings of the same class further apart?

  2. Also, can you please explain what causes an SSL algorithm to pull images of the same class closer to each other? As far as training is concerned, only the original and its augmented versions are pushed close together, so what causes other images of the same class to end up closer to each other?

amitness commented 2 years ago

@vahuja4 Regarding your first question, I had the same curiosity. The author answered it here and says that they don't handle it explicitly. I'm guessing it's not a problem for such a large dataset, since it's very rare that mostly same-class images would end up in the same batch.

dhiren-hamal commented 2 years ago

Awesome explanation bro! Thank you for your efforts.

jessie-chen99 commented 2 years ago

Thank you so much for this clear explanation!

WuYHH commented 2 years ago

Thank you very much! It really helped my understanding of contrastive learning.

mohkuwait commented 2 years ago

You are the best of the best. Nice explanation of the paper; you made it easy.

MailSuesarn commented 2 years ago

It's a really great article. I have a question I want to ask you. As the paper shows, a large batch size yields good results due to the larger number of negative pairs, am I right? That means we need a lot of memory to hold the batch. My question is: if I use the multi-GPU method from this example https://keras.io/guides/distributed_training/, will it still work as if we were training with a large batch size? Or does it only help train faster?

Thank you in advance

amitness commented 2 years ago

@MailSuesarn You're right! Though multi-GPU training brings a new problem if you use batch normalization directly. The paper shows that there can be information leakage if batch normalization is applied locally to the small per-GPU batches.

So, instead of that, they use a "global batch normalization" where the statistics are calculated across all the images on all the GPUs. See this video.

You can search around to see if there is an easy-to-use implementation of that in Keras.
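For reference, if you were in PyTorch instead, the conversion is essentially a one-liner. A minimal sketch (the toy model is made up, and it assumes you then wrap the model in DistributedDataParallel):

```python
import torch

# Toy encoder containing ordinary BatchNorm layers.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
)

# Swap every BatchNorm for SyncBatchNorm, which computes the batch
# statistics across all GPUs instead of per-device mini-batches.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```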

rjrobben commented 1 year ago

How do we train such a network if we do not have multiple GPUs?

M-Amrollahi commented 1 year ago

Thanks for your explanation. A question: once we have the similarity, why do we not use it directly as the loss function? Why do we feed the similarity into a softmax?

crazyboy9103 commented 1 year ago

Thank you for the clear explanation! I have one question about the Noise Contrastive Estimation (NCE) loss, which has the indicator 1[k != i] in the denominator. According to the paper, 1[k != i] = 1 if k is not from the i-th image, meaning that only negative pairs are summed up in the denominator. Please correct me if I'm wrong.
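For reference, the per-pair loss as written in the SimCLR paper, where sim(u, v) is the cosine similarity and τ the temperature:

$$
\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}
$$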

HangLuoWh commented 1 year ago

nice explanation!

darkenergy814 commented 1 year ago

Thank you for the easy explanation!

HuyTrinh212 commented 1 year ago

Very good and detailed article, but I have a problem. If my dataset only has 2 labels, dog and cat, is training with batch size = 256 still possible? Suppose a batch has 128 cats and 128 dogs; when calculating the loss for one cat as the positive, are the remaining 127 cat images still treated as negatives?

amitness commented 1 year ago

@HuyTrinh212 If you already know the class labels, SimCLR wouldn't be a relevant model to apply. It's meant for learning representations from a diverse set of unlabeled images. For the cat/dog example, simple supervised binary classification would work better.

However, your intuition about the issue you pointed out is correct: if a lot of same-class images end up in the batch, it doesn't make sense to treat them as negatives. That's one of the drawbacks of the SimCLR approach. Some follow-up papers have used heuristics such as clustering images so that each batch contains images from different clusters.

HuyTrinh212 commented 1 year ago

@amitness Very good explanation, thank you. I will read more about heuristics and clustering to try them.