HobbitLong / PyContrast

PyTorch implementation of Contrastive Learning methods

For questions and feedback about "InfoMin", please leave them here. #3

Open · HobbitLong opened this issue 4 years ago

HobbitLong commented 4 years ago

Firstly, my apologies that this repo currently contains much more than just "InfoMin", and that the view learning experiments are not released here. I have realized that "InfoMin" should have a separate repo to host the fun experiments; I will set one up in the future.

If you have specific questions about the "InfoMin" paper, you can leave them under this issue.

WOWNICE commented 4 years ago

Hey, amazing work!

But I think there might be a typo in Proposition 4.1, where

`(v_1, v_2) = \min_{v_1, v_2} I(v_1; v_2)`

should be

`(v_1, v_2) = \arg\min_{v_1, v_2} I(v_1; v_2)`.

Thanks for your amazing work again.

HobbitLong commented 4 years ago

Hey @WOWNICE, thank you very much for spotting this!

I have fixed it and will include the correction in the next version.

alldbi commented 4 years ago

Hi, thank you for the code. It is great.

Unfortunately, I can't find the part of InfoMin that learns views using the flow model and the min/max game. Is it implemented yet?

HobbitLong commented 4 years ago

Hi, @alldbi,

Thanks for your interest! The view learning experiments will be released in a separate repo later.

alldbi commented 4 years ago

@HobbitLong Thank you very much!

haohang96 commented 4 years ago

Are there official InfoMin configs (detailed args) provided for the 73.0% result? Thanks!

haohang96 commented 4 years ago

In addition, can you provide information on the training speed of InfoMin? Thanks very much!

HobbitLong commented 4 years ago

@haohang96,

> Are there official InfoMin configs (detailed args) provided for the 73.0% result?

Append `--epochs 800 --cosine` in addition to specifying `--method InfoMin`.
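For example (a hypothetical invocation: the entry-point name and any remaining flags follow the repo's README and may differ):

```bash
# Hypothetical command line; only --method, --epochs, and --cosine are
# confirmed above, the script name and other args are assumptions.
python main_contrast.py --method InfoMin --epochs 800 --cosine
```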

> In addition, can you provide information on the training speed of InfoMin?

I don't remember the exact time, but I will get back to you once I've tested it. Roughly, I think it takes about two weeks to train for 800 epochs with 8 Titan V100 GPUs.

amsword commented 4 years ago

From the paper, RandAugment is used for pre-training. Are the RandAugment parameters inherited from the supervised ImageNet classification task, or were they found by a parameter search over the pre-training + fine-tuning task?

HobbitLong commented 4 years ago

> From the paper, RandAugment is used for pre-training. Are the RandAugment parameters inherited from the supervised ImageNet classification task, or were they found by a parameter search over the pre-training + fine-tuning task?

@amsword, I did not explicitly search for it, but of course I was inspired by the setting in the original RA paper. The way I initially picked the parameters was a coin toss over [1, 2] for the number of layers and a random choice of magnitude from [5, 10, 15], which resulted in what you see in the code now.

Later on, I did run a quick validation on the ImageNet-100 subset (e.g., trying magnitudes of 5, 10, 15, 20), and it seems to me that the number of layers and the magnitude do not matter that much; what matters is simply having RandAugment at all. So I stuck with my initial random setting. But since I didn't search over the whole of ImageNet, the current setting could be sub-optimal.
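For concreteness, a pre-training transform with such a setting might look like the sketch below, using torchvision's `RandAugment` as a stand-in for the repo's own implementation (the crop and flip choices are illustrative, and the specific `num_ops`/`magnitude` values are one possible draw from the ranges above):

```python
from torchvision import transforms

# Illustrative pipeline: 2 layers at magnitude 10, one possible draw from
# the [1, 2] layers x [5, 10, 15] magnitudes described above. This uses
# torchvision's RandAugment, not the repo's implementation.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=10),
    transforms.ToTensor(),
])
```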

alceubissoto commented 4 years ago

Hi @HobbitLong! Thanks for sharing your code for the current and previous papers.

If I understood correctly, in the current paper (and also in CMC), when using the YDbDr color space the Y channel composes the first view while the Db and Dr channels compose the second. If so, I couldn't find where or how the channels are split in the code to compose each view. Could you please help me? Thanks in advance!

HobbitLong commented 4 years ago

Hi, @alceubissoto,

For the color space conversion, it's here.

For the channel splitting, it's here.

Let me know if you have further questions.
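In case it helps others, a minimal sketch of the conversion and split (using skimage's `rgb2ydbdr` as a stand-in for the repo's transform code linked above):

```python
import numpy as np
from skimage import color

def ydbdr_two_views(img):
    """Convert an RGB image to YDbDr and split it into CMC-style views.

    img: (H, W, 3) uint8 RGB array. Returns (v1, v2), where v1 is the
    luminance channel Y and v2 holds the chrominance channels Db, Dr.
    """
    ydbdr = color.rgb2ydbdr(img.astype(np.float64) / 255.0)
    v1 = ydbdr[:, :, :1]   # Y       -> view 1
    v2 = ydbdr[:, :, 1:]   # Db, Dr  -> view 2
    return v1, v2
```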

xwang-icsi commented 4 years ago

Thanks for your excellent work! I have a question regarding the linear evaluation of InfoMin. I noticed that you use RandAugment (RA) in the linear evaluation stage; were the experimental results in the InfoMin paper also produced with RA? Thank you!

HobbitLong commented 4 years ago

@xwang-icsi,

Yes. Using RA gives 72.97, while disabling RA (i.e., using NULL) gives 72.92.

xwang-icsi commented 4 years ago

Got it. Thank you for your quick response!

xwang-icsi commented 4 years ago

Also, the linear classifier is trained for 60 epochs by default, which differs from the 100-epoch schedule in NPID and MoCo. I am not sure whether a longer training schedule would bring further improvements.

bchao1 commented 4 years ago

Hi, I have some questions regarding the view learning experiments.

From my understanding, the view generator aims to minimize the MI between views, while the feature extractor aims to maximize it. However, wouldn't this simply degenerate to learned views with very low MI (so that the maximum MI the feature extractor can achieve is also low)?

If the view generator and feature extractor have opposite objectives, what is the difference between

  1. the adversarial training procedure proposed in the paper (2 passes each iteration), and
  2. simply adding a gradient reversal layer between the view generator and feature extractor (1 pass each iteration)?

Thank you!

haohang96 commented 4 years ago

The temperature in InfoMin is 0.15; could you report the performance at temperature = 0.2? How big is the gap between them?

HobbitLong commented 4 years ago

@bchao1,

> From my understanding, the view generator aims to minimize the MI between views, while the feature extractor aims to maximize it. However, wouldn't this simply degenerate to learned views with very low MI (so that the maximum MI the feature extractor can achieve is also low)?

It's possible, so there are two constraints on the generator: (1) it must be invertible, and (2) it must retain task-relevant information, which is enforced by two additional classification heads.

> If the view generator and feature extractor have opposite objectives, what is the difference between
>
>   1. the adversarial training procedure proposed in the paper (2 passes each iteration), and
>   2. simply adding a gradient reversal layer between the view generator and feature extractor (1 pass each iteration)?

The training procedure is the same as for a GAN: how you train a GAN is how you train the view learner. In other words, it's option 2.
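For readers unfamiliar with option 2, a minimal gradient reversal layer in PyTorch might look like this (an illustrative sketch, since the view learning code is not in this repo):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The module upstream (here, the view generator) receives the negated
        # gradient, so one backward pass makes it minimize the very objective
        # the encoder downstream is maximizing.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

Inserting `grad_reverse` between the generator output and the encoder input turns the two-pass min/max game into a single pass.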

HobbitLong commented 4 years ago

Hi, @haohang96,

> The temperature in InfoMin is 0.15; could you report the performance at temperature = 0.2? How big is the gap between them?

The gap is typically small for low-epoch training, e.g., 0.1~0.4 points over 100 epochs. Also, the temperature is not the point of InfoMin, so I did not ablate it specifically.

HobbitLong commented 4 years ago

> Also, the linear classifier is trained for 60 epochs by default, which differs from the 100-epoch schedule in NPID and MoCo. I am not sure whether a longer training schedule would bring further improvements.

Hi, @xwang-icsi, it is likely to give a marginal improvement.

qdmy commented 3 years ago

Hi @HobbitLong, thanks for your excellent work!

Unfortunately, I can't find the view generator mentioned in the "InfoMin" paper anywhere in this repo. Could you please provide a link to the code?

SakastLord commented 3 years ago

Hi @HobbitLong, thanks for your excellent work!

Unfortunately, I can't find the view generator mentioned in the "InfoMin" paper anywhere in this repo. Could you please provide a link to the code?

cucutone commented 3 years ago

Hi @HobbitLong, thanks for this amazing work (InfoMin)! I have a question about Sec. 4.2, on the view generator. I see the function X̂ = g(X), where X is the input image and g() is a learnable flow-based model. Does that mean no other augmentations (like crop, blur, color jitter...) are applied to the input image X, and everything is done in g()? Looking forward to your reply, and again, thanks for all your amazing work in contrastive learning!

cucutone commented 3 years ago

Hi @HobbitLong, is it possible that the learned view generator g() can transfer to other datasets? What is your opinion on transferring transformations (augmentations)? There seems to be little study of the relation between datasets and transformations. Again, much appreciation for your work in contrastive learning; I look forward to your reply.

0kuang commented 3 years ago

Hi @HobbitLong, thanks for your excellent work!

Unfortunately, I can't find the view generator mentioned in the "InfoMin" paper anywhere in this repo. Could you please provide a link to the code?

wqtwjt1996 commented 3 years ago

Hi Yonglong,

Thanks so much for the excellent work! Is there any code that computes the mutual information or the InfoNCE loss between different views? Thank you.
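For reference, a minimal InfoNCE sketch between two views might look like the following (an illustrative PyTorch implementation, not the repo's own code; note that log(N) minus this loss is a standard lower bound on I(v1; v2)):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.15):
    """InfoNCE loss between two batches of view embeddings.

    z1, z2: (N, D) features of the two views; row i of z1 and row i of z2
    form a positive pair, and the other N - 1 rows serve as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```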

usr922 commented 3 years ago

Hi @HobbitLong, thanks for your excellent work!

Unfortunately, I can't find the view generator mentioned in the "InfoMin" paper anywhere in this repo. Could you please provide a link to the code?

I hope the author will release the code for the 'view generator' to give more implementation details.