Open HobbitLong opened 4 years ago
Hey, amazing work!
But I think there might be a typo in proposition 4.1, where
"(v_1, v_2) = \min_{v_1, v_2}I(v_1; v_2)"
should be
"(v_1, v_2) = \arg \min_{v_1, v_2}I(v_1; v_2)".
Thanks for your amazing work again.
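For reference, in the paper's notation the corrected statement would read as below (the left-hand side presumably carries stars in the paper to denote the optimal views; I only restore the \arg as suggested above):

```latex
(v_1^*, v_2^*) = \arg\min_{v_1, v_2} I(v_1; v_2)
```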
Hey @WOWNICE, Thank you very much for spotting this!
I have fixed it and will update it to the next version.
Hi, thank you for the code. It is great.
Unfortunately, I can't find the part of InfoMin that learns views using the flow model and the min/max game. Is it implemented yet?
Hi, @alldbi,
Thanks for your interest! The view learning
experiments will be released in a separate repo later.
@HobbitLong Thank you very much!
Are there official InfoMin configs (detailed args) provided for the 73.0% result? Thanks!
In addition, can you provide speed information for training InfoMin? Thanks very much!
@haohang96,
Are there official InfoMin configs (detailed args) provided for the 73.0% result?
Appending --epochs 800 --cosine
besides specifying --method InfoMin
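Putting those flags together, a full launch command might look like the following (the script name and any other arguments are my assumptions based on typical usage of this repo, not an exact official command):

```shell
python main_contrast.py \
  --method InfoMin \
  --epochs 800 \
  --cosine
```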
In addition, can you provide speed information for training InfoMin?
I don't remember the exact time, but I will get back once I've tested it. Roughly, I think it takes about two weeks to train 800 epochs with 8 Titan V100 GPUs.
From the paper, RandAugment is used for pre-training. Are the parameters of RandAugment inherited from the ImageNet supervised classification task, or do they come from a parameter search based on the pre-training + fine-tuning task?
@amsword, I did not explicitly search for it, but of course I was inspired by the settings in the original RA paper. The way I initially got the parameters was a coin toss over [1, 2] for the number of layers, and a random choice of magnitude from [5, 10, 15], resulting in what you see in the code now.
Later on, I did do a quick validation on the ImageNet-100 subset (e.g., trying magnitudes of 5, 10, 15, 20), and it seems to me that the number of layers and the magnitude do not matter that much; what matters is that you have the augmentation at all. So I stuck with my initial random setting. But since I didn't search over the whole of ImageNet, the current setting could be sub-optimal.
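The random selection described above can be sketched as follows (a toy illustration of the sampling procedure, not code from the repo; the function name is mine):

```python
import random

def sample_ra_params(seed=None):
    """Sample RandAugment hyper-parameters as described above:
    a coin toss over [1, 2] for the number of layers, and a random
    choice of magnitude from [5, 10, 15]."""
    rng = random.Random(seed)
    num_layers = rng.choice([1, 2])
    magnitude = rng.choice([5, 10, 15])
    return num_layers, magnitude

# Example: draw one setting
layers, mag = sample_ra_params(seed=0)
```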
Hi @HobbitLong! Thanks for sharing your code for the current and previous papers.
If I understood correctly, in the current paper (and also in CMC), when using the YDbDr color space, the Y channel composes the first view, while DbDr composes the second one. If so, I couldn't find in the code where or how the channels are split to compose each view. Could you please help me? Thanks in advance!
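While I also can't point to the exact lines in the repo, the split described above would amount to something like this sketch (the function name and shape convention are my assumptions; the repo may do this inside its dataset/transform code):

```python
def split_ydbdr(img):
    """Split a channel-first YDbDr image (shape (3, H, W)) into the
    two CMC views: luminance Y as view 1, chrominance Db/Dr as view 2.
    Works on any sliceable channel-first array (nested list, NumPy,
    or torch tensor)."""
    v1 = img[0:1]   # Y channel only
    v2 = img[1:3]   # Db and Dr channels
    return v1, v2

# Toy example with a 1x1 "image"; channels are [Y, Db, Dr]
v1, v2 = split_ydbdr([[[0.5]], [[0.1]], [[0.2]]])
```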
Thanks for your excellent work! I have a question regarding the linear evaluation of InfoMin. I noticed that you are using RandAugment (RA) in the linear evaluation stage; were the experimental results in your InfoMin paper also produced with RA? Thank you!
@xwang-icsi,
Yes. Using RA gives 72.97, while muting RA (i.e., using NULL) gives 72.92.
Got it. Thank you for your quick response!
Also, the linear classifier is trained for 60 epochs by default, which differs from the 100-epoch training schedule in NPID and MoCo. I am not sure whether a longer training schedule could introduce any further improvements?
Hi, I have some questions regarding the view learning experiments.
From my understanding, the view generator aims to minimize the MI between views, and the feature extractor aims to maximize MI. However, wouldn't this simply degenerate to the learned views having very low MI? (and so the optimal MI feature extractors can maximize is also low).
If the view generator and feature extractor have opposite objectives, what is the difference between the adversarial training procedure proposed in the paper (2 passes each iteration) and simply adding a gradient reversal layer between the view generator and feature extractor (1 pass each iteration)?
Thank you!
The temperature in InfoMin is 0.15; could you report the performance at temperature = 0.2? How big is the gap between them?
@bchao1,
From my understanding, the view generator aims to minimize the MI between views, and the feature extractor aims to maximize MI. However, wouldn't this simply degenerate to the learned views having very low MI? (and so the optimal MI feature extractors can maximize is also low).
It's possible, so there are two constraints on the generator: (1) it is invertible, and (2) it retains task-relevant information via two additional classification heads.
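A hedged sketch of the resulting generator objective, in generic notation (the weight λ and the head names h₁, h₂ are my symbols, not necessarily the paper's): the generator g minimizes the estimated MI between the two views while the classification heads keep each view label-preserving:

```latex
\min_{g}\; \hat{I}\big(g_1(X);\, g_2(X)\big)
  \;+\; \lambda\Big[\mathcal{L}_{\mathrm{ce}}\big(h_1(g_1(X)),\, y\big)
  \;+\; \mathcal{L}_{\mathrm{ce}}\big(h_2(g_2(X)),\, y\big)\Big]
```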
If the view generator and feature extractor have opposite objectives, what is the difference between
- The adversarial training procedure proposed in the paper (2 passes each iteration)
- Simply adding a gradient reversal layer between the view generator and feature extractor (1 pass each iteration)?
The training procedure is the same as GAN. How you train GAN is how you train the view learner. In other words, it's option 2.
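For concreteness, a gradient reversal layer (option 2 in the question above) can be sketched in PyTorch like this (a generic GRL, not code from this repo):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward
    pass, so the module placed before it is trained adversarially
    against the loss that follows it."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()

# Toy check: d(2x)/dx = 2, reversed by the layer to -2
x = torch.ones(3, requires_grad=True)
(GradReverse.apply(x) * 2.0).sum().backward()
```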
Hi, @haohang96,
The temperature in InfoMin is 0.15; could you report the performance at temperature = 0.2? How big is the gap between them?
The gap is typically small for low-epoch training, e.g., 0.1~0.4 points for 100 epochs. Also, it's not the point of InfoMin, so I did not ablate it specifically.
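For context, the temperature τ enters the standard InfoNCE loss as the scale of the softmax over similarity scores (this is the generic contrastive form, not a formula quoted from the paper):

```latex
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}\left[\log
    \frac{\exp\!\big(s(v_1, v_2^{+})/\tau\big)}
         {\sum_{j} \exp\!\big(s(v_1, v_2^{j})/\tau\big)}\right]
```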
Also, the linear classifier is trained for 60 epochs by default, which differs from the 100-epoch training schedule in NPID and MoCo. I am not sure whether a longer training schedule could introduce any further improvements?
Hi, @xwang-icsi, you would likely get only a marginal improvement.
Hi, @HobbitLong, thanks for your excellent work!
Unfortunately, I can't find the view generator in this repo, which is mentioned in the paper "InfoMin". Could you please provide a link to the code?
Hi, @HobbitLong, thanks for this amazing work (InfoMin)! I have a question about Sec. 4.2, regarding the view generator. I see the function X_head = g(X), where X is the input image and g() is a learnable flow-based model. So are there no other augmentations applied to the input image X (like crop, blur, color jitter, ...)? Is everything done in g()? Looking forward to your reply, and again, thanks for all your amazing work on contrastive learning!
Hi, @HobbitLong, is it possible that the learned view generator g() can transfer to other datasets? What is your opinion on transferring transformations (augmentations)? There does not seem to be much study of the relation between datasets and transformations. Again, much appreciation for your work on contrastive learning; I look forward to your reply.
Hi, Yonglong:
Thanks so much for the excellent work! Is there any code that computes the mutual information or the InfoNCE loss between different views? Thank you.
I hope the author will release the code for the 'view generator', to give more implementation details.
First, my apologies that this repo currently contains much more beyond "InfoMin", and that the view learning experiments are not released here. I just realized that I should have a separate repo for "InfoMin" to host fun experiments; I will do this in the future. If you have a specific question about the "InfoMin" paper, you can leave it under this issue.