Closed dddzg closed 4 years ago
Different methods (MoCo, SwAV, etc.) produce networks whose feature distributions (e.g., magnitudes) can be very different. That is why we run a grid search over learning rate and weight decay; for our network, lr=0.3 gives the best performance.
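For readers who want to reproduce this kind of search, here is a minimal sketch of a grid search over learning rate and weight decay. It is not the code used in this repo: `grid_search`, `toy_evaluate`, and the candidate values are hypothetical, and `toy_evaluate` only stands in for "train a linear classifier on frozen features and return validation accuracy".

```python
from itertools import product


def grid_search(evaluate, lrs, weight_decays):
    """Return (best_score, best_lr, best_wd) over all (lr, wd) pairs."""
    best = None
    for lr, wd in product(lrs, weight_decays):
        score = evaluate(lr, wd)  # e.g. validation top-1 of the linear probe
        if best is None or score > best[0]:
            best = (score, lr, wd)
    return best


def toy_evaluate(lr, wd):
    # Hypothetical stand-in for a real training run; by construction it
    # scores highest at lr=0.3, wd=1e-6 so the search has something to find.
    return -abs(lr - 0.3) - abs(wd - 1e-6)


best_score, best_lr, best_wd = grid_search(
    toy_evaluate,
    lrs=[0.03, 0.3, 3.0, 30.0],      # assumed candidate learning rates
    weight_decays=[0.0, 1e-6, 1e-4],  # assumed candidate weight decays
)
print(best_lr, best_wd)
```

In practice `evaluate` would wrap a full linear evaluation run, and the search would be repeated separately for each set of pre-trained weights.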
Wow, thanks for your response. I am still surprised, though, that there is a 100x learning rate gap in the linear classification experiments.
This is not that surprising given that the two methods are trained with a different loss, different optimizer, different learning rate, different weight decay, etc. There is no reason that the subsequent weight distributions should match.
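To make the link between feature magnitude and learning rate concrete, here is a toy sketch, not a measurement on MoCo or SwAV features. For a linear layer trained by plain gradient descent from zero initialization without weight decay, scaling the inputs by a factor `c` and the learning rate by `1 / c**2` yields the same predictions at every step. The data, `c = 0.1`, and the `train_linear` helper are all assumptions chosen for illustration.

```python
def train_linear(xs, ys, lr, steps=50):
    """Full-batch gradient descent on mean squared error for y ~ w * x."""
    w = 0.0  # zero init; the equivalence below relies on this
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
c = 0.1  # hypothetical: features that are 10x smaller in magnitude
xs_scaled = [c * x for x in xs]

# Baseline: original features, lr = 0.01.
w_a = train_linear(xs, ys, lr=0.01)
# Scaled features with lr multiplied by 1 / c**2 (100x here).
w_b = train_linear(xs_scaled, ys, lr=0.01 / c**2)
# Scaled features with the lr left unchanged.
w_c = train_linear(xs_scaled, ys, lr=0.01)

pred_a = [w_a * x for x in xs]
pred_b = [w_b * x for x in xs_scaled]
pred_c = [w_c * x for x in xs_scaled]
print(pred_a, pred_b, pred_c)
```

Here `pred_a` and `pred_b` agree up to floating-point error, while `pred_c` is still far from the targets after the same number of steps. With weight decay or a different initialization the equivalence is only approximate, which is one more reason to tune both hyperparameters per method.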
Thanks again for your response. Does this mean we should be careful when comparing linear classification results across different pre-trained models? For example, in Table 6 of the SwAV paper there are gaps of about 4% and 10% top-1 between MoCo v2 and SwAV on Places205 and iNat18. However, in our experiments, we find that the MoCo weights perform badly on ImageNet linear classification with a low lr.
Were the linear classification results in Table 6 obtained with the same lr for SwAV and MoCo?
Each method performs its own learning rate grid search to find the best learning rate.
Thank you so much!
Thanks for your awesome work. I wonder why the learning rate is so small for linear classification (0.3 in eval_linear.py). In MoCo's linear classification, the initial learning rate is 30 with a two-stage reduction, a 100x difference from this repo. Have you ever run eval_linear.py with MoCo v2 weights, or run SwAV weights with the MoCo code? I wonder how much the lr affects performance.
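For reference, a "two-stage reduction" of this kind can be sketched as a step schedule. The milestone epochs `(60, 80)` and the decay factor `0.1` below are assumptions for illustration and are not stated in this thread; `step_lr` is a hypothetical helper, not a function from either repo.

```python
def step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate at `epoch` when base_lr is multiplied by gamma at each milestone."""
    drops = sum(1 for m in milestones if epoch >= m)  # milestones already passed
    return base_lr * gamma ** drops


# Assumed milestones: the rate is reduced twice over the run.
for epoch in (0, 59, 60, 80):
    print(epoch, step_lr(30.0, epoch, milestones=(60, 80)))
```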