GeorgeCazenavette / mtt-distillation

Official code for our CVPR '22 paper "Dataset Distillation by Matching Training Trajectories"
https://georgecazenavette.github.io/mtt-distillation/

question about learning rate #24

Closed (harrylee999 closed this issue 1 year ago)

harrylee999 commented 1 year ago

Hi, sorry to bother you. I want to know why the lr_img learning rate is set to 1000. How did you determine 1000? Usually a learning rate is something like 0.1, 0.01, or 0.001.

Also, if I just change lr_img to 100 or a smaller value, the loss becomes NaN.

Can you explain how this works and give some advice on setting the hyperparameters? (I know the hyperparameters given now reproduce the results in the paper, but I want to distill datasets with other networks, and I think the hyperparameters have to be modified when using a different network.)

GeorgeCazenavette commented 1 year ago

Hello :)

We chose that learning rate just by searching over a coarse grid of 10^n.
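A minimal sketch of what such a coarse power-of-ten search could look like (the `run_distillation` helper and the accuracy-based selection are placeholders for one full distillation run, not the repo's actual code):

```python
# Coarse grid search over lr_img in powers of ten.
# run_distillation() is a hypothetical helper that performs one distillation
# run with the given lr_img and returns validation accuracy on the distilled set.
best_acc, best_lr = -1.0, None
for exponent in range(-2, 5):              # 0.01, 0.1, ..., 10000
    lr_img = 10.0 ** exponent
    acc = run_distillation(lr_img=lr_img)
    print(f"lr_img={lr_img:g}  val_acc={acc:.4f}")
    if acc > best_acc:
        best_acc, best_lr = acc, lr_img
print(f"best lr_img: {best_lr:g} (val_acc {best_acc:.4f})")
```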

I wouldn't put too much stock in what a learning rate usually is since a good learning rate is dependent on the scales of a bunch of different moving parts.

For your example, are you saying that your code works for 1000 but not 100?

I'm not sure why lowering the image lr would do that, unless maybe the lr of the learnable synthetic lr (the "lr lr") is too high?
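For context, the distilled pixels and the learnable synthetic learning rate are typically updated by separate optimizers, so lr_img and the "lr lr" can be tuned independently. A rough sketch of that setup, with argument names and values assumed rather than taken verbatim from the repo:

```python
import torch

# Distilled images and the learnable student learning rate are separate leaves.
image_syn = torch.randn(100, 3, 32, 32, requires_grad=True)  # distilled pixels
syn_lr = torch.tensor(0.01, requires_grad=True)               # learnable student lr

# Two optimizers: one for the pixels (lr_img), one for the synthetic lr (lr_lr).
optimizer_img = torch.optim.SGD([image_syn], lr=1000.0, momentum=0.5)  # lr_img
optimizer_lr = torch.optim.SGD([syn_lr], lr=1e-5, momentum=0.5)        # "lr lr"

# If lr_lr is too large, syn_lr can blow up the unrolled student updates and
# produce NaN losses even when lr_img itself is reduced.
```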

If you're trying to distill a new dataset, try to make sure that your expert trajectories are good; the training loss should still be consistently decreasing at each checkpoint you're using.
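A small sanity check along those lines might look like the sketch below. It assumes a trajectory is saved as a list of state_dicts and that `build_network` rebuilds the expert architecture; the repo's actual buffer format may differ.

```python
import torch
import torch.nn.functional as F

def check_trajectory(path, train_loader, build_network, device="cuda"):
    """Load each saved expert checkpoint and verify the training loss keeps decreasing."""
    trajectory = torch.load(path)            # assumed: list of state_dicts
    prev_loss = float("inf")
    for i, state_dict in enumerate(trajectory):
        net = build_network().to(device)
        net.load_state_dict(state_dict)
        net.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                total += F.cross_entropy(net(x), y, reduction="sum").item()
                n += y.size(0)
        loss = total / n
        print(f"checkpoint {i}: train loss {loss:.4f}")
        if loss > prev_loss:
            print(f"  warning: training loss increased at checkpoint {i}")
        prev_loss = loss
```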

maple-zhou commented 1 year ago

Hi, sorry to bother you. I think this is wonderful work, but I'm wondering how you implemented the grid search over the hyperparameters. Since there are quite a few parameters to set, I imagine it would take a long time to find a good setting. Thanks!