amirgholami / adahessian

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

Settings on ImageNet #2

Closed · lld533 closed this issue 4 years ago

lld533 commented 4 years ago

Hello,

I'm a little confused by your experimental settings on ImageNet. Could you please clarify the following questions?

1/ The initial learning rate is set to 0.15. That is to say, the weight decay (`args.wd` / `args.weight_decay`) is set to 1e-4 / 0.15 on ImageNet. Is that right?

2/ Two lr schedules are studied in the paper, i.e. the step decay schedule and the plateau-based schedule, but only the one that leads to the better result is reported. Regarding Fig. A.9, the plateau-based schedule seems to be better than the standard step decay schedule for AdaHessian on ImageNet. May I know the best Top-1 accuracy obtained with your method using the step decay schedule? Also, could you share the hyperparameter settings of the plateau-based schedule in PyTorch? Do you use all the default hyperparameters?

Many thanks!

yaozhewei commented 4 years ago

Hi,

1/ The initial learning rate is set to 0.15. That is to say, the weight decay (`args.wd` / `args.weight_decay`) is set to 1e-4 / 0.15 on ImageNet. Is that right? -- Yes, this is how we set the weight decay.
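
For concreteness, here is a minimal sketch of passing that value when constructing the optimizer. The `Adahessian` import path and constructor signature below are assumptions based on this repo, and the tiny model is only a placeholder; only the arithmetic (wd / lr = 1e-4 / 0.15) comes from the answer above.

```python
import torch
from optim_adahessian import Adahessian  # assumed import path from this repo

lr = 0.15       # initial learning rate on ImageNet
base_wd = 1e-4  # the "usual" weight decay value

model = torch.nn.Linear(10, 10)  # placeholder model
# Pass wd / lr = 1e-4 / 0.15 as the weight_decay argument, as confirmed above.
optimizer = Adahessian(model.parameters(), lr=lr, weight_decay=base_wd / lr)
```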

2/ Two lr schedules have been studied in this paper... -- The accuracy we got with step decay is higher than AdamW's but worse than the result of AdaHessian with plateau decay. The reason is that step decay (i.e., decaying the lr by a factor of 10 at epochs 30 and 60) is heavily tuned for the SGD optimizer.
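
For reference, that step-decay schedule (lr divided by 10 at epochs 30 and 60) corresponds to the standard PyTorch `MultiStepLR` scheduler. The sketch below uses a toy model and plain SGD just to stay self-contained; it is not the exact training script.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.15)

# Decay the lr by a factor of 10 at epochs 30 and 60.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1
)

for epoch in range(90):
    # ... train for one epoch ...
    scheduler.step()
```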

3/ Could you further share the hyperparameter settings of the plateau-based schedule? -- We do not use all the defaults; we set the patience to 3 (we have not tuned it yet, and we believe that if you tune this parameter you may get a better result). The exact call we use is: `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=3, verbose=True, threshold=0.001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)`
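
A minimal sketch of wiring that scheduler into a training loop is below. The toy model, the plain SGD optimizer, and the placeholder validation accuracy are not from the paper; only the scheduler arguments come from the call quoted above.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.15)  # placeholder optimizer

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=3, verbose=True,
    threshold=0.001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08
)

for epoch in range(90):
    # ... train for one epoch, then evaluate ...
    val_top1 = 0.0  # placeholder for the measured Top-1 accuracy
    scheduler.step(val_top1)  # mode='max': lr is halved when accuracy plateaus
```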

Please let us know if you have any other questions.

Best,

lld533 commented 4 years ago


Great! Many thanks!