Closed: lld533 closed this issue 4 years ago
Hi,
1/ The initial learning rate is set to 0.15. That is to say, the weight decay passed as args.wd / args.weight_decay should be 1e-4 / 0.15 on ImageNet. Is that right? -- Yes, this is how we set the weight decay (a small sketch of this arithmetic follows the list below).
2/ Two lr schedules have been studied in this paper... -- The accuracy we got with step decay is higher than that of AdamW but lower than the result of AdaHessian with the plateau schedule. The reason is that step decay (i.e., decaying the lr by a factor of 10 at epochs 30 and 60) is heavily tuned for the SGD optimizer (see the scheduler sketch after this list).
3/ Could you further share the hyper parameter settings of the plateau based schedule? -- We do not use all the defaults; we set the patience to 3 (we did not tune it, and we believe that if you tune this parameter you may be able to get a better result). The exact call we use is: torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=3, verbose=True, threshold=0.001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08). Both schedules are sketched in PyTorch after this list.
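Regarding 1/, here is a minimal sketch of the arithmetic being confirmed above. The variable names and the commented-out optimizer call are illustrative assumptions, not the repo's exact training script.

```python
# Illustrative values taken from the discussion above.
base_lr = 0.15        # initial learning rate on ImageNet
weight_decay = 1e-4   # nominal weight decay

# The value passed as args.wd / args.weight_decay is the ratio wd / lr.
effective_wd = weight_decay / base_lr
print(f"weight decay argument: {effective_wd:.6f}")  # ~0.000667

# Hypothetical construction, assuming the AdaHessian optimizer follows the
# torch.optim interface (signature not verified here):
# optimizer = Adahessian(model.parameters(), lr=base_lr, weight_decay=effective_wd)
```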
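Regarding 2/ and 3/, the two schedules described above correspond to standard PyTorch schedulers. The following is a hedged sketch under that assumption, not the repo's exact script; the placeholder model, optimizer, and validate call are illustrative only.

```python
import torch

# Placeholder model/optimizer so the sketch runs; the real experiments are on
# ImageNet with the optimizer discussed in this thread.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.15, weight_decay=1e-4)

# 2/ Step decay: divide the lr by 10 at epochs 30 and 60.
step_scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

# 3/ Plateau schedule, with the exact settings quoted in the answer above.
plateau_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=3, verbose=True,
    threshold=0.001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)

for epoch in range(90):
    # train_one_epoch(model, optimizer)        # training loop omitted
    # val_top1 = validate(model, val_loader)   # hypothetical helper returning top-1 accuracy
    val_top1 = 0.0                             # placeholder so the sketch runs

    # Use one scheduler or the other in a real run, not both:
    step_scheduler.step()                      # epoch-indexed decay
    # plateau_scheduler.step(val_top1)         # halves the lr after 3 epochs without improvement
```

Because mode='max', the plateau scheduler must be fed a metric that should increase (validation top-1 accuracy here); with mode='min' it would expect a loss.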
Please let us know if you have any other questions.
Best,
Great! Many thanks!
Hello,
I'm a little confused about your experimental settings on ImageNet. Could you please clarify the following questions?
1/ The initial learning rate is set to 0.15. That is to say, the weight decay args.wd / args.weight_decay should be 1e-4 / 0.15 on ImageNet. Is that right?
2/ Two lr schedules have been studied in this paper, i.e. the step decay schedule and the plateau based schedule, but only the one that leads to the better result is reported. Regarding Fig. A.9, the plateau based schedule seems to be better than the standard step decay schedule for AdaHessian on ImageNet. May I know the best Top-1 accuracy obtained with your method using the step decay schedule? Also, could you further share the hyper parameter settings of the plateau based schedule in PyTorch? Do you use all default hyper parameters?
Many thanks!