Thanks for this great work. Recently, we tried to train ResNeXt-50 on ImageNet classification using AdaHessian. The implementation we used is from https://github.com/davda54/ada-hessian.
However, we observed some weird behavior. Please see the training log:
During the first 6 epochs, AdaHessian worked well. From the 7th epoch onward, the training loss still decreased normally, but the test loss increased and the test accuracy dropped rapidly. We have tried several hyper-parameter settings and different random seeds, but this always happens.
We provide the details of our setup below for reference.
The ResNeXt-50 implementation is the standard one in PyTorch. Training is performed across 8 V100 GPUs, with a total batch size of 256 (32 per GPU).
We searched over the following hyper-parameters: lr in {0.1, 0.15}, eps in {1e-2, 1e-4}, weight decay in {1e-4, 2e-4, 4e-4, 8e-4, 1e-3}. For all other hyper-parameters, we used the default values.
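For concreteness, the grid above amounts to 2 × 2 × 5 = 20 configurations. A minimal sketch of how we enumerate them (variable names are ours, not from the AdaHessian repo):

```python
from itertools import product

# Hyper-parameter values taken from the search described above.
lrs = [0.1, 0.15]
epss = [1e-2, 1e-4]
weight_decays = [1e-4, 2e-4, 4e-4, 8e-4, 1e-3]

# Build one dict per configuration; each dict can be passed as
# keyword arguments to the optimizer constructor.
grid = [
    {"lr": lr, "eps": eps, "weight_decay": wd}
    for lr, eps, wd in product(lrs, epss, weight_decays)
]

print(len(grid))  # → 20
```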
We also applied a linear warmup of the learning rate over the first 100 steps; otherwise, AdaHessian crashed at the very beginning of training.
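To be precise about the warmup schedule, here is a minimal sketch of what we mean by linear warmup (the function name and exact step convention are ours, not from the AdaHessian repo): the base learning rate is scaled linearly from near zero up to its full value over the first 100 optimizer steps, then held constant.

```python
def warmup_lr(base_lr, step, warmup_steps=100):
    """Linearly scale the learning rate during the first `warmup_steps`
    optimizer steps (step is 1-indexed); return base_lr afterwards."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# Example: with base_lr = 0.15, the LR at step 50 is half the base value.
print(warmup_lr(0.15, 50))   # → 0.075
print(warmup_lr(0.15, 500))  # → 0.15
```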