I haven't run any experiments on CIFAR-100. I'll look into it. I'm swamped by NeurIPS at the moment, but I should be able to get to it in June. Have you tried a wide range of weight decay values?
I have tried 1e-4 and 5e-4 for weight decay; the difference was minor.
I have also tried learning rates in the interval [5e-4, 1e-3].
The accuracy I got for ResNet-18 was in the 72% to 74% range, while the best accuracy I got with SGDM was 77.6%.
By the way, you also provide decoupled weight decay. I tried it, but the results were much worse, and I found that the implementation differs from the original paper (https://arxiv.org/abs/1711.05101).
For example,
```python
if decay != 0 and decouple_decay:
    p.data.add_(p_old, alpha=-lr * decay)
```
this should be
```python
if decay != 0 and decouple_decay:
    p.data.add_(p_old, alpha=-lr_multiplier * decay)
```
where `lr_multiplier` is the schedule's decay factor applied to the original learning rate (i.e., the current learning rate divided by the base learning rate). I tried making that change and got a better result, but it was still much worse than SGDM.
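To make the distinction concrete, here is a minimal sketch of the two update rules as I understand them; `base_lr` and `apply_weight_decay` are illustrative names, not the actual madgrad source:

```python
# Illustrative sketch, not the actual madgrad source code.
# Coupled decay scales the weight-decay step by the full learning rate,
# while decoupled decay (AdamW, arXiv:1711.05101) scales it only by the
# schedule multiplier lr / base_lr, keeping it independent of the base LR.
def apply_weight_decay(p, p_old, lr, base_lr, decay, decouple_decay):
    if decay != 0 and decouple_decay:
        lr_multiplier = lr / base_lr  # schedule decay factor only
        p.data.add_(p_old, alpha=-lr_multiplier * decay)
    elif decay != 0:
        p.data.add_(p_old, alpha=-lr * decay)  # coupled (L2-style) decay
```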
Can you provide full details of the training setup? I will investigate. In particular: batch size, learning rate schedule, exact model code (PyTorch's built-in ResNet-18 implementation?), momentum, number of epochs trained, and dataset preprocessing used (4px random crop?).
Sure.
Most of the code, including the model, is adapted from mmclassification.
We use ResNet-18 for the experiment.
The batch size is 1024 across 4 GPUs, with each GPU processing 256 images at a time. Image preprocessing includes random crop with padding, which I believe is used in most implementations. The total number of epochs is 200.
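Concretely, the preprocessing corresponds roughly to the standard CIFAR recipe below (a torchvision sketch; the exact normalization constants in mmclassification may differ):

```python
import torchvision.transforms as T

# Standard CIFAR training augmentation: pad 4px, random 32x32 crop,
# random horizontal flip, then per-channel normalization.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=(0.5071, 0.4865, 0.4409),  # commonly used CIFAR-100 means
                std=(0.2673, 0.2564, 0.2762)),  # commonly used CIFAR-100 stds
])
```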
For SGDM, the learning rate is set to 0.4 and momentum to 0.9; we decay the learning rate by gamma=0.2 at epochs 60, 120, and 160. The accuracy is 77.6%, so I believe we have surpassed the baseline.
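In PyTorch terms, the SGDM setup is roughly the following (weight decay shown at 5e-4, one of the values tried above; `model` and `train_one_epoch` are placeholders):

```python
import torch

# 'model' is the ResNet-18 from mmclassification (not shown here).
optimizer = torch.optim.SGD(model.parameters(), lr=0.4,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 160], gamma=0.2)

for epoch in range(200):
    train_one_epoch(model, optimizer)  # placeholder for the actual loop
    scheduler.step()
```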
For MADGRAD, the best learning rate I found is 2.5e-3, and I tried weight decay values of 0, 1e-4, and 5e-4. For the learning rate scheduler, I tried the same schedule as for SGDM and also a constant learning rate. The best accuracy I got under these settings is 74%.
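For reference, the best MADGRAD configuration corresponds roughly to the sketch below (momentum left at the library default of 0.9; weight decay was swept over {0, 1e-4, 5e-4}):

```python
from madgrad import MADGRAD

# Best configuration found so far; constant-LR variant shown.
# 'model' is the same ResNet-18 as in the SGDM setup above.
optimizer = MADGRAD(model.parameters(), lr=2.5e-3,
                    momentum=0.9, weight_decay=5e-4)
```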
Thank you for the info; I will investigate when I have time in the coming weeks.
I haven't been able to reproduce this in my current codebase, so I'll need to close the issue as I don't have time to investigate further. Sorry.
I have tuned it over many trials, but it is still much worse than SGDM on CIFAR-100.
The paper does not include any CIFAR-100 experiments. How does MADGRAD perform on CIFAR-100?