facebookresearch / madgrad

MADGRAD Optimization Method

How about CIFAR100? #14

Closed twmht closed 1 year ago

twmht commented 2 years ago

I have tuned over many trials, but the results are still much worse than SGDM on CIFAR-100.

From the paper, you did not experiment with CIFAR-100. How does MADGRAD perform on CIFAR-100?

adefazio commented 2 years ago

I haven't run any experiments on CIFAR-100. I'm swamped by NeurIPS at the moment, but I should be able to look into it in June. Have you tried a wide range of weight decay values?
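For illustration, a sweep over a wide weight decay range might look like the sketch below; the model, learning rate, and decay values here are placeholders, not settings from this thread.

import torch
from madgrad import MADGRAD

model = torch.nn.Linear(8, 2)  # stand-in for the real network
for wd in (0.0, 1e-5, 1e-4, 5e-4, 1e-3, 5e-3):
    optimizer = MADGRAD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=wd)
    # ... train and evaluate with each value, keeping the rest of the recipe fixed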

twmht commented 2 years ago

I have tried 1e-4 and 5e-4 for weight decay; the difference was minor.

I have also tried learning rates in the [5e-4, 1e-3] interval.

The accuracy range I got for ResNet18 was 72% to 74%, while the best accuracy I got with SGDM was 77.6%.

By the way, you also provide decoupled weight decay. I tried that but it was much worse, and I found the implementation differs from the original paper (https://arxiv.org/abs/1711.05101).

For example,

if decay != 0 and decouple_decay:
    # current implementation: the decay term is scaled by the full learning rate
    p.data.add_(p_old, alpha=-lr*decay)

this should be

if decay != 0 and decouple_decay:
    # suggested: scale the decay by the schedule multiplier only, as in AdamW
    p.data.add_(p_old, alpha=-lr_multiplier*decay)

where lr_multiplier is the decay factor applied to the original learning rate (i.e. the schedule multiplier, not the full learning rate). I tried making that change and got a better result, but it was still much worse than SGDM.
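For clarity, here is a minimal sketch of the AdamW-style decoupled update being suggested; lr_multiplier and base_lr are illustrative names, not identifiers from the madgrad source.

import torch

@torch.no_grad()
def decoupled_decay_step(p, p_old, decay, lr, base_lr):
    # lr_multiplier is the schedule factor eta_t = lr_t / lr_0, so the decay
    # follows the schedule but is not multiplied by the full learning rate
    lr_multiplier = lr / base_lr
    if decay != 0:
        p.data.add_(p_old, alpha=-lr_multiplier * decay)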

adefazio commented 2 years ago

Can you provide full details of the training setup? I will investigate. In particular: batch size, learning rate schedule, exact model code (PyTorch's built-in resnet18 implementation?), momentum, epochs trained, and dataset preprocessing used (4px random crop?).

twmht commented 2 years ago

Sure.

Most of the code, including the model, is adapted from mmclassification.

We use ResNet18 for the experiment.

The batch size is set to 1024 across 4 GPUs, so each GPU processes 256 images at a time. Image preprocessing includes padding and random crop, which I believe is used in most implementations (see the sketch below). The total number of epochs is 200.
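For reference, a sketch of the standard CIFAR-style pad-and-crop preprocessing being described; the exact mmclassification pipeline and normalization statistics may differ.

import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),   # pad 4px on each side, then take a random 32x32 crop
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    # approximate CIFAR-100 statistics; the actual pipeline may use different values
    T.Normalize(mean=(0.507, 0.487, 0.441), std=(0.267, 0.256, 0.276)),
])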

For SGDM, the learning rate is set to 0.4 and momentum to 0.9; we decay the learning rate by gamma=0.2 at epochs 60, 120, and 160. The accuracy is 77.6%, which I believe surpasses the baseline.

For MADGRAD, the best learning rate I found is 2.5e-3, and I tried weight decay values of 0, 1e-4, and 5e-4. For the learning rate schedule, I tried the same schedule as SGDM as well as a constant learning rate; the best accuracy I got under these settings is 74%.
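A minimal sketch of the two optimizer setups described above, assuming torchvision's resnet18 as a stand-in for the mmclassification model and the published madgrad package; the SGDM weight decay is omitted because it is not stated in this thread.

import torch
import torchvision
from madgrad import MADGRAD

model = torchvision.models.resnet18(num_classes=100)  # stand-in for the mmclassification ResNet18

# SGDM baseline: lr 0.4, momentum 0.9, decay by gamma=0.2 at epochs 60, 120, 160
sgd = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)
sgd_sched = torch.optim.lr_scheduler.MultiStepLR(sgd, milestones=[60, 120, 160], gamma=0.2)

# MADGRAD run: best lr found was 2.5e-3; weight decay swept over {0, 1e-4, 5e-4}
madgrad_opt = MADGRAD(model.parameters(), lr=2.5e-3, momentum=0.9, weight_decay=1e-4)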

adefazio commented 2 years ago

Thank you for the info; I will investigate when I have time in the coming weeks.

adefazio commented 1 year ago

I haven't been able to reproduce this in my current codebase, so I'll need to close the issue as I don't have time to investigate further. Sorry.