Luolc / AdaBound

An optimizer that trains as fast as Adam and as good as SGD.
https://www.luolc.com/publications/adabound/
Apache License 2.0

Nan loss in RCAN model #12

Open · Ken1256 opened this issue 5 years ago

Ken1256 commented 5 years ago

https://github.com/wayne391/Image-Super-Resolution/blob/master/src/models/RCAN.py

Just change optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, amsgrad=False) to optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

NaN loss in the RCAN model, but Adam works fine.
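For reference, the change looks roughly like this (toy model and tensors only, not the actual RCAN training script):

```python
import torch
import adabound

# Toy stand-ins; the real model/data come from the RCAN code linked above.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
criterion = torch.nn.L1Loss(reduction='sum')
lr_img = torch.randn(1, 3, 48, 48)
hr_img = torch.randn(1, 3, 48, 48)

# Before:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, amsgrad=False)
# After:
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

optimizer.zero_grad()
loss = criterion(model(lr_img), hr_img)
loss.backward()
optimizer.step()
```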

Luolc commented 5 years ago

Hi! Thanks for sharing the failure case! I will try to reproduce the result using your code. Do you know roughly how many resources it needs for training?

Ken1256 commented 5 years ago

I found that with torch.nn.L1Loss(reduction='mean') AdaBound works fine, but with torch.nn.L1Loss(reduction='sum') the loss becomes NaN. (Sorry, after double-checking the code: I had changed reduction='mean' to reduction='sum'. Adam works fine with both. Normally 'mean' and 'sum' should behave the same.)
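Just to show what I mean by "should behave the same": the two reductions only differ by the element count N (toy tensors below).

```python
import torch

pred = torch.randn(4, 3, 8, 8)
target = torch.randn(4, 3, 8, 8)

loss_mean = torch.nn.L1Loss(reduction='mean')(pred, target)
loss_sum = torch.nn.L1Loss(reduction='sum')(pred, target)

# The ratio is exactly N = pred.numel() = 768, so 'sum' only rescales the loss
# (and therefore the gradients) by a constant factor.
print(loss_sum / loss_mean, pred.numel())
```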

The resource usage depends on the image patch_size; setting args.n_resgroups = 3 and args.n_resblocks = 2 will be much faster and use less VRAM.

Luolc commented 5 years ago

Thanks for more details.

In this case, I guess AdaBound is a bit sensitive on the RCAN model, and a final_lr of 0.1 is too large. You may try smaller final_lr values such as 0.03, 0.01, or 0.003. But I am not familiar with this model and cannot guarantee it will work.
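Something like this, for example (Conv2d and train_one_run below are only stand-ins for your actual model and training loop):

```python
import torch
import adabound

def train_one_run(model, optimizer):
    """Placeholder for the RCAN training loop."""
    ...

# Hypothetical sweep over smaller final_lr values.
for final_lr in (0.03, 0.01, 0.003):
    model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for RCAN
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=final_lr)
    train_one_run(model, optimizer)
```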

Ken1256 commented 5 years ago

I tried optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=1e-4) and still get NaN loss.

Luolc commented 5 years ago

1e-4 might be too small ...

If I understand correctly, the only difference between mean and sum is a scale of N (the number of elements summed in a step). If AdaBound works with mean, then reducing the learning rate by a factor of N should work too. But I am not sure whether it should be lr, final_lr, or both. I had a discussion with my schoolmates at a seminar today about which stage of training matters more, the early stage or the final stage, but we haven't reached a clear answer yet. So for now we have to settle it through experiments.
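To make the scaling concrete, a rough sketch (the batch/patch numbers are made up, and whether to scale lr, final_lr, or both is exactly what's unclear):

```python
import torch
import adabound

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for RCAN

batch_size, channels, patch_size = 16, 3, 48           # example values only
n = batch_size * channels * patch_size * patch_size    # N: elements summed per step

# reduction='mean' reportedly works with lr=1e-4, final_lr=0.1;
# reduction='sum' makes the gradients N times larger, so one option is to divide
# the rates by N (applied to both here, since it is unclear which matters more).
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4 / n, final_lr=0.1 / n)
```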

GreatGBL commented 5 years ago

> 1e-4 might be too small ...
>
> If I understand correctly, the only difference between mean and sum is a scale of N (the number of elements summed in a step). If AdaBound works with mean, then reducing the learning rate by a factor of N should work too. But I am not sure whether it should be lr, final_lr, or both. [...]

Not exactly correct. Suppose dataset A has 101 samples and the batch size is set to 10. With reduction set to mean there is no problem. With sum, however, the last batch contains only one sample, and that effectively changes the step size for that update.
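A toy illustration with made-up tensors:

```python
import torch

l1_sum = torch.nn.L1Loss(reduction='sum')
l1_mean = torch.nn.L1Loss(reduction='mean')

full_batch = (torch.randn(10, 3, 8, 8), torch.randn(10, 3, 8, 8))
last_batch = (torch.randn(1, 3, 8, 8), torch.randn(1, 3, 8, 8))

# With 'sum', the final 1-sample batch gives a loss (and gradient) roughly 10x
# smaller than a full batch; with 'mean', both batches are on the same scale.
print(l1_sum(*full_batch).item(), l1_sum(*last_batch).item())
print(l1_mean(*full_batch).item(), l1_mean(*last_batch).item())
```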

Luolc commented 5 years ago

I believe that's a very extreme case. In expectation, a single step won't affect the whole training process.

In this case, we would see a much smaller gradient once per epoch when using sum. If that does affect training, I think the dataset is too small and SGD would fail as well.

MitraTj commented 5 years ago

Hi, I use torch version 0.3.1, and I just modified optimizer = optim.Adam(params, weight_decay=conf.l2, lr=lr, eps=1e-3) to optimizer = adabound.AdaBound(params, weight_decay=conf.l2, lr=lr, final_lr=0.1, eps=1e-3).

When I ran it, I got ImportError: torch.utils.ffi is deprecated.

Would you help? Thanks

Michael-J98 commented 4 years ago

Hi, I'm a beginner and I have a small question: AdaBound was inspired by gradient clipping, but the clipping is applied to the learning rate rather than to the gradient. Does that mean I still need to clip the gradients before they reach the optimizer, to prevent them from becoming NaN?
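I mean something like this (toy model just for illustration):

```python
import torch
import adabound

model = torch.nn.Linear(10, 1)                    # toy stand-in model
criterion = torch.nn.L1Loss(reduction='sum')
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

x, y = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# The question: is this still needed, given that AdaBound bounds the learning
# rate rather than the gradient?
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```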