Ken1256 opened this issue 5 years ago
Hi! Thanks for sharing the failure case! I will try to reproduce the result using your code. Do you know how many resources it needs for training?
I found that with torch.nn.L1Loss(reduction='mean') AdaBound works fine, but with torch.nn.L1Loss(reduction='sum') I get NaN loss. (Sorry, after double-checking the code I realized I had changed reduction='mean' to reduction='sum'. Adam works fine with both. Normally 'mean' and 'sum' should behave the same.)
The resource usage depends on the image patch_size; setting args.n_resgroups = 3 and args.n_resblocks = 2 will be much faster and use less VRAM.
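For reference, a minimal sketch of the two reduction settings compared above, with random tensors standing in for the real SR patches: with 'sum' the loss, and therefore the gradients, are larger than with 'mean' by a factor of the number of elements in the batch.

```python
import torch

# Hypothetical SR patch batch; the shapes are placeholders, not the real data.
pred = torch.randn(16, 3, 48, 48, requires_grad=True)
target = torch.randn(16, 3, 48, 48)

loss_mean = torch.nn.L1Loss(reduction='mean')(pred, target)
loss_sum = torch.nn.L1Loss(reduction='sum')(pred, target)

# 'sum' differs from 'mean' by the element count, so the gradient magnitude
# (and hence the effective step size) scales up by the same factor.
print(loss_mean.item(), loss_sum.item(), loss_sum.item() / pred.numel())
```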
Thanks for the additional details.
In this case, I guess AdaBound is a bit sensitive on the RCAN model, and a final_lr
of 0.1 is too large. You may try a smaller final_lr
like 0.03, 0.01, or 0.003. But I am not familiar with this model and cannot be sure it will work.
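A minimal sketch of such a sweep, assuming the adabound package from this repo; TinyNet and the random data are just stand-ins for the real RCAN model and training loop, and the only thing varied is final_lr.

```python
import torch
import adabound

class TinyNet(torch.nn.Module):  # hypothetical stand-in for RCAN
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x):
        return self.conv(x)

criterion = torch.nn.L1Loss(reduction='sum')  # the failing configuration
for final_lr in (0.03, 0.01, 0.003):
    model = TinyNet()
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=final_lr)
    for step in range(100):
        x, y = torch.randn(16, 3, 48, 48), torch.randn(16, 3, 48, 48)
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if not torch.isfinite(loss):
            print(f'final_lr={final_lr}: loss became NaN/Inf at step {step}')
            break
```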
I tried optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=1e-4)
and still get NaN loss.
1e-4 might be too small ...
If I understand correctly, the only difference between mean and sum is a scale of N (the number of elements summed per step). If AdaBound can work with mean, then reducing the learning rate by a factor of N should work too. But I am not sure whether it should be lr or final_lr or both. I just had a discussion with my schoolmates at a seminar today about which stage is more important in training, the early stage or the final stage, but we haven't come to a clear answer yet. So we have to test it through experiments for now.
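A rough sketch of that scaling, assuming N is the number of elements reduction='sum' adds up per step and the base settings are the ones that behave with 'mean'; whether lr, final_lr, or both need the 1/N factor is exactly the open question here.

```python
import torch
import adabound

# Hypothetical shapes for one training step; N is the element count that
# reduction='sum' adds up, i.e. the scale factor between 'sum' and 'mean'.
batch, channels, patch = 16, 3, 48
N = batch * channels * patch * patch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the real model

base_lr, base_final_lr = 1e-4, 0.1  # settings that work with 'mean'
optimizer = adabound.AdaBound(
    model.parameters(),
    lr=base_lr / N,              # variant: scale lr by 1/N ...
    final_lr=base_final_lr / N,  # ... and/or also scale final_lr
)
```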
Not exactly correct. Suppose dataset A has 101 samples and the batch size is set to 10. If we set the reduction to mean, there is no problem. But with sum, the last batch contains only one sample, which effectively changes the learning rate for that step.
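A quick illustration of that point, using 101 random samples and a batch size of 10: the last batch holds a single sample, so with reduction='sum' its loss (and gradient) is roughly a tenth of the others.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(101, 3, 48, 48), torch.randn(101, 3, 48, 48))
loader = DataLoader(data, batch_size=10, drop_last=False)

criterion = torch.nn.L1Loss(reduction='sum')
for x, y in loader:
    # Comparing inputs to targets directly just to show the loss magnitude;
    # the final batch of size 1 yields roughly a tenth of the other values.
    print(x.size(0), round(criterion(x, y).item(), 1))
```

Setting drop_last=True on the DataLoader would sidestep that short final batch.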
I believe that's a very extreme case. Generally, a single step won't affect the whole training process, in expectation.
In this case, we would encounter a much smaller gradient once per epoch when using sum. If that actually affects training, I think the dataset is too small and SGD would fail as well.
Hi, I use torch version 0.3.1, and I just modified optimizer = optim.Adam(params, weight_decay=conf.l2, lr=lr, eps=1e-3) to optimizer = adabound.AdaBound(params, weight_decay=conf.l2, lr=lr, final_lr=0.1, eps=1e-3).
When I ran it I hit raise ImportError("torch.utils.ffi is deprecated").
Would you help? Thanks
Hi, I'm a beginner, and I have a small question: AdaBound was inspired by gradient clipping, but the clipping happens on the learning rate rather than the gradient. Does that mean I still need to clip the gradients before feeding them into the optimizer, to prevent them from becoming NaN?
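For what it's worth, a minimal sketch of clipping gradients before the optimizer step; the tiny model, loss, and random data are placeholders for the real setup. Since AdaBound bounds the per-parameter step size rather than the gradient itself, an explicit clip such as torch.nn.utils.clip_grad_norm_ is still the usual safeguard against exploding gradients.

```python
import torch
import adabound

# Placeholders so the snippet runs; in practice this is your model, loss, and data.
model = torch.nn.Conv2d(3, 3, 3, padding=1)
criterion = torch.nn.L1Loss(reduction='sum')
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)
inputs, targets = torch.randn(4, 3, 48, 48), torch.randn(4, 3, 48, 48)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
# Clip the gradient norm before the step; AdaBound only bounds the
# learning rate, so this guards against exploding gradients separately.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```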
https://github.com/wayne391/Image-Super-Resolution/blob/master/src/models/RCAN.py
Just change
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, amsgrad=False)
to
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)
NaN loss in the RCAN model, but Adam works fine.