stevenyangyj opened this issue 5 years ago
This is not a sensible issue. Of course you can create problems that adaptive optimizers are not good at; there's no free lunch in this miserable world! This repo is for AdaBound, not a general discussion of adaptive optimizers.
Hi, LeanderK. Thanks for your comment. You may have misunderstood my purpose. There is indeed no free lunch in this world, so there is also no free lunch between exploration and exploitation for optimizers. Sometimes adaptive methods do bring good convergence speed in the early stage of training but end up with worse optimization results in the final stage. I did not impugn AdaBound or ANY adaptive method; I just gave a suggestion: if you are going to train a NN, please first try SGD with fine-tuned hyper-parameters in order to save your EXPENSIVE GPU time. The links: "On the Convergence of Adam and Beyond", "The Marginal Value of Adaptive Gradient Methods in Machine Learning"
If you read Luo's paper, you will find that the above two papers have already been cited. It's not a secret.
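For concreteness, here is a minimal sketch of what "first try SGD with fine-tuned hyper-parameters" typically looks like in PyTorch. The toy model, random data, and the momentum / weight-decay / step-decay values below are placeholders chosen for illustration, not something recommended anywhere in this thread:

import torch
import torch.nn as nn

# Toy stand-ins so the snippet runs on its own (hypothetical model and data,
# not from this thread): a small MLP on random inputs.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
x = torch.randn(256, 20)
y = torch.randint(0, 2, (256,))

# The actual point: plain SGD with tuned hyper-parameters plus a learning-rate
# schedule. The values here are common starting points and should be tuned on
# a validation set.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # decay the learning rate once per epoch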
I tested the three methods on a very simple problem and got the result shown above.
The code is posted here:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import adabound
class Net(nn.Module):
    # The class body was lost in the original post; minimal reconstruction:
    # one trainable tensor, initialised at the starting point xini, that the
    # optimizer should push towards the optimum opti.
    def __init__(self, dim):
        super(Net, self).__init__()
        self.x = nn.Parameter(torch.ones(1, dim) * 100)

    def forward(self):
        return self.x

DIM = 30
epochs = 1000
xini = torch.ones(1, DIM) * 100   # starting point (the "* 100" was eaten by markdown in the original post)
opti = torch.zeros(1, DIM) * 100  # target, i.e. the optimum

# AdaBound (the optimizer construction and the loop bodies below were lost in
# the original post; they are reconstructed as a standard training step)
lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()
optimizer = adabound.AdaBound(net.parameters(), lr=lr)
loss_adab = []
loss_adam = []
loss_sgd = []
for epoch in range(epochs):
    loss = objfun(net(), opti)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    loss_adab.append(loss.item())

# Adam
lr = 0.01
net = Net(DIM)
objfun = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
for epoch in range(epochs):
    loss = objfun(net(), opti)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    loss_adam.append(loss.item())

# SGD
lr = 0.001
net = Net(DIM)
objfun = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=lr)
for epoch in range(epochs):
    loss = objfun(net(), opti)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    loss_sgd.append(loss.item())
plt.figure()
plt.plot(loss_adab, label='adabound')
plt.plot(loss_adam, label='adam')
plt.plot(loss_sgd, label='SGD')
plt.yscale('log')
plt.xlabel('epochs')
plt.ylabel('Log(loss)')
plt.legend()
plt.savefig('camp.png', dpi=600)
plt.show()
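A possible follow-up on the same toy problem: since AdaBound is designed to transition from Adam-like to SGD-like behaviour over training, its final_lr argument is worth sweeping as well. The sketch below reuses Net, objfun, opti, DIM and epochs from the script above; the sweep values are arbitrary.

# Sweep AdaBound's final_lr on the same toy problem.
for final_lr in [0.01, 0.1, 1.0]:  # arbitrary sweep values
    net = Net(DIM)
    optimizer = adabound.AdaBound(net.parameters(), lr=0.01, final_lr=final_lr)
    losses = []
    for epoch in range(epochs):
        loss = objfun(net(), opti)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    plt.plot(losses, label='final_lr=%g' % final_lr)
plt.yscale('log')
plt.xlabel('epochs')
plt.ylabel('Log(loss)')
plt.legend()
plt.show()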