Luolc / AdaBound

An optimizer that trains as fast as Adam and as good as SGD.
https://www.luolc.com/publications/adabound/
Apache License 2.0

The provided new optimizer is sensitive to tiny batch sizes #11

Open · GreatGBL opened this issue 5 years ago

GreatGBL commented 5 years ago

The provided new optimizer is sensitive to tiny batch sizes (<4). I am testing on a very simple linear regression, and the other optimizers' performance currently looks fine.

Path: (image)

Loss curve: (image)

Zoomed loss curve: (image)

Luolc commented 5 years ago

That's very interesting. We hadn't paid attention to the impact of batch size before. Thanks for providing a new aspect to explore! 😄

Could you please provide more details of the experiments, such as the hyperparameters of each optimizer, the scale of the dataset, etc.?

GreatGBL commented 5 years ago

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from adabound import AdaBound  # pip package `adabound`; needed for the AdaBound run below

# Generate M data points roughly forming a line (noise added)
M = 50
theta_true = torch.Tensor([[0.5], [2]])
X = 10 * torch.rand(M, 2) - 5
X[:, 1] = 1.0  # second column acts as the bias term
y = torch.mm(X, theta_true) + 0.3 * torch.randn(M, 1)

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

def model(x):
    # Note: `theta` is never defined; this helper is unused and shadowed by the
    # nn.Linear model defined below.
    return x @ theta

def cost_func(theta, X, y):
    pred = torch.mm(X, theta)
    diff = pred - y
    loss = (diff ** 2).sum(0) / X.shape[0]
    return loss

# Define the training setup
batch_size = 1
num_epochs = 100
loss_fn = F.mse_loss
train_ds = TensorDataset(X, y)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
model = nn.Linear(2, 1, bias=False)

# Define a utility function to train the model
def fit(num_epochs, loss_fn, opt):
    # Reset the weights to the same starting point for every optimizer
    model.weight.data[0][0].fill_(2.00)
    model.weight.data[0][1].fill_(4.00)
    Loss = []
    Theta = np.zeros(shape=(1, 2, num_epochs))
    for epoch in range(num_epochs):
        for xb, yb in train_dl:
            pred = model(xb)
            loss = loss_fn(pred, yb)
            loss.backward()
            opt.step()
            opt.zero_grad()
        Loss.append(loss_fn(model(X), y).item())
        Theta[:, 0, epoch] = model.weight.detach().numpy()[0][0]
        Theta[:, 1, epoch] = model.weight.detach().numpy()[0][1]
    Loss = np.array(Loss)
    return Theta, Loss

ADAM_t, ADAM = fit(num_epochs, loss_fn, torch.optim.Adam(model.parameters(), lr=1e-2))
SGD_t, SGD = fit(num_epochs, loss_fn, torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0))
SGDM_t, SGDM = fit(num_epochs, loss_fn, torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9))
ADAB_t, ADAB = fit(num_epochs, loss_fn, AdaBound(model.parameters(), lr=1e-2, final_lr=0.1))

# Cost surface over a grid of parameter values
theta_0_vals = np.linspace(-2, 4, 100)
theta_1_vals = np.linspace(0, 4, 100)
J = np.zeros((len(theta_0_vals), len(theta_1_vals)))
for i, theta_0 in enumerate(theta_0_vals):
    for j, theta_1 in enumerate(theta_1_vals):
        J[i, j] = cost_func(torch.Tensor([[theta_0], [theta_1]]), X, y).item()

xc, yc = np.meshgrid(theta_0_vals, theta_1_vals)
contours = plt.contour(xc, yc, J.T, 20)  # transpose so rows follow theta_1, columns theta_0

# Optimization paths in parameter space
plot_vals = range(0, num_epochs)
plt.plot(ADAM_t[0, 0, plot_vals], ADAM_t[0, 1, plot_vals], '-.', lw=2, label='Adam')
plt.plot(SGD_t[0, 0, plot_vals], SGD_t[0, 1, plot_vals], '-.', lw=2, label='Sgd')
plt.plot(SGDM_t[0, 0, plot_vals], SGDM_t[0, 1, plot_vals], '-.', lw=2, label='Sgd+momentum')
plt.plot(ADAB_t[0, 0, plot_vals], ADAB_t[0, 1, plot_vals], '-.', lw=2, label='AdaBound')
plt.scatter(theta_true[0].numpy(), theta_true[1].numpy(), marker='*', color='red', lw=2, label='global')
plt.legend(loc='lower left')

# Loss curves: full range on top, zoomed in on the last epochs below
plt.figure()
plt.subplot(211)
plt.plot(range(ADAB.shape[0]), ADAB, '-.', lw=2, label='AdaBound')
plt.plot(range(ADAB.shape[0]), ADAM, '-.', lw=2, label='Adam')
plt.plot(range(ADAB.shape[0]), SGD, '-.', lw=2, label='Sgd')
plt.plot(range(ADAB.shape[0]), SGDM, '-.', lw=2, label='Sgd+momentum')
plt.subplots_adjust(top=2.92, bottom=0.12, left=0.15, right=2.95, hspace=0.2, wspace=0.35)
plt.legend(loc='upper right')

plt.subplot(212)
plt.plot(range(ADAB.shape[0]), ADAB, '-.', lw=2, label='AdaBound')
plt.plot(range(ADAB.shape[0]), ADAM, '-.', lw=2, label='Adam')
plt.plot(range(ADAB.shape[0]), SGD, '-.', lw=2, label='Sgd')
plt.plot(range(ADAB.shape[0]), SGDM, '-.', lw=2, label='Sgd+momentum')
plt.subplots_adjust(top=2.92, bottom=0.12, left=0.15, right=2.95, hspace=0.2, wspace=0.35)
plt.xlim((80, 100))
plt.ylim((0, 0.3))
plt.legend(loc='upper right')
plt.show()
```
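To make the batch-size dependence itself visible, the same training loop can be rerun for a few batch sizes. This is a minimal sketch that reuses `train_ds`, `model`, `fit`, and `loss_fn` from the script above and only compares final losses, not the full curves:

```python
# Sweep batch sizes and compare final training loss (sketch; reuses the
# definitions above -- fit() reads the global train_dl and resets the weights).
for bs in [1, 2, 4, 8, 16]:
    train_dl = DataLoader(train_ds, bs, shuffle=True)
    _, adab_loss = fit(num_epochs, loss_fn, AdaBound(model.parameters(), lr=1e-2, final_lr=0.1))
    _, adam_loss = fit(num_epochs, loss_fn, torch.optim.Adam(model.parameters(), lr=1e-2))
    print(f"batch_size={bs}: AdaBound final loss={adab_loss[-1]:.4f}, "
          f"Adam final loss={adam_loss[-1]:.4f}")
```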

stevenyangyj commented 5 years ago

When I used AdaBound to train a ShuffleNet V2 model with a tiny batch size (5-10), I ran into the same problem. This optimizer might not converge.

Btw: when I used `adabound.AdaBound([{'params': <part of the model's params>, 'lr': 0, ...}])` to prevent some parameters from being updated during training, I got an error message saying "cannot use lr = 0". But I can use `torch.optim.Adam([{'params': <part of the model's params>, 'lr': 0, ...}])` for the same purpose.

Is this a bug?
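For reference, this is the pattern being described; a minimal sketch with a hypothetical two-block model (the frozen/trainable split and the hyperparameters are placeholders):

```python
import torch
import torch.nn as nn
import adabound

# Hypothetical model used only to illustrate the per-group lr = 0 pattern.
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 1))

# Works with Adam: the lr = 0 group simply never moves.
opt_adam = torch.optim.Adam([
    {'params': model[0].parameters(), 'lr': 0.0},
    {'params': model[1].parameters(), 'lr': 1e-3},
])

# Per the report above, the same construction with adabound.AdaBound reportedly
# errors out ("cannot use lr = 0"):
# opt_ab = adabound.AdaBound([
#     {'params': model[0].parameters(), 'lr': 0.0},
#     {'params': model[1].parameters(), 'lr': 1e-3},
# ], final_lr=0.1)

# One workaround that avoids a zero learning rate altogether: freeze the
# parameters and pass only the trainable ones to the optimizer.
for p in model[0].parameters():
    p.requires_grad_(False)
opt_ab = adabound.AdaBound(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, final_lr=0.1)
```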

felipevw commented 4 years ago

Hi, I have been training with AdaBound on a custom dataset and faced similar issues with low batch sizes. The only doubt I have is about the comparison graph of the different optimizers in the README: I don't understand why there is an abrupt change at epoch 150. I guess that is where the optimizer switches to SGD, but why at that point? Does that mean that if I train on a dataset for 1000 epochs, it will make a similar change at epoch 750?

Thank you for the help
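For what it's worth, the switch is not a hard cut at a fixed epoch: as I read the reference PyTorch implementation (details may differ between versions), AdaBound clips each Adam-style step between a lower and an upper bound that depend on the step count t and converge smoothly toward final_lr, with gamma controlling how fast they tighten. A minimal sketch of that schedule, using the package defaults (lr=1e-3, final_lr=0.1, gamma=1e-3):

```python
def adabound_bounds(step, lr=1e-3, final_lr=0.1, gamma=1e-3, base_lr=1e-3):
    """Clipping bounds on the per-parameter step size at a given optimizer step,
    following my reading of the reference AdaBound implementation (final_lr is
    rescaled by lr / base_lr so it tracks external lr scheduling)."""
    fl = final_lr * lr / base_lr
    lower = fl * (1 - 1 / (gamma * step + 1))
    upper = fl * (1 + 1 / (gamma * step))
    return lower, upper

# The bounds start very loose (lower ~ 0, upper very large) and tighten toward
# final_lr as the step count grows -- a gradual Adam-to-SGD transition driven by
# the number of steps, not by a fixed fraction of the epoch budget.
for t in [1, 10, 100, 1_000, 10_000, 100_000]:
    lo, hi = adabound_bounds(t)
    print(f"step {t:>7}: lower={lo:.5f}, upper={hi:.5f}")
```

Since these bound functions never jump, an abrupt drop at a particular epoch in a training curve would come from the experiment's own schedule (e.g. a step learning-rate decay) rather than from AdaBound's transition itself.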