dvgodoy / PyTorchStepByStep

Official repository of my book: "Deep Learning with PyTorch Step-by-Step: A Beginner's Guide"
https://pytorchstepbystep.com
MIT License

Clarification in Adam: why 19/1999-period MA? #30

Closed scmanjarrez closed 1 year ago

scmanjarrez commented 1 year ago

Hi, first of all, congratulations on this amazing book.

Do you mind explaining why Adam uses betas corresponding to 19- and 1999-period moving averages?

dvgodoy commented 1 year ago

Hi @scmanjarrez

Thank you :-) The betas, 0.9 (corresponding to a 19-period MA) and 0.999 (corresponding to a 1999-period MA), used by Adam in its EWMAs of the gradients are simply the defaults suggested in the original paper, "Adam: A Method for Stochastic Optimization".
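Roughly, the period names follow the usual EWMA convention alpha = 2 / (T + 1), that is, beta = 1 - alpha = (T - 1) / (T + 1). Here is a quick sketch of that correspondence (the beta_to_period helper below is just for illustration, it's not code from the book):

# Assumes the usual EWMA convention: alpha = 2 / (T + 1), so beta = (T - 1) / (T + 1)
def beta_to_period(beta):
    # Number of periods (T) of the roughly equivalent moving average
    return 2 / (1 - beta) - 1

print(beta_to_period(0.9))    # 19.0   -> beta1, the short-term average
print(beta_to_period(0.999))  # 1999.0 -> beta2, the long-term average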

Roughly speaking, the general idea is to use a short-term average to smooth the gradients, so they are not so erratic, while preserving their relative magnitude and momentum. For scaling, though, it's better to use a long-term average instead, as if we were "normalizing" the gradients for the task at hand by figuring their overall order of magnitude. Then, we're left with the choice of periods (or betas) for the two averages. The authors (see page 8 of the paper) experimented with 0 (no average) and 0.9 (19-period) for beta1 and 0.99 (199-period), 0.999 (1999-period), and 0.9999 (19999-period) for beta2 and concluded that "...good default settings for the tested machine learning problems are alpha = 0.001, beta1 = 0.9, beta2 = 0.999 and epsilon = 1e−8."
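To make the roles of the two betas concrete, here is a minimal sketch of a single Adam-style update, following the algorithm in the paper (plain NumPy, with illustrative variable names rather than the book's code): beta1 drives the EWMA that smooths the gradients (the numerator), while beta2 drives the EWMA of the squared gradients used for scaling (the denominator).

import numpy as np

def adam_step(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Smoothing: short-term EWMA of the gradients (keeps momentum, reduces noise)
    m = beta1 * m + (1 - beta1) * grad
    # Scaling: long-term EWMA of the squared gradients (overall order of magnitude)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages (t is the step number, starting at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Smoothed gradient "normalized" by its typical magnitude
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v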

I'd suggest you experiment with different values. In Chapter06.ipynb, in the "Visualizing Adapted Gradients" section, you can try different settings for the Adam optimizer and observe the results in the figure17 plot.

For example, remove smoothing from the gradients by setting beta1=0 (the resulting gradients are very flat, because we're still scaling them but momentum is lost):

optimizer = optim.Adam(model.parameters(), lr=0.1, betas=[0, 0.999])

Or use raw gradients for scaling by setting beta2=0 (the resulting gradients vary wildly, because we have momentum but no scaling):

optimizer = optim.Adam(model3.parameters(), lr=0.1, betas=[0.9, 0])

Even better, try changing the settings in the plot of the paths (in the same section of the notebook) and see how they are different :-)

import torch
import torch.nn as nn
import torch.optim as optim

# compare_optimizers, contour_data, plot_paths, train_loader, val_loader,
# x_tensor, and y_tensor all come from that same section of Chapter06.ipynb

# Generating data for the plots
torch.manual_seed(42)
model = nn.Sequential()
model.add_module('linear', nn.Linear(1, 1))
loss_fn = nn.MSELoss(reduction='mean')

# Compare SGD against Adam with beta1 = 0 (no smoothing of the gradients)
optimizers = {'SGD': {'class': optim.SGD, 'parms': {'lr': 0.1}},
              'Adam': {'class': optim.Adam, 'parms': {'lr': 0.1, 'betas': [0, 0.999]}}}
results = compare_optimizers(model, loss_fn, optimizers, train_loader, val_loader, n_epochs=10)

# Build the loss surface and plot each optimizer's path over it
b, w, bs, ws, all_losses = contour_data(x_tensor, y_tensor)
fig = plot_paths(results, b, w, bs, ws, all_losses)

Try it out, it's fun, and I hope it helps :-)

Best, Daniel

scmanjarrez commented 1 year ago

Thank you very much, Daniel. I had checked the paper, but didn't understand those plots very well. I'll play with the figure17 function and the betas to see for myself how they affect the plots.