In the notebook named "How does a neural net really work?", there is a point where the parameters of a parabola are found using gradients.
There is a cell with this content:
for i in range(10):
    loss = quad_mae(abc)
    loss.backward()
    with torch.no_grad(): abc -= abc.grad*0.01
    print(f'step={i}; loss={loss:.2f}')
If you run this loop for more than 10 iterations, the loss starts growing again.
In the text, it's said that this is because the learning rate must be progressively decreased in practice.
In my opinion, this happens because every time loss.backward() is executed the gradients are accumulated in abc.grad rather than recomputed from scratch. If the gradients are reset to zero after each iteration, the loop converges to a minimum:
------------------------------------- Proposed code -------------------------------------------
for i in range(10):
    loss = quad_mae(abc)
    loss.backward()
    with torch.no_grad():
        abc -= abc.grad*0.01
        abc.grad.zero_()  # new line: reset the accumulated gradients
    print(f'step={i}; loss={loss:.2f}')
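For what it's worth, the accumulation behaviour is easy to check in isolation. Here is a small standalone sketch (the tensor x is just a throwaway example, not taken from the notebook) showing .grad growing across repeated backward() calls:

import torch

x = torch.tensor(2.0, requires_grad=True)

(x * x).backward()
print(x.grad)   # tensor(4.) -- d(x^2)/dx at x = 2

(x * x).backward()
print(x.grad)   # tensor(8.) -- the second gradient was added to the first

This is also why training loops built around torch.optim optimizers call optimizer.zero_grad() once per step.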
Let me conclude by congratulating you on this very clear explanation.
Regards