dvgodoy / PyTorchStepByStep

Official repository of my book: "Deep Learning with PyTorch Step-by-Step: A Beginner's Guide"
https://pytorchstepbystep.com
MIT License

On the calculation of "w_grad". #10

Closed minertom closed 3 years ago

minertom commented 3 years ago

In Chapter 0 and Chapter 1 there are a couple of calculations for the gradient of b and the gradient of w.

Given that yhat = b + w * x_train, where b is randomly initialized, and error = (yhat - y_train), that makes error = (b + w * x_train - y_train).

b_grad is given as 2 * error.mean(), which is 2 * (b + w * x_train - y_train).mean(), so it seems to me, and I could be wrong, that the so-called gradient of b also includes a healthy helping of w.

w_grad is given as 2 * (x_train * error).mean(), which expands out to 2 * (x_train * b + (x_train**2) * w - x_train * y_train).mean().

It is the x_train**2 term that triggered something in my mind, as well as the fact that the w_grad term also has a healthy helping of b.

My intuition, and I could be wrong here, is that there is a partial derivative missing, such that the gradient of b would be computed by differentiating with respect to b while holding w constant and, similarly, the gradient of w would be computed while holding b constant.

Also, the x_train**2 term is confusing here.
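For reference, this is roughly the computation I mean (a minimal sketch with made-up data and initialization, not the book's exact code):

```python
import numpy as np

np.random.seed(42)

# synthetic data: y = 1 + 2x + noise (made up for illustration)
x_train = np.random.rand(100, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(100, 1)

# random initialization of the parameters
b = np.random.randn(1)
w = np.random.randn(1)

# predictions and error
yhat = b + w * x_train
error = yhat - y_train

# gradients as given in the chapters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
```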

I would be deeply grateful for a clarification.

Thank You Tom

dvgodoy commented 3 years ago

Hi Tom,

In both cases, the gradient is indeed computed holding the other variable constant: for b_grad, w is kept constant, and vice-versa. Your expansions are fine; the terms you noticed, namely "w" appearing in the computation of b_grad, and both "b" and the quadratic term (x_train**2) showing up in w_grad, are just the result of choosing MSE (mean squared error) as the loss function.

Since the loss squares the error, the chain rule makes the derivatives include the error (yhat - y_train) itself: the loss is ((b + w * x_train - y_train)**2).mean(), so its partial derivative with respect to b is 2 * error.mean(), and with respect to w it is 2 * (x_train * error).mean(). That is how the "other" variable shows up in each gradient (it is needed to compute yhat), and it is also where the quadratic term comes from.

If it wasn't for the "squaring" of the error, the derivatives would be more straightforward: b_grad would be a constant (1), and w_grad would be just "x". It makes sense intuitively, because a small change in either one of the variables would be reflected in the loss directly (for b) or mediated by x (for w).
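If you want to double-check it, here is a quick sketch (not the book's code, using made-up data) comparing the manual gradients with PyTorch's autograd; the numbers should match:

```python
import numpy as np
import torch

np.random.seed(42)
x_train = np.random.rand(100, 1)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(100, 1)

# manual gradients, as in the chapters
b, w = np.random.randn(1), np.random.randn(1)
error = (b + w * x_train) - y_train
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()

# same thing, letting autograd apply the chain rule to the MSE loss
x_t, y_t = torch.as_tensor(x_train), torch.as_tensor(y_train)
b_t = torch.as_tensor(b).requires_grad_()
w_t = torch.as_tensor(w).requires_grad_()
loss = ((b_t + w_t * x_t - y_t) ** 2).mean()
loss.backward()

print(b_grad, b_t.grad.item())  # should match
print(w_grad, w_t.grad.item())  # should match
```

Autograd is applying exactly the same chain rule, which is why the error term (and, through it, the "other" variable) shows up in both gradients.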

I hope it helps! Best, Daniel

minertom commented 3 years ago

Got it! Thanks.

I clearly did not pay enough for your book. I am learning a lot from it. 😁

Regards Tom
