dvgodoy / PyTorchStepByStep

Official repository of my book: "Deep Learning with PyTorch Step-by-Step: A Beginner's Guide"
https://pytorchstepbystep.com
MIT License

Chapter 01 - negative sign for gradients #21

Closed nisargvp closed 2 years ago

nisargvp commented 3 years ago

Prior to the "Linear Regression in Numpy" section, you do not add a negative sign in front of the calculated gradients, while you do so later. I believe the later version is correct, as the gradients need to point towards the minima. Is that right?

dvgodoy commented 3 years ago

Hi,

Thank you for pointing this out! In Steps 2 and 3, I used:

# Step 2
error = (yhat - y_train)
loss = (error ** 2).mean()
# Step 3
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()

But inside the loop in Cell 1.2, I used:

    error = (y_train - yhat)  # <--- THIS SHOULD BE error = (yhat - y_train)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3 - Computes gradients for both "b" and "w" parameters
    b_grad = -2 * error.mean()               # <--- flipping the "error" above will flip these signs
    w_grad = -2 * (x_train * error).mean()   # <--- flipping the "error" above will flip these signs

Sorry about the confusion... I'll be fixing the code inside the loop to match the previous code. Both versions lead to the correct values for b and w, though, since the "minus" in the gradients cancels the flipped error.
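As a quick sanity check, here is a small NumPy sketch (with made-up data, not the book's dataset) showing that the "minus" in the gradients exactly cancels the flipped error:

```python
import numpy as np

# Hypothetical data, just for illustration
np.random.seed(42)
x_train = np.random.rand(100)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(100)
b, w = 0.5, -0.3
yhat = b + w * x_train

# Convention used inside the loop: error = (y_train - yhat), minus sign in gradients
error1 = y_train - yhat
b_grad1 = -2 * error1.mean()
w_grad1 = -2 * (x_train * error1).mean()

# Convention used in Steps 2 and 3: error = (yhat - y_train), no minus sign
error2 = yhat - y_train
b_grad2 = 2 * error2.mean()
w_grad2 = 2 * (x_train * error2).mean()

print(np.isclose(b_grad1, b_grad2), np.isclose(w_grad1, w_grad2))  # True True
```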

You're correct: the gradients need to point towards the minima, and that happens in the update of the parameters:

    b = b - lr * b_grad
    w = w - lr * w_grad

The negative sign in front of the learning rate is the one doing this... if we used b = b + lr * b_grad, we would be maximizing the loss instead.
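To make this concrete, here is a minimal self-contained training loop, using hypothetical data generated with b = 1 and w = 2 (similar in spirit to the book's setup, but not its exact dataset), showing the update stepping against the gradient until both parameters are recovered:

```python
import numpy as np

# Hypothetical data generated from b=1, w=2 plus a little noise
np.random.seed(42)
x_train = np.random.rand(100)
y_train = 1 + 2 * x_train + 0.1 * np.random.randn(100)

b, w, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    yhat = b + w * x_train                 # Step 1: forward pass
    error = yhat - y_train                 # Step 2: error for the MSE loss
    b_grad = 2 * error.mean()              # Step 3: gradients
    w_grad = 2 * (x_train * error).mean()
    b = b - lr * b_grad                    # Step 4: update AGAINST the gradient
    w = w - lr * w_grad

print(b, w)  # approximately 1 and 2
```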

Hope it helps, and thank you again for pointing this out :-)

Best, Daniel

nisargvp commented 3 years ago

Hi Daniel,

Thanks for the nice explanation. It makes sense now, glad I could help :)

Regards, Nisarg


nisargvp commented 3 years ago

Hi Daniel, I also observed the same calculation in the Autograd section, and that got me wondering: is calculating the error as (y_train_tensor - yhat) the norm for PyTorch, while the other way around might cause the gradients to explode?

Regards, Nisarg


dvgodoy commented 3 years ago

Hi,

I realized there are more instances of error = y_train_tensor - yhat down the line. I'll change them all to error = yhat - y_train_tensor for consistency, but the results won't change; they're already correct (which explains how these inconsistencies flew under my radar :-))

Let me demonstrate this. First, we'll compute gradients using backward and the first expression for the error:

torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(b, w)
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor
# Step 2 - Computes the loss
error = (y_train_tensor - yhat)
loss = (error ** 2).mean()
# Step 3 - Computes gradients for both "b" and "w" parameters
loss.backward()
print(b.grad, w.grad)

It will print out:

tensor([0.1940], device='cuda:0', requires_grad=True) tensor([0.1391], device='cuda:0', requires_grad=True)
tensor([-3.3881], device='cuda:0') tensor([-1.9439], device='cuda:0')

Now, let's change the error expression:

torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
print(b, w)
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor
# Step 2 - Computes the loss
error = (yhat - y_train_tensor)   # <-- that's the ONLY change
loss = (error ** 2).mean()
# Step 3 - Computes gradients for both "b" and "w" parameters
loss.backward()
print(b.grad, w.grad)

It prints out:

tensor([0.1940], device='cuda:0', requires_grad=True) tensor([0.1391], device='cuda:0', requires_grad=True)
tensor([-3.3881], device='cuda:0') tensor([-1.9439], device='cuda:0')

That's exactly the same thing!

When the expression is flipped, the automatic differentiation "works the math out", as I showed in my previous comment here, and the actual values of the gradients remain the same, correct ones. If this sounds a bit counterintuitive, along the lines of "how come I flip the error sign and everything stays the same?", it's because the gradient is with respect to the LOSS, not the error :-) The loss is squared, so it doesn't matter which sign the error has (either -2 or 2, for example): the loss will be 4 in this case. The parameter update will make the error smaller in absolute terms, because yhat will get closer to y_train_tensor.
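A tiny check of that point, using toy values (not the book's data): flipping the error changes its sign, but the squared loss, which is what actually gets differentiated, is identical either way:

```python
import numpy as np

# Toy values, just for illustration
y_train = np.array([1.0, 2.0, 3.0])
yhat = np.array([2.0, 1.0, 4.0])

loss1 = ((y_train - yhat) ** 2).mean()  # flipped error convention
loss2 = ((yhat - y_train) ** 2).mean()  # the book's convention
print(loss1, loss2)  # 1.0 1.0
```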

Did it help? Again, sorry for the confusion with the signs, and thanks for helping me make it more clear for future readers :-)

Best, Daniel

nisargvp commented 3 years ago

Hi Daniel, Thanks, this was helpful, but I might need some additional clarification. I understand that the gradient is w.r.t. the loss, but the partial derivatives do not have any squared terms, so how come the signs do not matter? Is it correct to understand that the autograd process "works the math out" by using a standard expression for the error, i.e. (yhat - y_train), and hence the mean(), and if it is the other way around, it recognizes a variation from the standard calculation and flips the signs?

Thanks a lot for answering my questions.

Regards, Nisarg


dvgodoy commented 3 years ago

Hi,

Sure thing, let's work these expressions out! Autograd does not use any standard expression; it works with the expressions it's given. It makes no assumptions, it simply computes the gradients using the sequence of operations. Let's play the role of autograd and use the chain rule to compute the gradients for b and w, before and after flipping y_train and yhat:

# BEFORE
error1 = (y_train - yhat)
error1 = (y_train - (b + w * x_train))
error1 = (y_train - b - w * x_train)
loss1 = (error1 ** 2).mean()

I'm using "code" to write the expressions that represent the gradients using the chain rule:

b_grad1 = (d_loss1 / d_error1) * (d_error1 / d_b)
b_grad1 = (2 * error1) * (-1)
w_grad1 = (d_loss1 / d_error1) * (d_error1 / d_w)
w_grad1 = (2 * error1) * (-x_train)

Notice that the derivatives of the error with respect to b and w have a minus sign because yhat has a minus sign in the error expression.

# AFTER
error2 = (yhat - y_train)
error2 = (b + w * x_train - y_train)
loss2 = (error2 ** 2).mean()

First, it's clear that loss1 == loss2, since it's squared. Let's use "code" to write the expressions that represent the gradients using the chain rule once again:

b_grad2 = (d_loss2 / d_error2) * (d_error2 / d_b)
b_grad2 = (2 * error2) * (1)
w_grad2 = (d_loss2 / d_error2) * (d_error2 / d_w)
w_grad2 = (2 * error2) * (x_train)

Now the minus sign is gone! So, at first glance, it looks like the gradients have the wrong sign, right? But it turns out that error2 == -error1, so, in the end, both expressions are the same. Let's work the substitution out, replacing error2 with -error1 in the second set of gradients:

b_grad2 = (2 * error2) * (1) 
        = (2 * -error1) * (1)
        = (2 * error1) * (-1)
        = b_grad1
w_grad2 = (2 * error2) * (x_train) 
        = (2 * -error1) * (x_train)
        = (2 * error1) * (-x_train)
        = w_grad1

See, they're both the same... flipping the sign only changes where the negative sign goes, either in the first derivative (d_loss/d_error) or in the second (d_error/d_b or d_error/d_w).
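The algebra above can also be double-checked numerically, by comparing the chain-rule gradients against central finite differences of the loss itself; a sketch with made-up values:

```python
import numpy as np

# Made-up values, just to check the chain-rule algebra numerically
np.random.seed(42)
x_train = np.random.rand(20)
y_train = 1 + 2 * x_train
b, w = 0.5, -0.3

def loss_fn(b, w):
    error = (b + w * x_train) - y_train
    return (error ** 2).mean()

# Gradients from the chain rule, "AFTER" convention: (2 * error2) * (1) and (2 * error2) * (x_train)
error2 = (b + w * x_train) - y_train
b_grad2 = (2 * error2 * 1).mean()
w_grad2 = (2 * error2 * x_train).mean()

# Central finite differences on the loss itself
eps = 1e-6
b_grad_num = (loss_fn(b + eps, w) - loss_fn(b - eps, w)) / (2 * eps)
w_grad_num = (loss_fn(b, w + eps) - loss_fn(b, w - eps)) / (2 * eps)

print(np.isclose(b_grad2, b_grad_num), np.isclose(w_grad2, w_grad_num))
```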

Hopefully the equivalence is more clear now :-)

Best, Daniel

nisargvp commented 3 years ago

Ah, this is so good. Yes, totally clear now; all it needed was some first-principles thinking.

Thanks Daniel :)
