davidrosenberg / mlcourse

Machine learning course materials.
https://davidrosenberg.github.io/ml2018

Make note on gradient being row vs column vector #2

Closed: davidrosenberg closed this issue 7 years ago

davidrosenberg commented 7 years ago

Add to the directional derivative note a discussion of whether the gradient is a row or a column vector. From the Piazza discussion:

Is the gradient a row vector or a column vector? (and does it matter?) This is indeed a confusing issue. There are standard conventions that I will explain below, and which we will follow. But if you understand the meaning of the objects in question, it won't really matter for this class.

When we talk about the derivative of f: R^d → R, we're talking about the Jacobian matrix of f. For a function mapping into R, the Jacobian is a matrix with a single row, i.e. a row vector. The gradient is then defined as the transpose of the Jacobian, and is thus a column vector.
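Spelling out those definitions in symbols:

```latex
% Jacobian of f : R^d -> R at x: a 1-by-d matrix, i.e. a row vector.
Df(x) = \left( \frac{\partial f}{\partial x_1}(x) \;\; \cdots \;\; \frac{\partial f}{\partial x_d}(x) \right)

% Gradient: the transpose of the Jacobian, a d-by-1 column vector.
\nabla f(x) = Df(x)^{\top}
```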

In the course webpage we link to Barnes's Matrix Differentiation notes as a reference. You'll notice the notes never use the word "gradient". Indeed, everything he writes there is about the derivative (i.e. the Jacobian). This is fine, as the gradient is just going to be the transpose of the relevant Jacobian.

Now an annoying thing: the other document on the website, simply called Appendix F: Matrix Calculus, uses the reverse convention. They define the Jacobian as the transpose of the one defined above, which I've found to be the standard one. Once you realize the difference is just a transpose, it's not a big deal. But it can certainly be confusing at first...

I recently found this nice website that describes how to find derivatives, but it also mentions the gradient as an aside: http://michael.orlitzky.com/articles/the_derivative_of_a_quadratic_form.php
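As a concrete instance (a standard worked example, using the same quadratic form that article treats), here is how the derivative and the gradient differ by exactly a transpose:

```latex
% f : R^d -> R,  f(x) = x^\top A x  for a fixed matrix A in R^{d x d}.
% Derivative (Jacobian): a row vector.
Df(x) = x^{\top}\,(A + A^{\top})
% Gradient: its transpose, a column vector.
\nabla f(x) = (A + A^{\top})\,x
% When A is symmetric, these reduce to 2 x^\top A and 2 A x.
```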

So now, does it matter? Well, to some people, of course it matters. But in this course, we have two primary uses for the gradient:

1. Finding the directional derivative in a particular direction. To do this, we only need to take the inner product of the gradient with the direction. If you have a row vector (i.e. the Jacobian) instead of a column vector (the gradient), it's still pretty clear what you're supposed to do. In fact, when you're programming, row and column vectors are often just represented as "vectors" rather than as matrices that happen to have only one column or one row. You then keep track yourself of whether it's a row or a column vector. (A short sketch of this appears below.)

2. Setting the gradient equal to zero to find critical points. Here it doesn't matter at all whether you have a row or a column vector (i.e. whether you're working with the gradient or the derivative).
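Here is a minimal NumPy sketch of the first point; the quadratic f and the matrix A are hypothetical, chosen only for illustration. A 1-D NumPy array is neither a row nor a column vector, so the distinction genuinely disappears in code:

```python
import numpy as np

# Hypothetical example: f(x) = x^T A x for a fixed matrix A.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

def f(x):
    return x @ A @ x

def grad_f(x):
    # Gradient of x^T A x is (A + A^T) x; as a 1-D array,
    # it is neither a row nor a column vector.
    return (A + A.T) @ x

x = np.array([1.0, -1.0])
v = np.array([3.0, 4.0])
v = v / np.linalg.norm(v)  # unit-length direction

# Directional derivative: inner product of the gradient with the direction.
dd = grad_f(x) @ v

# Finite-difference check that the formula is right.
h = 1e-6
dd_fd = (f(x + h * v) - f(x)) / h
print(dd, dd_fd)  # the two values should agree to several decimal places
```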

davidrosenberg commented 7 years ago

Put this into the course FAQ with commit a2294674e8edf98a293b022eb1f00229977a3c50.