EXJUSTICE opened this issue 5 years ago
Thanks for the detailed review!
I believe your intuition is correct. The indices p and q in the section you quoted do refer to the units of a specific layer (different subscript letters reference units in different layers). In some sections, I do instead use a vector notation which absorbs the index and treats each unit in a layer as an element of a vector. I typically try to use the convention that bold lowercase variables represent vectors (which absorb one subscript index) and bold uppercase variables represent matrices (which absorb two subscript indices).
I also want to mention, however, that I was never quite satisfied with this portion of the derivation. "Building up a matrix" by adding together contributions from many examples in this way never felt 100% intuitive to me. Why add up the contributions iteratively in this weird way, and not just give delta another index corresponding to the training example? I wrote up this first document (2017_06_21) just because I wanted to make sense of the seemingly confusing notation and method that Andrew Ng presented (without ample explanation) in his Machine Learning Coursera course. Personally, I prefer the derivation I wrote up in my second document (the 2017_09_27 one) because I think the notation is clearer and the process is better motivated.
I hope that answers your question, but please let me know if you find anything else confusing.
Thanks for the rapid response!
I see, thanks for the distinction. That makes more sense now, although personally I find the 1-2 notation easier to follow.
I find this particular section of the course aggravating, in that Andrew often leaves out multiple derivations. A large percentage of the forum responses state that "it's OK not to understand it, even Andrew didn't get it", but that just renders the point of learning moot. Hence, your document is really helpful. I haven't read your second document yet! I shall take a look.
If you don't mind, I'm going to walk through the steps here; can you correct me when I go off course? Where I have a question, it's marked with a hyphen.
The overall aim is to create a direct relationship between the gradient of the cost function and the changing weights (theta) of the network, so that the cost function can be optimized to its minimum.
Random initialization of weights.
Forward propagate. Calculate the hypothesis using all values of x, to obtain the final layer a (or h value, whatever you want to call it).
Calculate the final-layer error via delta = h - y. - This may be a single value or a vector of values for multiple classes, right?
Propagate the error backwards, creating matrices for each layer, where the individual errors for each unit are stored.
Stop propagation after layer 2, as layer 1 has no errors.
Since we know that dJ/dTheta = a(n)*delta(n+1), we now calculate contributions for each unit in each layer. Each layer gets its own matrix, holding all of the contributions.
I'm still not 100% sure what the purpose of D (capital delta) is. Does it contain the contributions for all x-values?
Then we pass the calculated values to an optimizer or to gradient descent?
I haven't done the exercises yet, but we would repeat this forward-backward loop for all m training samples, correct? The weights do not change here; that comes later, during the optimization, correct? (I've sketched this loop below.)
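Here's roughly how I picture that loop (just my own Python/numpy sketch with made-up sizes and names, not your notation, so please correct anything that's off):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up sizes: 4 inputs, 5 hidden units, 3 output classes (bias terms left out for brevity).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                # 10 fake training examples
Y = np.eye(3)[rng.integers(0, 3, size=10)]  # fake one-hot labels

Theta1 = rng.normal(size=(5, 4)) * 0.01     # step 1: random initialization
Theta2 = rng.normal(size=(3, 5)) * 0.01
Delta1 = np.zeros_like(Theta1)              # accumulation matrices start as zeros
Delta2 = np.zeros_like(Theta2)

for x, y in zip(X, Y):
    # step 2: forward propagate to get the hypothesis
    a1 = x
    a2 = sigmoid(Theta1 @ a1)
    a3 = sigmoid(Theta2 @ a2)               # h

    # step 3: final-layer error (a vector, one entry per output class)
    delta3 = a3 - y

    # steps 4-5: propagate the error back to layer 2 (layer 1 has no error)
    delta2 = (Theta2.T @ delta3) * a2 * (1 - a2)

    # step 6: accumulate this example's contribution to dJ/dTheta
    Delta2 += np.outer(delta3, a2)
    Delta1 += np.outer(delta2, a1)

# The weights themselves don't change inside this loop; dividing the accumulators by m
# gives the (unregularized) gradient that the optimizer then uses.
grad_Theta1 = Delta1 / len(X)
grad_Theta2 = Delta2 / len(X)
```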
Thanks a lot, mate!
It's been about a year and a half since I took this course, so my memory of Andrew's approach is a little hazy. But I'll try my best to remember what was going on here.
Yes, I believe that Andrew's overall aim was to create a relationship between the cost function gradient and the parameter-updating procedure, but in principle, with gradient descent this relationship should be trivial: if you have the gradient, you have your procedure. This may be a minor distinction, but in my mind, the aim is simply to differentiate the cost function that we have with respect to every parameter we want to learn. It just so happens that because a neural network is a bunch of function compositions, the gradient relies on repeated use of the chain rule. The backpropagation algorithm is our way of applying the chain rule to calculate the derivatives we want.
A single forward-backward pass is required to compute the gradient of the cost function given a set of inputs (X) and a set of parameters. After each pass, the parameters are updated, and the gradient must be recomputed. Depending on your optimization procedure, each pass can use just one example, all of your training data, or just some smaller portion of your data.
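As a trivial illustration of "if you have the gradient, you have your procedure" (a throwaway toy example, nothing to do with the course code), minimizing J(theta) = (theta - 3)^2 looks like this:

```python
theta = 0.0                        # arbitrary initialization
learning_rate = 0.1
for step in range(100):
    grad = 2 * (theta - 3.0)       # dJ/dtheta for J(theta) = (theta - 3)^2
    theta -= learning_rate * grad  # the gradient *is* the update procedure
print(theta)                       # converges toward 3.0
```

For a neural network, computing `grad` is the expensive part, and that is exactly what the forward-backward pass does.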
Thanks for the in-depth explanation.
You said that a single forward-backward pass is required to calculate dJ/dTheta. By this, do you mean to initialize them?
Just to confirm: Andrew's lecture in week 5 stated that we do the forward pass, backward pass, and calculation of the cost function for i = 1:m.
This means that we repeat this process over and over, to update theta through the updated cost function?
Also, would you mind elaborating a bit on the reasoning behind the double sum in the cost function? I cannot seem to wrap my head around it intuitively. I understand that it's the same as calculating 10 different logistic regression classifiers; would it be fair to argue that you're measuring how well your current network parameters classify each class, all at the same time?
EDIT: Figured it out and confirmed it on the forums. The resulting theta matrix contains the parameters best suited to classifying each class. We add all of the cost functions together and minimize them jointly to prevent preferential treatment of a single class, getting an "average" optimal performance for predicting any class.
∂J/∂Θ is a function of the network parameters and the network inputs, in the same way that the derivative of an arbitrary function f(x) depends both on what f looks like and on what value of x you choose to evaluate the derivative at. For the neural network cost function, it is the network architecture and parameter values that tell us what the derivative of J "looks like", and a chosen set of input training examples that tell us "where to evaluate" the derivative. Randomly initializing our parameters is how we set what J and its derivative look like initially, but this requires no forward or backward pass to "compute" (there is nothing to compute; we just randomly choose weight values). To compute the cost function's derivative, input values must be forward propagated and backpropagated through the entire network once with a fixed set of parameters.
Typically, we like to vectorize our computation to speed it up, so rather than explicitly looping over each individual training vector x(i) [a vector which has elements xj], we combine training vectors together into a "training matrix" X with elements Xij, where j refers to the input unit and i refers to the input example. This way, we pass in moderate-sized batches of training data into the network to act on all at once. When we use minibatch stochastic gradient descent, each batch gets a single forward and backward pass to compute the relevant cost function derivative. Each time a derivative is computed, the network parameters are updated. Plain stochastic gradient descent is the variation of that optimization procedure where each minibatch includes just one training example, and batch gradient descent is the variation where each minibatch includes the entirety of the training set.
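To make the batching idea concrete, here's a quick numpy sketch (with made-up sizes; nothing from the course code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # 100 training examples, 4 input units (X_ij: i = example, j = unit)
batch_size = 16                   # 1 -> plain SGD; 100 -> batch GD; anything in between -> minibatch SGD

perm = rng.permutation(len(X))    # shuffle the example indices
for start in range(0, len(X), batch_size):
    X_batch = X[perm[start:start + batch_size]]
    # one forward pass + one backward pass on X_batch, then one parameter update
    print(X_batch.shape)
```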
Your description of the double sum sounds right. Also, I'm not sure if it's just because Andrew's Machine Learning Coursera course is a bit old, but multiclass classification more often now uses the softmax function in the last layer instead, which is a generalization of the sigmoid function that is properly normalized to force the network output into a well-defined probability distribution (with node values summing to 1). That's just a minor point, though, since the general derivation doesn't assume any specific activation function.
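For reference, softmax itself is only a couple of lines (a minimal numpy sketch):

```python
import numpy as np

def softmax(z):
    """Map a vector of real-valued scores to a probability distribution."""
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))        # approx. [0.659 0.242 0.099]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0
```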
Interesting, thank you for the very detailed explanation. I remember someone mentioning that the use of vectorization saves a lot of computing power, preventing the implementation of excess for-loops.
Haha, I just remembered reading about softmax before; it resembles what we are doing here. Very interesting.
I have some questions regarding your derivation procedures, and I believe the questions hold in both the newer (DeepLearning) and older (Machine Learning) versions. I will use the Machine Learning course text you wrote for the page references. Apologies for extending this thread and taking up your time; I really appreciate it, mate.
Broadly, I don't follow how p and k are different; in your diagram, both encompass the same nodes.
On page 32, first paragraph, you start solving for dJ/dTheta. Could you elaborate more on why d(Theta_jp)/d(Theta_jk) is equal to delta_kp? Roughly, it stands for how the perturbation of the theta of one layer affects the theta of another? I'm also not sure why the error of a layer multiplied by the activation of another evaluates to a(k).
In your regularization component, I don't follow why the final equation has the factor [I - e1e1^T].
Thanks very much
No worries! I'm happy to help.
Though p and k both refer to the same layer, at any given time they could be referring to two different nodes in that layer. For instance, taking ∂zj/∂Θjk requires that we expand zj. We might guess that it should be expanded in terms of a product of Θjk and ak values, summed over all k -- but since k here is a dummy index, it can be replaced with any letter we want. Since we're taking a derivative with respect to both the j and k indices of Θ, we need to replace the k in the expansion with another index (i.e. p) to decouple the derivative from the summation. In general, if we decided to take the derivative with respect to the summation written over k instead of over p, we would automatically be excluding all possible terms where k != p.
This would be very important if there were couplings between nodes in the same layer, but because there is no such coupling, this careful consideration of indices actually doesn't matter in the end. The partial derivative ∂Θjp/∂Θjk is 1 whenever k = p and 0 whenever k != p. It is this fact that I'm representing with the Kronecker delta symbol. Unfortunately, the Kronecker delta symbol is the same as the symbol I used for the layer errors, so that could lead to some confusion (one way to tell the difference is that the Kronecker delta doesn't get a layer superscript, and it takes two indices instead of one).
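Written out explicitly (dropping the layer superscripts), the step is:

$$\frac{\partial z_j}{\partial \Theta_{jk}} = \frac{\partial}{\partial \Theta_{jk}} \sum_p \Theta_{jp}\, a_p = \sum_p \frac{\partial \Theta_{jp}}{\partial \Theta_{jk}}\, a_p = \sum_p \delta_{kp}\, a_p = a_k.$$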
Let me know if that still doesn't make sense, because it certainly didn't make much sense to me the first time I tried to reason through it when deriving the proof.
Regarding the regularization component, I multiply the matrix Θ by the matrix you mentioned just as a way to vectorize the operation which turns the first column of Θ entirely into zeros. I is the square identity matrix with ones along the diagonal and zeroes everywhere else. The vector e1 is the unit basis vector pointing entirely in direction 1. When you compute that subtraction, if I did my math right, you should get a matrix with ones in all diagonal entries except the top left entry (which is 0) with zeroes everywhere else.
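If it helps, you can check that claim with a quick numpy snippet (just an illustrative 3x3 example, not from the write-up):

```python
import numpy as np

n = 3
Theta = np.arange(1.0, n * n + 1).reshape(n, n)   # any 3x3 matrix
I = np.eye(n)
e1 = np.zeros((n, 1)); e1[0, 0] = 1.0             # unit basis vector in direction 1

M = I - e1 @ e1.T   # ones on the diagonal except the top-left entry, zeros elsewhere
print(M)
print(Theta @ M)    # same as Theta, but with its first column turned into zeros
```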
Thank you very much, mate. Can I ask about a question from the course? I wanted to understand more about the way they implement recommendation systems.
I'm linking someone else's GitHub below with the ex8 file.
On page 12, you'll see the implementation of the cost function gradient Xgrad(i, :). The problem is, I don't follow where we use the sum. There's a sum in the theoretical equation, but at the bottom of the page, the code implementation makes no mention of it. I understand that we are dealing with matrices and that, at the end, we produce a row vector (instead of a scalar value), but I'm not sure what we are "summing" here.
Even the theoretical expression, which is expanded into a column vector of dJ/dx(i) at the top of the page, does not have any sums in it. What exactly are we summing, then?
I believe that the sum is over j (the index which tells you which element in the row is being referred to), for all elements where r = 1. The notation in the derivatives equation at the top of that page is the non-vectorized version of a matrix multiplication. The code implementation at the bottom of that page implies a matrix multiplication without needing to refer to the j index.
This duality between vectorized notation and index notation is the same as in the following definition of a dot product: (u • v) = Σj (uj · vj).
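In code, that identity is just the following (a throwaway numpy example):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

index_form = sum(u[j] * v[j] for j in range(len(u)))   # sum over j of u_j * v_j
vector_form = u @ v                                     # u . v
print(index_form, vector_form)                          # both 32.0
```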
Thank you Chris,
Is there a term for this duality? I need to read up more on this and on matrix multiplication as summation; it's definitely still my weakness. I remember that dot products involve the cosine of the angle between the two vectors; is that disregarded here?
I want to use a more concrete example here if you don't mind.
Assume an x(i, :) vector of size 1x4 and a theta matrix of size 4x4. The resulting product would be a 1x4 vector, where each entry is the sum of the products of the row vector with the corresponding column vector, correct? Then we repeat for all rows of x. How would you express this as a mathematical expression?
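Here's a toy numpy version of what I mean (made-up numbers):

```python
import numpy as np

x_i = np.array([[1.0, 2.0, 3.0, 4.0]])   # x(i, :), shape 1x4
Theta = np.arange(16.0).reshape(4, 4)     # shape 4x4

product = x_i @ Theta                     # shape 1x4
# entry k of the result is the sum over j of x_i[0, j] * Theta[j, k]
by_hand = [sum(x_i[0, j] * Theta[j, k] for j in range(4)) for k in range(4)]
print(product, by_hand)                   # both [80. 90. 100. 110.]
```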
So it seems I found an issue after all, but it's probably a misunderstanding on my part!
You specify that the error delta_j is calculated by taking the difference between the activation and the y example. This makes sense, as there's usually a corresponding y example for each output class. The exact quote is: "we will use the backpropagation algorithm to compute an 'error' δ_j^(l) for each unit in each layer. In the final layer L, this error calculation is straightforward:"
With this and backpropagation, the errors of earlier layers can be found, and then an accumulation matrix is used. You write: "To do this, we'll build up a matrix for each layer, adding to it after each training example like so (starting with a matrix of zeros)"
My question concerns the role of each unit in a layer in this accumulation. You spoke of calculating individual errors for each unit of a layer, but in the accumulation I don't see the units being mentioned. Are they subscripted as p and j? Specifically, in your "errors in earlier layers" formula, the units are not specified. I assume this is because all of the nodes are intrinsically included in a vector?
Thanks