EXJUSTICE opened this issue 5 years ago
Thanks for the detailed review!
I believe your intuition is correct. The indices p and q in the section you quoted do refer to the units of a specific layer (different subscript letters reference units in different layers). In some sections, I do instead use a vector notation which absorbs the index and treats each unit in a layer as an element of a vector. I typically try to use the convention that bold lowercase variables represent vectors (which absorb one subscript index) and bold uppercase variables represent matrices (which absorb two subscript indices).
I also want to mention, however, that I was never quite satisfied with this portion of the derivation. "Building up a matrix" by adding together contributions from many examples in this way never felt 100% intuitive to me. Why add up the contributions iteratively in this weird way, and not just give delta another index corresponding to the training example? I wrote up this first document (2017_06_21) just because I wanted to make sense of the seemingly confusing notation and method that Andrew Ng presented (without ample explanation) in his Machine Learning Coursera course. Personally, I prefer the derivation I wrote up in my second document (the 2017_09_27 one) because I think the notation is clearer and the process is better motivated.
I hope that answers your question, but please let me know if you find anything else confusing.
Thanks for the rapid response!
I see, thanks for the distinction. That makes more sense now, although personally I find the 1-2 notation easier to follow.
I find this particular section of the course aggravating, in that Andrew often leaves out multiple derivations. A large percentage of the forum responses state that "it's OK not to understand it, even Andrew didn't get it", but that just renders the point of learning moot. Hence, your document is really helpful. I haven't read your second document yet! I shall take a look.
If you don't mind, I'm going to walk through the steps here; can you correct me when I go off course? Where I have a question, it's marked with a hyphen.
The overall aim is to create a direct relationship between the gradient of the cost function and the changing weights (theta) of the network, so that the cost function can be optimized to its minimum.
Random initialization of weights.
Forward propagate. Calculate the hypothesis using all values of x, to obtain the final layer a (or h value, whatever you want to call it).
Calculate the final-layer error via delta = h - y. - This may be a single value or a vector of values for multiple classes, right?
Propagate the error backwards, creating matrices for each layer, where the individual errors for each unit are stored.
Stop propagation after layer 2, as layer 1 has no errors.
Since we know that dJ/dTheta = a(n)*delta(n+1), we now calculate contributions for each unit in each layer. Each layer gets its own matrix, holding all of the contributions.
I'm still not 100% sure what the purpose of D (capital delta) is. Does it contain the contributions for all x-values?
Then we pass the calculated values to an optimizer or to gradient descent?
I haven't done the exercises yet, but we would repeat this forward-backward loop for all m training samples, correct? The weights do not change here; that comes later, during the optimization, correct? (I've sketched this loop below.)
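Here's roughly how I picture that loop (just my own Python/numpy sketch with made-up sizes and names, not your notation, so please correct anything that's off):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up sizes: 4 inputs, 5 hidden units, 3 output classes (bias terms left out for brevity).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                # 10 fake training examples
Y = np.eye(3)[rng.integers(0, 3, size=10)]  # fake one-hot labels

Theta1 = rng.normal(size=(5, 4)) * 0.01     # step 1: random initialization
Theta2 = rng.normal(size=(3, 5)) * 0.01
Delta1 = np.zeros_like(Theta1)              # accumulation matrices start as zeros
Delta2 = np.zeros_like(Theta2)

for x, y in zip(X, Y):
    # step 2: forward propagate to get the hypothesis
    a1 = x
    a2 = sigmoid(Theta1 @ a1)
    a3 = sigmoid(Theta2 @ a2)               # h

    # step 3: final-layer error (a vector, one entry per output class)
    delta3 = a3 - y

    # steps 4-5: propagate the error back to layer 2 (layer 1 has no error)
    delta2 = (Theta2.T @ delta3) * a2 * (1 - a2)

    # step 6: accumulate this example's contribution to dJ/dTheta
    Delta2 += np.outer(delta3, a2)
    Delta1 += np.outer(delta2, a1)

# The weights themselves don't change inside this loop; dividing the accumulators by m
# gives the (unregularized) gradient that the optimizer then uses.
grad_Theta1 = Delta1 / len(X)
grad_Theta2 = Delta2 / len(X)
```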
Thanks a lot, mate!
It's been about a year and a half since I took this course, so my memory of Andrew's approach is a little hazy. But I'll try my best to remember what was going on here.
Yes, I believe that Andrew's overall aim was to create a relationship between the cost function gradient and the parameter-updating procedure, but in principle, with gradient descent this relationship should be trivial: if you have the gradient, you have your procedure. This may be a minor distinction, but in my mind, the aim is simply to differentiate the cost function that we have with respect to every parameter we want to learn. It just so happens that because a neural network is a bunch of function compositions, the gradient relies on repeated use of the chain rule. The backpropagation algorithm is our way of applying the chain rule to calculate the derivatives we want.
A single forward-backward pass is required to compute the gradient of the cost function given a set of inputs (X) and a set of parameters. After each pass, the parameters are updated, and the gradient must be recomputed. Depending on your optimization procedure, each pass can use just one example, all of your training data, or just some smaller portion of your data.
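As a trivial illustration of "if you have the gradient, you have your procedure" (a throwaway toy example, nothing to do with the course code), minimizing J(theta) = (theta - 3)^2 looks like this:

```python
theta = 0.0                        # arbitrary initialization
learning_rate = 0.1
for step in range(100):
    grad = 2 * (theta - 3.0)       # dJ/dtheta for J(theta) = (theta - 3)^2
    theta -= learning_rate * grad  # the gradient *is* the update procedure
print(theta)                       # converges toward 3.0
```

For a neural network, computing `grad` is the expensive part, and that is exactly what the forward-backward pass does.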
Thanks for the in-depth explanation.
You said that a single forward-backward pass is required to calculate dJ/dTheta. By this, do you mean to initialize them?
Just to confirm: Andrew's lecture in week 5 stated that we do the forward pass, backward pass, and calculation of the cost function for i = 1:m.
This means that we repeat this process over and over, to update theta through the updated cost function?
Also, would you mind elaborating a bit on the reasoning behind the double sum in the cost function? I cannot seem to wrap my head around it intuitively. I understand that it's the same as calculating 10 different logistic regression classifiers; would it be fair to argue that you're measuring how well your current network parameters classify each class, all at the same time?
EDIT: Figured it out and confirmed it on the forums. The resulting theta matrix contains the parameters best suited to classifying each class. We add all of the cost functions together and minimize them jointly to prevent preferential treatment of a single class, getting an "average" optimal performance for predicting any class.
∂J/∂Θ is a function of the network parameters and the network inputs, in the same way that the derivative of an arbitrary function f(x) depends both on what f looks like and on what value of x you choose to evaluate the derivative at. For the neural network cost function, it is the network architecture and parameter values that tell us what the derivative of J "looks like", and a chosen set of input training examples that tell us "where to evaluate" the derivative. Randomly initializing our parameters is how we set what J and its derivative look like initially, but this requires no forward or backward pass to "compute" (there is nothing to compute; we just randomly choose weight values). To compute the cost function's derivative, input values must be forward propagated and backpropagated through the entire network once with a fixed set of parameters.
Typically, we like to vectorize our computation to speed it up, so rather than explicitly looping over each individual training vector x(i) [a vector which has elements xj], we combine training vectors together into a "training matrix" X with elements Xij, where j refers to the input unit and i refers to the input example. This way, we pass in moderate-sized batches of training data into the network to act on all at once. When we use minibatch stochastic gradient descent, each batch gets a single forward and backward pass to compute the relevant cost function derivative. Each time a derivative is computed, the network parameters are updated. Plain stochastic gradient descent is the variation of that optimization procedure where each minibatch includes just one training example, and batch gradient descent is the variation where each minibatch includes the entirety of the training set.
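To make the batching idea concrete, here's a quick numpy sketch (with made-up sizes; nothing from the course code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # 100 training examples, 4 input units (X_ij: i = example, j = unit)
batch_size = 16                   # 1 -> plain SGD; 100 -> batch GD; anything in between -> minibatch SGD

perm = rng.permutation(len(X))    # shuffle the example indices
for start in range(0, len(X), batch_size):
    X_batch = X[perm[start:start + batch_size]]
    # one forward pass + one backward pass on X_batch, then one parameter update
    print(X_batch.shape)
```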
Your description of the double sum sounds right. Also, I'm not sure if it's just because Andrew's Machine Learning Coursera course is a bit old, but multiclass classification more often now uses the softmax function in the last layer instead, which is a generalization of the sigmoid function that is properly normalized to force the network output into a well-defined probability distribution (with node values summing to 1). That's just a minor point, though, since the general derivation doesn't assume any specific activation function.
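For reference, softmax itself is only a couple of lines (a minimal numpy sketch):

```python
import numpy as np

def softmax(z):
    """Map a vector of real-valued scores to a probability distribution."""
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))        # approx. [0.659 0.242 0.099]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0
```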
Interesting, thank you for the very detailed explanation. I remember someone mentioning that the use of vectorization saves a lot of computing power, preventing the implementation of excess for-loops.
Haha, I just remembered reading about softmax before; it resembles what we are doing here. Very interesting.
I have some questions regarding your derivation procedures, and I believe the questions hold in both the newer (DeepLearning) and older (Machine Learning) versions. I will use the Machine Learning course text you wrote for the page references. Apologies for extending this thread and taking up your time; I really appreciate it, mate.
Broadly, I don't follow how p and k are different; in your diagram, both encompass the same nodes.
On page 32, first paragraph, you start solving for dJ/dTheta. Could you elaborate more on why d(Theta_jp)/d(Theta_jk) is equal to delta_kp? Roughly, it stands for how the perturbation of the theta of one layer affects the theta of another? I'm also not sure why the error of a layer multiplied by the activation of another evaluates to a(k).
In your regularization component, I don't follow why the final equation has the factor [I - e1e1^T].
Thanks very much
No worries! I'm happy to help.
Though p and k both refer to the same layer, at any given time they could be referring to two different nodes in that layer. For instance, taking ∂zj/∂Θjk requires that we expand zj. We might guess that it should be expanded in terms of a product of Θjk and ak values, summed over all k -- but since k here is a dummy index, it can be replaced with any letter we want. Since we're taking a derivative with respect to both the j and k indices of Θ, we need to replace the k in the expansion with another index (i.e. p) to decouple the derivative from the summation. In general, if we decided to take the derivative with respect to the summation written over k instead of over p, we would automatically be excluding all possible terms where k != p.
This would be very important if there were couplings between nodes in the same layer, but because there is no such coupling, this careful consideration of indices actually doesn't matter in the end. The partial derivative ∂Θjp/∂Θjk is 1 whenever k = p and 0 whenever k != p. It is this fact that I'm representing with the Kronecker delta symbol. Unfortunately, the Kronecker delta symbol is the same as the symbol I used for the layer errors, so that could lead to some confusion (one way to tell the difference is that the Kronecker delta doesn't get a layer superscript, and it takes two indices instead of one).
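Written out explicitly (dropping the layer superscripts), the step is:

$$\frac{\partial z_j}{\partial \Theta_{jk}} = \frac{\partial}{\partial \Theta_{jk}} \sum_p \Theta_{jp}\, a_p = \sum_p \frac{\partial \Theta_{jp}}{\partial \Theta_{jk}}\, a_p = \sum_p \delta_{kp}\, a_p = a_k.$$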
Let me know if that still doesn't make sense, because it certainly didn't make much sense to me the first time I tried to reason through it when deriving the proof.
Regarding the regularization component, I multiply the matrix Θ by the matrix you mentioned just as a way to vectorize the operation which turns the first column of Θ entirely into zeros. I is the square identity matrix with ones along the diagonal and zeroes everywhere else. The vector e1 is the unit basis vector pointing entirely in direction 1. When you compute that subtraction, if I did my math right, you should get a matrix with ones in all diagonal entries except the top left entry (which is 0) with zeroes everywhere else.
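If it helps, you can check that claim with a quick numpy snippet (just an illustrative 3x3 example, not from the write-up):

```python
import numpy as np

n = 3
Theta = np.arange(1.0, n * n + 1).reshape(n, n)   # any 3x3 matrix
I = np.eye(n)
e1 = np.zeros((n, 1)); e1[0, 0] = 1.0             # unit basis vector in direction 1

M = I - e1 @ e1.T   # ones on the diagonal except the top-left entry, zeros elsewhere
print(M)
print(Theta @ M)    # same as Theta, but with its first column turned into zeros
```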
Thank you very much, mate. Can I ask about a question from the course? I wanted to understand more about the way they implement recommendation systems.
I'm linking someone else's GitHub below with the ex8 file.
On page 12, you'll see the implementation of the cost function gradient Xgrad(i, :). The problem is, I don't follow where we use the sum. There's a sum in the theoretical equation, but at the bottom of the page, the code implementation makes no mention of it. I understand that we are dealing with matrices and that, at the end, we produce a row vector (instead of a scalar value), but I'm not sure what we are "summing" here.
Even the theoretical expression, which is expanded into a column vector of dJ/dx(i) at the top of the page, does not have any sums in it. What exactly are we summing, then?
I believe that the sum is over j (the index which tells you which element in the row is being referred to), for all elements where r = 1. The notation in the derivatives equation at the top of that page is the non-vectorized version of a matrix multiplication. The code implementation at the bottom of that page implies a matrix multiplication without needing to refer to the j index.
This duality between vectorized notation and index notation is the same as in the following definition of a dot product: (u • v) = Σj (uj · vj).
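In code, that identity is just the following (a throwaway numpy example):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

index_form = sum(u[j] * v[j] for j in range(len(u)))   # sum over j of u_j * v_j
vector_form = u @ v                                     # u . v
print(index_form, vector_form)                          # both 32.0
```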
Thank you Chris,
Is there a term for this duality? I need to read up more on this and on matrix multiplication as summation; it's definitely still my weakness. I remember that dot products involve the cosine of the angle between the two vectors; is that disregarded here?
I want to use a more concrete example here if you don't mind.
Assume an x(i, :) vector of size 1x4 and a theta matrix of size 4x4. The resulting product would be a 1x4 vector, where each entry is the sum of the products of the row vector with the corresponding column vector, correct? Then we repeat for all rows of x. How would you express this as a mathematical expression?
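Here's a toy numpy version of what I mean (made-up numbers):

```python
import numpy as np

x_i = np.array([[1.0, 2.0, 3.0, 4.0]])   # x(i, :), shape 1x4
Theta = np.arange(16.0).reshape(4, 4)     # shape 4x4

product = x_i @ Theta                     # shape 1x4
# entry k of the result is the sum over j of x_i[0, j] * Theta[j, k]
by_hand = [sum(x_i[0, j] * Theta[j, k] for j in range(4)) for k in range(4)]
print(product, by_hand)                   # both [80. 90. 100. 110.]
```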
So it seems I found an issue after all, but it's probably a misunderstanding on my part!
You specify that the error delta_j is calculated by taking the difference between the activation and the y example. This makes sense, as there's usually a corresponding y example for each output class. The exact quote is: "we will use the backpropagation algorithm to compute an 'error' δ_j^(l) for each unit in each layer. In the final layer L, this error calculation is straightforward:"
With this and backpropagation, the errors of earlier layers can be found, and then an accumulation matrix is used. You write: "To do this, we'll build up a matrix for each layer, adding to it after each training example like so (starting with a matrix of zeros)"
My question concerns the role of each unit in a layer in this accumulation. You spoke of calculating individual errors for each unit of a layer, but in the accumulation I don't see the units being mentioned. Are they subscripted as p and j? Specifically, in your "errors in earlier layers" formula, the units are not specified. I assume this is because all of the nodes are intrinsically included in a vector?
Thanks