fzliu / style-transfer

An implementation of "A Neural Algorithm of Artistic Style" by L. Gatys, A. Ecker, and M. Bethge. http://arxiv.org/abs/1508.06576.

Issue about backpropagation #26

Open · jiawei357 opened this issue 8 years ago

jiawei357 commented 8 years ago

Hi, is there any specific reason that you did the backpropagation one layer at a time?

fzliu commented 8 years ago

Since we can't write custom GPU layers in Caffe using Python, the only way to compute losses and gradients at certain layers is to grab the activations and compute them using numpy. If you'd like faster backprop, you can try the gram-layer branch, which does the full forward & backward pass on the GPU, but requires an extra Caffe layer written in C++.
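Concretely, the layer-at-a-time pattern looks roughly like the sketch below. This is a simplified paraphrase of what style_optfn does, not the actual code: the per-layer loss and gradient are collapsed into a placeholder function, a forward pass is assumed to have been run already, and the input blob is assumed to be named "data".

```python
import numpy as np

def layer_loss_and_grad(activations):
    """Placeholder for the per-layer style/content loss and gradient -- not the real code."""
    loss = float((activations ** 2).sum())
    return loss, 2.0 * activations

def backprop_layer_by_layer(net, loss_layers):
    """`loss_layers` is ordered deepest-first, e.g. ["conv5_1", "conv4_1", ..., "conv1_1"]."""
    total_loss = 0.0
    for i, layer in enumerate(loss_layers):
        # `grad` is a numpy view of the blob's diff, so the in-place update below
        # writes straight into the memory that net.backward() reads from.
        grad = net.blobs[layer].diff[0]
        loss, g = layer_loss_and_grad(net.blobs[layer].data[0])
        total_loss += loss
        grad += g

        # propagate everything accumulated so far down to the next loss layer,
        # or all the way to the input on the last iteration
        next_layer = None if i == len(loss_layers) - 1 else loss_layers[i + 1]
        net.backward(start=layer, end=next_layer)

    # gradient with respect to the input image
    return total_loss, net.blobs["data"].diff[0].copy()
```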

hermitman commented 8 years ago

I have a question that may just be due to my lack of programming knowledge.

In lines 191 and 198, the grad variable is updated with the computed gradient. However, grad seems to have no influence on the net.backward() call, since it is not used to update the network. Finally, in line 205, grad is reset to the diff of the next layer, which discards the previous grad computation.

I am confused about this part of the code. Could you help me understand it? I am not very fluent with Python, and that might be why I am lost here.

Thanks,

jiawei357 commented 8 years ago

He uses grad as a kind of pointer: it is a view into the layer's diff blob, so updating it in place updates the gradient stored in the network, and that updated gradient is then used for backpropagation through the next layer.
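In numpy terms, the idea is roughly this (a standalone illustration, not code from style.py):

```python
import numpy as np

blob_diff = np.zeros((1, 3, 4))           # stands in for net.blobs[layer].diff
grad = blob_diff[0]                       # indexing returns a view, not a copy
grad += 1.0                               # in-place update through the view

print(blob_diff.sum())                    # 12.0 -- the underlying diff buffer changed
print(np.shares_memory(grad, blob_diff))  # True
```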

hermitman commented 8 years ago

@jiawei357

Hi, thanks for the response. I thought about this explanation, which suggests that grad is a pointer to net.blobs[layer].diff[0]. However, I found two things that I do not understand:

  1. I used id([variable]) to check the memory address of grad after assigning grad = net.blobs[layer].diff[0], and compared it against id(net.blobs[layer].diff[0]); the two ids are not the same. (Is this a problem related to CPU/GPU addresses?)

  2. Is it OK to inject an additional gradient at a layer by adding the gradient computed from the local loss to the backpropagated gradient?

Thanks for the answer, I think the results are fine, but I am just curious about the code.

jiawei357 commented 8 years ago

Well, for the first thing, I'm not sure why that happens. What I did in my project is make a new custom loss layer (Euclidean for the content loss and the Gram-matrix one for the style loss). In that custom layer I add the extra gradient during the backward pass, and Caffe takes care of the rest of the backpropagation (from the conv layers down to the data layer). So yes, I think you can do it either way: backprop layer by layer, or with a custom Python layer.
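For reference, a custom Python loss layer along these lines follows the pattern of Caffe's examples/pycaffe/layers/pyloss.py. The sketch below is a generic Euclidean (content) loss under that pattern, not the exact layer used in this thread:

```python
import numpy as np
import caffe

class EuclideanContentLossLayer(caffe.Layer):
    """L2 loss between a conv activation (bottom[0]) and a fixed precomputed target (bottom[1])."""

    def setup(self, bottom, top):
        if len(bottom) != 2:
            raise Exception("Need two bottoms: activations and precomputed target.")

    def reshape(self, bottom, top):
        if bottom[0].data.shape != bottom[1].data.shape:
            raise Exception("Both bottoms must have the same shape.")
        self.diff = np.zeros_like(bottom[0].data, dtype=np.float32)
        top[0].reshape(1)          # the loss is a scalar

    def forward(self, bottom, top):
        self.diff[...] = bottom[0].data - bottom[1].data
        top[0].data[...] = np.sum(self.diff ** 2) / bottom[0].num / 2.0

    def backward(self, top, propagate_down, bottom):
        # only the conv activations (bottom[0]) normally receive a gradient;
        # the precomputed target (bottom[1]) is usually excluded via propagate_down
        for i in range(2):
            if not propagate_down[i]:
                continue
            sign = 1 if i == 0 else -1
            bottom[i].diff[...] = sign * self.diff / bottom[i].num
```

To wire such a layer in, the prototxt declares a layer of type "Python" whose python_param names the module and class, with the conv blob and the precomputed target as its two bottoms.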

jiawei357 commented 8 years ago

Not sure if I stated that clearly enough.

hermitman commented 8 years ago

@jiawei357

Thanks for the clarification. So in your implementation, you have multiple loss computations at different conv layers, which are used as the content/style layers?

Could I take a look at your network prototxt? I think that should answer the question xD

jiawei357 commented 8 years ago

I'm currently on spring break, so I can't send you my prototxt right now. What I did is have multiple input layers: one for the white-noise image, others for the precomputed style Gram matrices / content activations, and a custom layer that takes the outputs of all those input layers and the conv layers.

hermitman commented 8 years ago

@jiawei357 and your loss layer will do backpropagation from the end of the network to the input?

jiawei357 commented 8 years ago

In the custom loss layer you only have to define the gradient for the bottom layers.

hermitman commented 8 years ago

What is the bottom layer? The input?

hermitman commented 8 years ago

I mean the input layers that connect to your loss layer. So if you have one style layer and one content layer connected to the custom loss, then your loss will backprop into each of them, respectively.

fzliu commented 8 years ago

Hey there - didn't get a chance to read through the whole thread, but this might be of interest to you: https://github.com/fzliu/style-transfer/tree/gram-layer.

jiawei357 commented 8 years ago

The bottom layers also include the conv layers of your network; that's where we want to set the gradient. The gradients for the input layers can be set to zero.

hermitman commented 8 years ago

@fzliu I just had a question about how the "grad" in the master branch's style_optfn gets used; from the code, I do not see where the computed "grad" is referenced.

hermitman commented 8 years ago

@jiawei357

hmm, I am still confused here = =!

So, a gradient computed at your custom loss layer will travel through all the conv layers and finally reach the input image?

hermitman commented 8 years ago

Hey, all:

I think I got the idea from reading the code in the gram layer branch. Thanks @jiawei357 @fzliu

hermitman commented 8 years ago

@fzliu one last thing: where is the protobuf that has the gramianParameter defined? I couldn't locate it, and my Caffe gives me an error for not having it.

fzliu commented 8 years ago

You'll need a custom version of Caffe which contains the necessary layer definition: https://github.com/dpaiton/caffe/tree/gramian

hermitman commented 8 years ago

Got it, thanks!

hermitman commented 8 years ago

@fzliu I ran the code on both branches, and the results are not the same. The master branch produces reasonable results, while the gram-layer branch produces a really strange result (attached image: johannesburg-starry_night-vgg19-content-1e4-256).

hermitman commented 8 years ago

@fzliu After some digging, I think the problem is that the network is not using the style loss at all; I can reproduce the result above when I turn off the gradient update from the style layers. Any ideas?

hermitman commented 8 years ago

@jiawei357 are you using the same gramian layer implementation from https://github.com/dpaiton/caffe/tree/gramian?

hermitman commented 8 years ago

@fzliu More observations:

I did a validation of the gramian layer's output in the modified Caffe, and I think it is not correct: if I compare the output of the gramian layer against the matrix product computed directly from the convolution layer's output, the numbers do not match.

e.g.

In the network, there is a connection: conv1_1 -> conv1_1/gramian.

The output of conv1_1/gramian should be the inner product of conv1_1's output. However, the result does not match the manual computation from conv1_1 using scipy.sgemm.
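For anyone who wants to reproduce the check, something like the sketch below works. It assumes `net` is an already-initialized caffe.Net containing the gramian layer, that the gramian blob is named conv1_1/gramian as above, and that the layer outputs a raw, unnormalized Gram matrix of shape (channels, channels); any scaling or reshaping the layer applies would need to be accounted for.

```python
import numpy as np

net.forward()

# flatten conv1_1's feature maps to (channels, height * width)
feats = net.blobs["conv1_1"].data[0]
F = feats.reshape(feats.shape[0], -1)

G_manual = F.dot(F.T)                            # Gram matrix computed by hand
G_layer = net.blobs["conv1_1/gramian"].data[0]   # output of the gramian layer

print(np.abs(G_manual - G_layer).max())          # ~0 if the two agree
```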

Am I the only one having problems with the Gramian layer?

Thanks,

fzliu commented 8 years ago

Try this one instead: https://github.com/fzliu/caffe/tree/gram-layer. I don't quite remember how it's different, but I remember making some minor changes to the original gram layer implementation. I'll look into merging it into dpaiton's branch soon.

hermitman commented 8 years ago

@fzliu It works this time, thanks. I took a look at the layer implementation but could not find an obvious difference; I think the main issue might be how the pointers or data dimensions are handled.

I do have another question I want to ask:

How does Python decide when to assign by reference and when to copy? I found several places in the code where you use .copy() and other places where you use plain assignment. When we copy Caffe's blobs, do we need to use assignment or .copy()?

I found several instances of these operations:

  1. {master branch} -> style.py:184, you used grad as a "pointer" to update the network's blob directly.
  2. {gram-layer} -> style.py:405, you copied data from one blob to another.
  3. {master branch} -> style.py:139, you explicitly used shallow copy to create a copy of the blob and manipulate the copy's values.

In 1 and 2, the assignment obviously behaves differently: in 1 it acts as a reference, while in 2 it acts as a copy.

I thought I understood Python assignment and copying well, but I find it hard to tell these situations apart... = =! Please teach me.

Thanks,
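A minimal numpy illustration of the distinction being asked about (not code from style.py):

```python
import numpy as np

blob_diff = np.zeros((1, 64, 14, 14), dtype=np.float32)  # stands in for net.blobs[layer].diff

a = blob_diff[0]                         # indexing builds a new view object each time...
b = blob_diff[0]
print(id(a) == id(b))                    # False -- so comparing id()s is misleading
print(np.shares_memory(a, blob_diff))    # True  -- but both alias the blob's memory

a += 1.0                                 # writing through a view modifies the blob
print(blob_diff.max())                   # 1.0

c = blob_diff[0].copy()                  # .copy() allocates independent memory
c += 1.0
print(blob_diff.max())                   # still 1.0 -- the blob is untouched by c
```

So plain assignment or indexing gives an aliasing view (the "pointer" behaviour in case 1), while .copy() gives an independent snapshot (case 3); which one is appropriate depends on whether the code intends to write back into the blob.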

boother commented 8 years ago

Hi, everybody! Could somebody share the prototxt mentioned above? Thanks!

lgyhero commented 7 years ago

@jiawei357 Could you please tell me your e-mail address? I've also defined a custom layer using PyCaffe, but ran into some trouble when overriding the backward() function. Hope to get some advice from you. Thanks!