faustomilletari / VNet

GNU General Public License v3.0

+/- sign in backward phase of the dice layer #32

Open ntajbakhsh opened 7 years ago

ntajbakhsh commented 7 years ago

Hi Fausto, I wonder why you have a "+=" operator for feature map 0 (background), but a "-=" operator for feature map 1 (foreground):

    bottom[btm].diff[i, 0, :] += 2.0 * ((self.gt[i, :] * self.union[i]) / ((self.union[i]) ** 2) - 2.0 * prob[i, 1, :] * (self.intersection[i]) / ((self.union[i]) ** 2))

    bottom[btm].diff[i, 1, :] -= 2.0 * ((self.gt[i, :] * self.union[i]) / ((self.union[i]) ** 2) - 2.0 * prob[i, 1, :] * (self.intersection[i]) / ((self.union[i]) ** 2))

Also, dice is supposed to get maximized during the training process, but it appears that it is treated as a loss metric in your code (code seems to attempt to minimize it).

Could you please help me with the above 2 questions?

Nima

faustomilletari commented 7 years ago

the answers are:

yes. (it’s just a trick to speed things up a little. it pushes in the right direction with respect to both background and foreground.)

yes. (minimising -dice)
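For reference, the dice coefficient discussed here is the squared-denominator form from the V-Net paper, and its derivative is what the backward code above computes (with union = Σp² + Σg² and intersection = Σpg):

```latex
D = \frac{2\sum_i p_i g_i}{\sum_i p_i^2 + \sum_i g_i^2},
\qquad
\frac{\partial D}{\partial p_j}
  = 2\,\frac{g_j\left(\sum_i p_i^2 + \sum_i g_i^2\right) - 2 p_j \sum_i p_i g_i}
            {\left(\sum_i p_i^2 + \sum_i g_i^2\right)^2}.
```

Term by term this matches the `2.0 * ((gt * union) / union**2 - 2.0 * prob * intersection / union**2)` expression in the layer; flipping its sign gives the derivative of -dice.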


ntajbakhsh commented 7 years ago

Thanks for the prompt reply!

Theoretically, both operators should be +=, right? I mean the dice metric is class agnostic, so I assume the operator should be the same for both foreground and background, or am I missing something?

Could you point out where in the code you tell it to minimize the (- dice)? In the forward pass, you compute +dice, and I don't see it in the backward phase either.

faustomilletari commented 7 years ago

in Caffe one must write one's own derivatives. therefore if you take the derivative of dice and add a minus in front of it you get the derivative of -dice. actually I recall that Caffe always optimises by trying to find a minimum, therefore in this case we should give it the derivative of -dice, which we do.

the fact that in the forward pass I'm displaying the dice does not matter. I could display something completely arbitrary in the forward pass and still optimise correctly, as long as my derivatives are correct in the backward pass.

of course, this holds only if subsequent layers do not depend on the forward pass of this layer :P which is not an issue in this case (it is the last layer).

The code is actually not that incredibly well written. if this were done in TensorFlow, for example, auto-differentiation would force everything to be consistent/correct.
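A quick way to convince oneself about the sign is to check the analytic gradient against a numerical one. A minimal sketch, assuming the squared-denominator dice from the V-Net paper and plain NumPy (no Caffe), with all names illustrative:

```python
import numpy as np

# Squared-denominator dice from the V-Net paper:
# D = 2 * sum(p*g) / (sum(p^2) + sum(g^2))
def dice(p, g):
    return 2.0 * np.sum(p * g) / (np.sum(p * p) + np.sum(g * g))

# Analytic gradient, the same expression as in the layer's backward pass
def dice_grad(p, g):
    union = np.sum(p * p) + np.sum(g * g)
    inter = np.sum(p * g)
    return 2.0 * (g * union - 2.0 * p * inter) / union ** 2

rng = np.random.default_rng(0)
p = rng.random(16)                          # mock foreground probabilities
g = (rng.random(16) > 0.5).astype(float)    # mock binary ground truth

# Central finite differences, one coordinate at a time
eps = 1e-6
numerical = np.zeros_like(p)
for j in range(p.size):
    pp, pm = p.copy(), p.copy()
    pp[j] += eps
    pm[j] -= eps
    numerical[j] = (dice(pp, g) - dice(pm, g)) / (2.0 * eps)
```

The finite-difference result matches `dice_grad`, i.e. that expression really is the direction of increasing dice; writing it into `diff` with `-=` (as the layer does) hands a minimising solver the gradient of -dice.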


ntajbakhsh commented 7 years ago

Thanks. So the -= operator is acting as multiplying the gradient by -1, which should have been done for both bottom[btm].diff[i, 0, :] and bottom[btm].diff[i, 1, :], but you intentionally did not do it for the background as a trick to speed up training. Is my understanding correct?

faustomilletari commented 7 years ago

Actually for the background I'm just trying to make it do the opposite of what the foreground is doing, intuitively.

One even better thing to do could be to write the softmax only for one channel (foreground) and take the derivative of that expression together with the softmax (all in one layer). I think that would be better from every point of view.
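A hypothetical sketch of that suggestion, using a sigmoid as the one-channel "softmax" and pushing the dice gradient through it with the chain rule (names and shapes are illustrative, not the repo's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dice(p, g):
    # Squared-denominator dice, as in the V-Net paper
    return 2.0 * np.sum(p * g) / (np.sum(p * p) + np.sum(g * g))

def fused_backward(z, g):
    # Gradient of dice(sigmoid(z), g) w.r.t. the logits z, in one layer:
    # d dice / dz_j = (d dice / dp_j) * sigmoid'(z_j) = (d dice / dp_j) * p_j * (1 - p_j)
    p = sigmoid(z)
    union = np.sum(p * p) + np.sum(g * g)
    inter = np.sum(p * g)
    ddice_dp = 2.0 * (g * union - 2.0 * p * inter) / union ** 2
    return ddice_dp * p * (1.0 - p)

rng = np.random.default_rng(1)
z = rng.normal(size=8)                      # mock foreground logits
g = (rng.random(8) > 0.5).astype(float)     # mock binary ground truth

# Finite-difference check of the fused gradient w.r.t. the logits
eps = 1e-6
numerical = np.zeros_like(z)
for j in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numerical[j] = (dice(sigmoid(zp), g) - dice(sigmoid(zm), g)) / (2.0 * eps)
```

Fusing the two derivatives like this keeps forward and backward consistent by construction, which is presumably what "all in 1 layer" is getting at.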


ntajbakhsh commented 7 years ago

All your explanations started to make sense after I re-read the paper and did the math by myself. Thanks!

rishabhsshah commented 6 years ago

Hi Fausto,

Firstly, thanks for open-sourcing the entire implementation of dice loss. It was very helpful. However, I am facing a few issues in understanding it.

I am unable to understand why you added the minus sign "-=" for the feature map 1 (foreground) in the backward pass of the Dice loss layer in python.

    bottom[btm].diff[i, 1, :] -= 2.0 * ((self.gt[i, :] * self.union[i]) / ((self.union[i]) ** 2) - 2.0 * prob[i, 1, :] * (self.intersection[i]) / ((self.union[i]) ** 2))

Ideally, we are supposed to be maximizing the Dice score or minimize the -(Dice score). The derivative in the above formula is actually doing the opposite of maximizing the dice score. It is minimizing the dice score.

Whenever we compute the derivative of a function with respect to an input, it points us in the direction of increase in the function's value. Therefore multiplying the derivative by a "-" sign points us in the direction of minimizing it.

The gradient of the dice score can be calculated by the following formula:

    2.0 * ((self.gt[i, :] * self.union[i]) / ((self.union[i]) ** 2) - 2.0 * prob[i, 1, :] * (self.intersection[i]) / ((self.union[i]) ** 2))

That is the direction of increasing dice score. When we use the "-=" sign in the above code line, we are forcing the network to minimize the dice score.

Please help me understand where I am going wrong.

Thanks, Rishabh

sara-eb commented 5 years ago

@faustomilletari @ntajbakhsh Hi everyone, I went over all the comments here. I am implementing a generalized dice loss python layer. I have extracted the labels so the problem can be treated as binary segmentation, but now I have some misunderstanding about the background labels, which have value 0 and correspond to channel [0] in a subvolume of shape NxCxDxWxH (C stands for the number of classes, 5, and D stands for the depth of the subvolume).

According to this line, for binary segmentation with a blob of 2 feature maps it is clear how to do that, but what about generalized dice loss? Since I have 4 feature maps for the foreground classes, how can I calculate the diff for the background here?

Your expert opinion is really appreciated. Thanks!
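Not from this repo, but one common way to handle the multi-class case (following Sudre et al.'s generalised dice formulation) is to treat the background as just another channel and weight each class by the inverse square of its volume. A hedged NumPy sketch, with all names illustrative:

```python
import numpy as np

def generalized_dice(probs, onehot, eps=1e-6):
    # probs, onehot: shape (C, N) — per-class probabilities and one-hot
    # ground truth, flattened over the D*H*W voxels. Channel 0 can simply
    # be the background; it needs no special-case handling here.
    w = 1.0 / (np.sum(onehot, axis=1) ** 2 + eps)   # per-class volume weights
    inter = np.sum(w * np.sum(probs * onehot, axis=1))
    union = np.sum(w * np.sum(probs + onehot, axis=1))
    return 2.0 * inter / union
```

A perfect prediction scores 1, and the diff for each channel (background included) can be taken as the derivative of this single scalar with respect to that channel's probabilities.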

sara-eb commented 5 years ago

Hi everyone,

I really need help with this. I have gone through different implementations, including the one the user mentioned here, but I am not sure whether my implementation is correct. The values of the loss are between 0 and 1, and at some points the loss becomes 1.

I implemented the loss explained in this reference: the Tversky loss, a generalized form of dice loss, which is identical to dice loss when alpha=beta=0.5. I implemented Equation 3 and calculated the backward pass according to Equation 4, which is the gradient of the loss in Equation 3 with respect to p0_i (the output of the softmax layer, i.e., the soft output), the probability of voxel i being the object of interest (lesion). The authors also give Equation 5, which is the gradient of the loss in Equation 3 with respect to p1_i, the probability of voxel i being non-lesion.

I created the one-hot encoding of the ground truth and saved the gradients into two tensors, dTdp0 (gradient with respect to p0) and dTdp1 (gradient with respect to p1). In the backward pass I only send dTdp0, since I was not sure how to combine these two gradients in the gradient calculation:

    def backward(self, top, propagate_down, bottom):
        for i in range(0, bottom[0].diff.shape[1]):
            bottom[0].diff[:, i, :, :, :] -= self.dTdp0[:, i, :, :, :]

My questions are:

  1. How should I differentiate between the gradient of the background (label 0) and the gradients of the other object classes, as discussed earlier in this thread?

  2. What about dTdp1? How can I include it in the gradient calculation?

Your help is really appreciated.
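One way to address question 2, sketched under the assumption of a two-channel blob laid out like the original dice layer (channel 0 = background, channel 1 = foreground), is to route each gradient to its matching channel rather than writing dTdp0 everywhere. This is only a hypothetical sketch, not the repo's code:

```python
import numpy as np

def backward_two_channel(diff, dTdp0, dTdp1):
    # diff: (N, 2, D, H, W) gradient blob. Channel 1 carries the foreground
    # probability p0 and channel 0 the background probability p1, so each
    # channel receives the Tversky gradient taken w.r.t. its own probability.
    diff[:, 0, ...] -= dTdp1   # background channel gets the p1 gradient
    diff[:, 1, ...] -= dTdp0   # foreground channel gets the p0 gradient
    return diff
```

With this routing, both Equation 4 and Equation 5 are used, and neither channel receives a gradient taken with respect to the other channel's probability.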