Shark-ML / Shark

The Shark Machine Learning Library. See more:
http://shark-ml.github.io/Shark/
GNU Lesser General Public License v3.0

DropoutLayer and HuberLoss #250

Open shlublu opened 5 years ago

shlublu commented 5 years ago

A dropout layer drops its input, i.e. sets it to 0 with a given probability.

I have two things to report here:

  1. I understood that dropping out an input should be compensated by boosting the output of that neuron's activation function by the equivalent of the input signal that was lost.

_Hinton: http://videolectures.net/nips2012_hinton_networks/ iamtrask.io / A. Karpathy : https://iamtrask.github.io/2015/07/28/dropout/ (Ctrl/F: "EDIT: Line 9")_

This is roughly equivalent to setting the dropped-out inputs to the mean of the unchanged inputs instead of zeroing them.

Does shark::DropoutLayer do such a thing? If not (and if I am not mistaken) it would be valuable to implement it.

  2. I investigated Shark's (4.0.0) code because I had trouble using DropoutLayer: my networks always produced a constant output (the last hidden layer's bias, I guess). I found out that the probability given at DropoutLayer's construction is not the probability of an entry being dropped out, but the probability of it being kept.

Basically, DropoutLayer(layerShape, useDropout ? 0.5 : 0.0) drops 100% of the inputs when useDropout is false, whereas one could understand (at least I did :p) the opposite. I just fixed that by writing DropoutLayer(layerShape, useDropout ? 0.5 : 1.0) instead, but it would be worth documenting it that way.
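To illustrate what confused me, here is a minimal, Shark-independent sketch of a "keep probability" (the mask logic and names below are my own illustration, not Shark's code):

```cpp
// Illustration only: the constructor argument behaves like a "keep" probability,
// so 1.0 means "keep everything" (no dropout) and 0.0 means "drop everything".
#include <iostream>
#include <random>
#include <vector>

std::vector<double> applyDropout(std::vector<double> input, double keepProbability, std::mt19937& rng) {
    std::bernoulli_distribution keep(keepProbability);
    for (double& x : input) {
        if (!keep(rng)) x = 0.0; // dropped with probability 1 - keepProbability
    }
    return input;
}

int main() {
    std::mt19937 rng(42);
    std::vector<double> activations{1.0, 2.0, 3.0, 4.0};

    for (double x : applyDropout(activations, 1.0, rng)) std::cout << x << ' '; // 1 2 3 4: nothing dropped
    std::cout << '\n';
    for (double x : applyDropout(activations, 0.0, rng)) std::cout << x << ' '; // 0 0 0 0: everything dropped
    std::cout << '\n';
}
```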

Please note that, due to its nature, the error function (the Huber loss) is not scale invariant; rescaling the dataset therefore changes the behaviour.

Ulfgard commented 5 years ago

We implement dropout as introduced in the original paper, section 4, page 1933 (or 5 in the PDF): http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf. The p there describes the probability for a neuron to not be dropped out, therefore p=1 means "all neurons are in".
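Roughly, the feed-forward rule in that section is (p being the retention probability, i.e. the same p as the constructor argument):

```latex
% Training (unit j of layer l is retained with probability p):
r^{(l)}_j \sim \mathrm{Bernoulli}(p), \quad
\tilde{y}^{(l)} = r^{(l)} \ast y^{(l)}, \quad
z^{(l+1)}_i = \mathbf{w}^{(l+1)}_i \tilde{y}^{(l)} + b^{(l+1)}_i, \quad
y^{(l+1)}_i = f\bigl(z^{(l+1)}_i\bigr)

% Testing (full network, weights scaled down):
W^{(l)}_{\mathrm{test}} = p\, W^{(l)}
```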

I think the implementation you link to from A. Karpathy confuses two parts:

  1. the dropout training part
  2. approximating dropout using rescaling with (1/p) for deterministic testing.

//edit: if you find a publication showing that the other implementation is now state-of-the-art, we will of course adapt it.
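To make the distinction between those two parts concrete, here is a minimal, library-agnostic sketch of the two conventions (plain C++, not Shark API; all names are mine):

```cpp
#include <random>
#include <vector>

// Standard dropout, as in the paper: drop during training, then scale the
// learned weights (or, equivalently, the activations) by p at test time.
std::vector<double> standardDropoutTrain(std::vector<double> a, double p, std::mt19937& rng) {
    std::bernoulli_distribution keep(p);
    for (double& x : a) if (!keep(rng)) x = 0.0;
    return a;
}
std::vector<double> standardDropoutTest(std::vector<double> a, double p) {
    for (double& x : a) x *= p; // compensation happens at test time
    return a;
}

// "Inverted" dropout (the variant used in the blog post linked above):
// rescale the kept activations by 1/p during training, so nothing special
// has to be done at test time.
std::vector<double> invertedDropoutTrain(std::vector<double> a, double p, std::mt19937& rng) {
    std::bernoulli_distribution keep(p);
    for (double& x : a) x = keep(rng) ? x / p : 0.0;
    return a;
}
std::vector<double> invertedDropoutTest(std::vector<double> a, double /*p*/) {
    return a; // identity: compensation already happened during training
}

int main() {
    std::mt19937 rng(1);
    std::vector<double> a{2.0, 4.0, 6.0};
    const double p = 0.5;
    auto trainStd = standardDropoutTrain(a, p, rng); // some entries zeroed
    auto testStd  = standardDropoutTest(a, p);       // 1 2 3: scaled by p
    auto trainInv = invertedDropoutTrain(a, p, rng); // kept entries doubled, others zeroed
    auto testInv  = invertedDropoutTest(a, p);       // 2 4 6: unchanged
    (void)trainStd; (void)testStd; (void)trainInv; (void)testInv;
}
```

In both conventions the expected activation seen by the next layer is the same at training and test time; the only difference is where the factor p (or 1/p) is applied.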

Ulfgard commented 5 years ago

Regarding your Huber loss question: the network, when trained with dropout, will rescale its weights accordingly (if that is even needed; there are other solutions, e.g. having several correlated neurons so that on average their sum is approximately the correct answer). So there is no scaling issue.

However, using dropout on the output layer is never a good idea, as it is impossible for the network to compensate for it. And dropping out on any hidden layer will in general not change the scale of the output.

shlublu commented 5 years ago

Thanks a lot for your replies!

Regarding the Huber loss, I explained it badly: I don't drop out the output layer (for the reason you gave), but the hidden layers' outputs. So that's fine: you answered my question.

Regarding the dropout layer:

Thanks again for your time!

shlublu commented 5 years ago

OK, I think we were talking about two ways of doing the same thing, yours being the standard one, not mine.

Reading pages 1930 to 1932 of the JMLR dropout paper (the document you linked, which was also the basis of the Hinton lecture I referred to in my first post), we see that:

So the compensation I suggested in my initial post, based on Karpathy's work, does not strictly follow the paper even though it achieves the same thing: instead of applying the factor p to the weights at test time, a factor compensating for the dropped-out units is applied to the activation function at training time.

Sorry about that; I should have asked my question differently: how do I apply the multiplying factor p-Train to the weights when testing with p-Test = 1, so as to make the outputs consistent? I don't think DropoutLayer does that (or I didn't find how), but is there any standard way to do this with Shark?
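In case it helps to clarify what I mean, here is a generic sketch of that test-time rescaling, written against a plain weight matrix rather than Shark's API (I don't know which Shark call would expose this, so the types and names below are purely illustrative):

```cpp
#include <vector>

// Purely illustrative, not Shark API: a dense layer's weights as a row-major
// matrix, with biases assumed to be stored elsewhere.
using Matrix = std::vector<std::vector<double>>;

// Reproduces the paper's test-time rule W_test = pTrain * W.
Matrix scaleWeightsForTesting(Matrix weights, double pTrain) {
    for (auto& row : weights)
        for (double& w : row)
            w *= pTrain; // only the weights are scaled, not the biases
    return weights;
}

int main() {
    Matrix w{{1.0, 2.0}, {3.0, 4.0}};
    Matrix wTest = scaleWeightsForTesting(w, 0.5); // {{0.5, 1.0}, {1.5, 2.0}}
    (void)wTest;
}
```

Equivalently, one could leave the weights alone and multiply the incoming activations by p-Train at test time.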

Thanks!

Ulfgard commented 5 years ago

Hi,

I think that is currently not implemented. It is a combination of an oversight (we had that before) and of the realisation that the approximation is often not very good. I might reimplement it soon.

shlublu commented 5 years ago

Thanks a lot. Compensating for dropped-out units at training time is certainly easier to implement (that's a few lines of code), though I would understand if you preferred the approximation at testing and production time, since that is the way it is described in the paper. Anyway, dropout is hardly usable without either of these.