Open shlublu opened 5 years ago
We implement dropout as introduced in section 4 of the original paper (page 1933, or page 5 of the PDF): http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf. The p there describes the probability for a neuron *not* to be dropped out, so p=1 means "all neurons are in".
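To make the paper's convention concrete, here is a minimal standalone sketch (plain C++, not Shark code; the function names are mine): p is the probability of *keeping* a unit, dropout is applied during training, and at test time everything is kept but scaled by p.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Training: each activation is kept with probability p, otherwise zeroed.
// With p = 1.0 nothing is ever dropped; with p = 0.0 everything is.
std::vector<double> dropout_train(const std::vector<double>& a, double p,
                                  std::mt19937& rng) {
    std::bernoulli_distribution keep(p);  // true with probability p
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = keep(rng) ? a[i] : 0.0;
    return out;
}

// Testing (paper's scheme): no unit is dropped; instead the outgoing
// contributions are scaled by p, so the expected input to the next layer
// matches what it saw during training.
std::vector<double> dropout_test(const std::vector<double>& a, double p) {
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = p * a[i];
    return out;
}
```

With p = 1.0, `dropout_train` is the identity, which matches the "p=1 is all neurons are in" reading above.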
I think the implementation you link to from A. Karpathy confuses two parts:
//edit: if you find a publication showing that the other implementation is now state of the art, we will, of course, adapt it.
Regarding your Huber loss question: when trained with dropout, the network will rescale its weights accordingly (and if that is not enough, there are other solutions, e.g. having several correlated neurons so that on average their sum is approximately the correct answer). So there is no scaling issue.
However, using dropout on the output layer is never a good idea, as it is impossible for the network to compensate for it. And dropping out on any hidden layer will in general not change the scale of the output.
Thanks a lot for your replies!
Regarding the Huber loss, I explained it badly: I don't drop out the output layer (for the reason you said), but the hidden layers' outputs. So that's fine: you answered my question.
Regarding the dropout layer:
Thanks again for your time!
Ok, I think we were speaking of two ways to do the same thing, yours being standard, not mine.
Reading pages 1930 to 1932 of the JMLR dropout paper (the document you linked, which was also the basis of the lecture by Hinton I referred to in my first post), we see that:
So the compensation I suggested in my initial post, based on Karpathy's work, is strictly speaking not what the paper describes, even though it achieves the same thing: instead of applying the factor p to the weights at test time, a factor compensating for the loss of the dropped-out units is applied to the activations at training time.
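That training-time compensation, often called "inverted dropout", can be sketched in a few lines (plain C++, not Shark code; names are mine): survivors are scaled by 1/p during training, so the expected activation equals the raw one and no test-time rescaling is needed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Inverted dropout: keep each activation with probability p and scale the
// survivors by 1/p. In expectation the output equals the input activation,
// which is exactly what scaling the weights by p at test time reproduces.
std::vector<double> inverted_dropout(const std::vector<double>& a, double p,
                                     std::mt19937& rng) {
    std::bernoulli_distribution keep(p);
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = keep(rng) ? a[i] / p : 0.0;
    return out;
}

// Empirical mean of one unit's post-dropout activation over many samples;
// with inverted dropout it converges to the raw activation itself.
double mean_after_dropout(double activation, double p, int samples) {
    std::mt19937 rng(123);  // fixed seed for a deterministic sketch
    double sum = 0.0;
    for (int s = 0; s < samples; ++s)
        sum += inverted_dropout({activation}, p, rng)[0];
    return sum / samples;
}
```

For example, `mean_after_dropout(2.0, 0.5, 200000)` stays close to 2.0, while plain (uncompensated) dropout would average about half of that.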
So sorry about that, I should have asked my question differently: how do I apply the multiplying factor `p-Train` to the weights when testing with `p-Test = 1`, to make the outputs consistent? I don't think `DropoutLayer` does that (or I didn't find how), but is there any standard way to do this with Shark?
Thanks!
Hi,
I think that is currently not implemented. It is a combination of oversight (because we had that before) and of the realisation that the approximation is often not very good. I might reimplement that soon.
Thanks a lot. The compensation of dropped-out units at training time is certainly easier to implement (that's a few lines of code), though I would understand if you preferred the approximation at testing and production time, since that is the way it is described in the paper. Anyway, dropout is hardly usable without either of these.
I have two things to report here:
_Hinton: http://videolectures.net/nips2012_hinton_networks/_
_iamtrask.io / A. Karpathy: https://iamtrask.github.io/2015/07/28/dropout/ (Ctrl+F: "EDIT: Line 9")_
This is somehow equivalent to setting the dropped out inputs to the mean of the unchanged inputs instead of zeroing them.
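That equivalence can be checked directly: for a plain (unweighted) sum over a layer's inputs, rescaling the kept inputs by 1/p gives exactly the same total as replacing each dropped input with the mean of the kept ones. A minimal sketch (plain C++, function names are mine, not Shark's):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum over `total` inputs when only `kept` survive and are rescaled by 1/p,
// with p taken here as the actually-kept fraction kept.size() / total.
double sum_rescaled(const std::vector<double>& kept, std::size_t total) {
    double p = static_cast<double>(kept.size()) / total;
    return std::accumulate(kept.begin(), kept.end(), 0.0) / p;
}

// Sum over the same `total` inputs when each dropped input is replaced by
// the mean of the kept ones instead of being zeroed.
double sum_mean_filled(const std::vector<double>& kept, std::size_t total) {
    double mean =
        std::accumulate(kept.begin(), kept.end(), 0.0) / kept.size();
    return std::accumulate(kept.begin(), kept.end(), 0.0)
         + mean * (total - kept.size());
}
```

For instance, with kept inputs {1, 2, 3} out of 6 total, both functions return 12: algebraically, S/p = S * total/kept = S + mean * (total - kept).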
Does `shark::DropoutLayer` do such a thing? If not (and if I am not mistaken), it would be valuable to implement it.

`DropoutLayer`: my networks were always producing a constant output (the last hidden layer's bias, I guess). I found out that the probability given at `DropoutLayer`'s construction is not the probability for an entry to be dropped out, but the probability for it *not* to be. Basically, `DropoutLayer(layerShape, useDropout ? 0.5 : 0.0)` drops 100% of the inputs when `useDropout` is `false`, where one could understand (at least I did :p) the opposite. I just fixed that by writing `DropoutLayer(layerShape, useDropout ? 0.5 : 1.0)` instead, but it would be worth documenting it that way.

Regarding `shark::HuberLoss`: does the non-compensated dropout alter its behaviour? The documentation says the following, while dropping out is a kind of rescaling, as the overall input of the activation function is reduced: