Shark-ML / Shark

The Shark Machine Learning Library. See more:
http://shark-ml.github.io/Shark/
GNU Lesser General Public License v3.0

DropoutLayer and HuberLoss #250

Open shlublu opened 5 years ago

shlublu commented 5 years ago

A dropout layer drops its input, i.e. sets it to 0 with a given probability.

I have two things to report here:

  1. I understood that dropping out an input should be compensated by boosting the output of that neuron's activation function by the equivalent of the input signal that was lost.

_Hinton: http://videolectures.net/nips2012_hinton_networks/ iamtrask.io / A. Karpathy : https://iamtrask.github.io/2015/07/28/dropout/ (Ctrl/F: "EDIT: Line 9")_

This is roughly equivalent to setting the dropped-out inputs to the mean of the unchanged inputs instead of zeroing them.

Does shark::DropoutLayer do such a thing? If not (and if I am not mistaken) it would be valuable to implement it.

  2. I investigated Shark's (4.0.0) code because I had trouble using DropoutLayer: my networks always produced a constant output (the last hidden layer's bias, I guess). I found out that the probability given at DropoutLayer's construction is not the probability of an entry being dropped out, but the probability of it being kept.

Basically, DropoutLayer(layerShape, useDropout ? 0.5 : 0.0) drops 100% of the inputs when useDropout is false, whereas one could understand (at least I did :p) the opposite. I just fixed that by writing DropoutLayer(layerShape, useDropout ? 0.5 : 1.0) instead, but it would be worth documenting it that way.
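To illustrate what confused me, here is a minimal, Shark-independent sketch of a "keep probability" (the mask logic and names below are my own illustration, not Shark's code):

```cpp
// Illustration only: the constructor argument behaves like a "keep" probability,
// so 1.0 means "keep everything" (no dropout) and 0.0 means "drop everything".
#include <iostream>
#include <random>
#include <vector>

std::vector<double> applyDropout(std::vector<double> input, double keepProbability, std::mt19937& rng) {
    std::bernoulli_distribution keep(keepProbability);
    for (double& x : input) {
        if (!keep(rng)) x = 0.0; // dropped with probability 1 - keepProbability
    }
    return input;
}

int main() {
    std::mt19937 rng(42);
    std::vector<double> activations{1.0, 2.0, 3.0, 4.0};

    for (double x : applyDropout(activations, 1.0, rng)) std::cout << x << ' '; // 1 2 3 4: nothing dropped
    std::cout << '\n';
    for (double x : applyDropout(activations, 0.0, rng)) std::cout << x << ' '; // 0 0 0 0: everything dropped
    std::cout << '\n';
}
```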

Please note that, due to its nature, the error function (the Huber loss) is not scale invariant; rescaling the dataset therefore changes the behaviour.

Ulfgard commented 5 years ago

We implement dropout as introduced in the original paper, section 4, page 1933 (or 5 in the PDF): http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf. The p there describes the probability for a neuron to not be dropped out, therefore p=1 means "all neurons are in".
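Roughly, the feed-forward rule in that section is (p being the retention probability, i.e. the same p as the constructor argument):

```latex
% Training (unit j of layer l is retained with probability p):
r^{(l)}_j \sim \mathrm{Bernoulli}(p), \quad
\tilde{y}^{(l)} = r^{(l)} \ast y^{(l)}, \quad
z^{(l+1)}_i = \mathbf{w}^{(l+1)}_i \tilde{y}^{(l)} + b^{(l+1)}_i, \quad
y^{(l+1)}_i = f\bigl(z^{(l+1)}_i\bigr)

% Testing (full network, weights scaled down):
W^{(l)}_{\mathrm{test}} = p\, W^{(l)}
```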

I think the implementation you link to from A. Karpathy confuses two parts:

  1. the dropout training part
  2. approximating dropout using rescaling with (1/p) for deterministic testing.

//edit: if you find a publication showing that the other implementation is now state-of-the-art, we will of course adapt it.
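To make the distinction between those two parts concrete, here is a minimal, library-agnostic sketch of the two conventions (plain C++, not Shark API; all names are mine):

```cpp
#include <random>
#include <vector>

// Standard dropout, as in the paper: drop during training, then scale the
// learned weights (or, equivalently, the activations) by p at test time.
std::vector<double> standardDropoutTrain(std::vector<double> a, double p, std::mt19937& rng) {
    std::bernoulli_distribution keep(p);
    for (double& x : a) if (!keep(rng)) x = 0.0;
    return a;
}
std::vector<double> standardDropoutTest(std::vector<double> a, double p) {
    for (double& x : a) x *= p; // compensation happens at test time
    return a;
}

// "Inverted" dropout (the variant used in the blog post linked above):
// rescale the kept activations by 1/p during training, so nothing special
// has to be done at test time.
std::vector<double> invertedDropoutTrain(std::vector<double> a, double p, std::mt19937& rng) {
    std::bernoulli_distribution keep(p);
    for (double& x : a) x = keep(rng) ? x / p : 0.0;
    return a;
}
std::vector<double> invertedDropoutTest(std::vector<double> a, double /*p*/) {
    return a; // identity: compensation already happened during training
}

int main() {
    std::mt19937 rng(1);
    std::vector<double> a{2.0, 4.0, 6.0};
    const double p = 0.5;
    auto trainStd = standardDropoutTrain(a, p, rng); // some entries zeroed
    auto testStd  = standardDropoutTest(a, p);       // 1 2 3: scaled by p
    auto trainInv = invertedDropoutTrain(a, p, rng); // kept entries doubled, others zeroed
    auto testInv  = invertedDropoutTest(a, p);       // 2 4 6: unchanged
    (void)trainStd; (void)testStd; (void)trainInv; (void)testInv;
}
```

In both conventions the expected activation seen by the next layer is the same at training and test time; the only difference is where the factor p (or 1/p) is applied.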

Ulfgard commented 5 years ago

Regarding your Huber loss question: the network, when trained with dropout, will rescale its weights accordingly (if that is even needed; there are other solutions, e.g. having several correlated neurons so that on average their sum is approximately the correct answer). So there is no scaling issue.

However, using dropout on the output layer is never a good idea, as it is impossible for the network to compensate for it. And dropping out on any hidden layer will in general not change the scale of the output.

shlublu commented 5 years ago

Thanks a lot for your replies!

Regarding the Huber loss, I explained it badly: I don't drop out the output layer (for the reason you gave), but the hidden layers' outputs. So that's fine: you answered my question.

Regarding the dropout layer:

Thanks again for your time!

shlublu commented 5 years ago

OK, I think we were talking about two ways of doing the same thing, yours being the standard one, not mine.

Reading pages 1930 to 1932 of the JMLR dropout paper (the document you linked, which was also the basis of the Hinton lecture I referred to in my first post), we see that:

So the compensation I suggested in my initial post, based on Karpathy's work, does not strictly follow the paper even though it achieves the same thing: instead of applying the factor p to the weights at test time, a factor compensating for the dropped-out units is applied to the activation function at training time.

Sorry about that; I should have asked my question differently: how do I apply the multiplying factor p-Train to the weights when testing with p-Test = 1, so as to make the outputs consistent? I don't think DropoutLayer does that (or I didn't find how), but is there any standard way to do this with Shark?
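In case it helps to clarify what I mean, here is a generic sketch of that test-time rescaling, written against a plain weight matrix rather than Shark's API (I don't know which Shark call would expose this, so the types and names below are purely illustrative):

```cpp
#include <vector>

// Purely illustrative, not Shark API: a dense layer's weights as a row-major
// matrix, with biases assumed to be stored elsewhere.
using Matrix = std::vector<std::vector<double>>;

// Reproduces the paper's test-time rule W_test = pTrain * W.
Matrix scaleWeightsForTesting(Matrix weights, double pTrain) {
    for (auto& row : weights)
        for (double& w : row)
            w *= pTrain; // only the weights are scaled, not the biases
    return weights;
}

int main() {
    Matrix w{{1.0, 2.0}, {3.0, 4.0}};
    Matrix wTest = scaleWeightsForTesting(w, 0.5); // {{0.5, 1.0}, {1.5, 2.0}}
    (void)wTest;
}
```

Equivalently, one could leave the weights alone and multiply the incoming activations by p-Train at test time.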

Thanks!

Ulfgard commented 5 years ago

Hi,

I think that is currently not implemented. It is a combination of an oversight (we had that before) and of the realisation that the approximation is often not very good. I might reimplement it soon.

shlublu commented 5 years ago

Thanks a lot. Compensating for dropped-out units at training time is certainly easier to implement (that's a few lines of code), though I would understand if you preferred the approximation at testing and production time, since that is the way it is described in the paper. Anyway, dropout is hardly usable without either of these.