KieranLitschel / XSWEM

A simple and explainable deep learning model for NLP.
MIT License

Dropout before max pooling killing embedding components during training #10

Closed KieranLitschel closed 3 years ago

KieranLitschel commented 3 years ago

When a unit is dropped out, its value is set to 0. As we apply dropout directly to the word embeddings, for long input sequences it becomes increasingly likely that at least one value in each dimension will be set to zero. This means that negative components can often die: if every value a dimension takes across the sequence is negative, a zero introduced by dropout becomes the maximum, so those components get stuck at negative values and stop receiving gradient updates.
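To make the mechanism concrete, here is a toy TensorFlow sketch (TensorFlow chosen for illustration; the shapes and values are made up, not taken from the model). With dropout rate p and sequence length n, the chance that at least one of a dimension's n values is zeroed is 1 - (1 - p)^n, which approaches 1 quickly for long sequences; once that happens in a dimension whose values are all negative, the dropped zero wins the max and no gradient reaches the embeddings for that dimension.

```python
import tensorflow as tf

# Toy example: one sequence of 6 tokens, a single embedding dimension
# whose values are all negative (illustrative numbers only).
emb = tf.Variable([[-0.3], [-0.1], [-0.4], [-0.2], [-0.5], [-0.6]])

with tf.GradientTape() as tape:
    dropped = tf.nn.dropout(emb, rate=0.8)   # zeroes each value with probability 0.8
    pooled = tf.reduce_max(dropped, axis=0)  # SWEM-max over the sequence
    loss = tf.reduce_sum(pooled)

print(pooled.numpy())                    # almost certainly [0.] -- a dropped zero wins the max
print(tape.gradient(loss, emb).numpy())  # then all zeros: no gradient reaches the embeddings
```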

This is particularly problematic because our distribution for initializing the embeddings is centred at zero, meaning around half of the components are initialized to values less than zero. The histogram below exemplifies this issue.

[Image: histogram of the embedding weights]

One possible solution is to initialize all embedding weights to values greater than zero. This should significantly reduce the number of dying units, but units can still die if an update pushes them below zero.
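A minimal sketch of that first option, assuming the embeddings use a standard zero-centred uniform initializer such as the Keras default of [-0.05, 0.05] (an assumption about this model's setup, not confirmed here): keep the same width but shift the range to [0, 0.1] so no component starts negative.

```python
from tensorflow import keras

# Assumed current setup: the default Keras Embedding initializer,
# uniform in [-0.05, 0.05], so roughly half the components start negative.
zero_centred_init = keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)

# Sketch of the proposed fix: same width, shifted so every component starts >= 0.
positive_init = keras.initializers.RandomUniform(minval=0.0, maxval=0.1)

# Illustrative vocabulary size and embedding dimension.
embedding = keras.layers.Embedding(20000, 300, embeddings_initializer=positive_init)
```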

A better solution would be to ignore zeros during the max-pooling operation. But this may slow down training significantly, which would make the first solution preferable.
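A rough sketch of that second option (not the implementation used in XSWEM): mask exact zeros to a very negative value before pooling so they can never win the max.

```python
import tensorflow as tf

def max_pool_ignoring_zeros(x):
    """Max over the sequence axis, treating exact zeros (the values
    introduced by dropout) as if they were absent.

    Sketch only: assumes x has shape (batch, sequence, embedding_dim).
    """
    very_negative = -1e9 * tf.ones_like(x)
    masked = tf.where(tf.equal(x, 0.0), very_negative, x)
    return tf.reduce_max(masked, axis=1)
```

One caveat: this treats any exact zero as dropped, including components that happen to be genuinely zero.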

KieranLitschel commented 3 years ago

It seems the main cause of the above distribution was too high a dropout rate. We were using a dropout rate of 0.8; after switching to a rate of 0.2 we get the distribution below, which looks much better.

[Image: histogram of the embedding weights with a dropout rate of 0.2]
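One rough way to see why the rate matters so much (assuming element-wise dropout and an illustrative sequence length, neither taken from the model): at a rate of 0.8 only about a fifth of the sequence's values in a given dimension survive to compete in the max on any step, versus about four-fifths at 0.2.

```python
# Illustrative only: expected number of tokens per dimension that survive
# dropout and can therefore compete in the max-pool on a single step.
seq_len = 100  # assumed sequence length, not taken from the model
for rate in (0.8, 0.2):
    surviving = (1 - rate) * seq_len
    print(f"rate={rate}: ~{surviving:.0f}/{seq_len} values visible to the max-pool")
# rate=0.8: ~20/100 values visible to the max-pool
# rate=0.2: ~80/100 values visible to the max-pool
```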

We explored shifting the centre of the initialization right by 0.05 so that all initialized values would be greater than or equal to zero. The distribution with this modification is shown below.

[Image: histogram of the embedding weights with the initialization centre shifted to 0.05]

We observe the same pattern as with the zero-centred distribution, with half the values appearing to have stayed at their initialized values. Surprisingly, the two distributions look very similar, just with the centre shifted.

So it now seems more likely that this behaviour is caused by the max-pooling layer, with many of the values simply never being selected by the max, and hence never updated, during training.
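That reading is consistent with how max-pooling back-propagates: even without dropout, only the word that achieves the maximum in each dimension receives a gradient on a given example, so most components are updated rarely, if at all. A tiny check (illustrative values):

```python
import tensorflow as tf

# 3 tokens, 2 embedding dimensions (illustrative values).
emb = tf.Variable([[0.3, -0.1],
                   [0.1,  0.4],
                   [-0.2, 0.2]])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.reduce_max(emb, axis=0))  # SWEM-max, no dropout
print(tape.gradient(loss, emb).numpy())
# [[1. 0.]
#  [0. 1.]
#  [0. 0.]]  -- only the max word in each dimension gets a gradient
```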

Hence this seems to be a property of SWEM-max rather than a bug, so we are closing this issue.