cchallu / nbeatsx

[Question] SELU weights and dropout #1

Open pnmartinez opened 3 years ago

pnmartinez commented 3 years ago

Hi,

My name is Pablo Navarro. Your team and I have already exchanged a few emails about the wonderful paper you've written. Thanks again for the contribution.

Now that the code is released, I have a couple of questions about the implementation of the SELU activation function.

Weight init

For SELU, you force lecun_normal, which in turn is just a pass in the init_weights() function:

import torch as t  # 't' is torch in the original code

def init_weights(module, initialization):
    if type(module) == t.nn.Linear:
        if initialization == 'orthogonal':
            t.nn.init.orthogonal_(module.weight)
        elif initialization == 'he_uniform':
            t.nn.init.kaiming_uniform_(module.weight)
        elif initialization == 'he_normal':
            t.nn.init.kaiming_normal_(module.weight)
        elif initialization == 'glorot_uniform':
            t.nn.init.xavier_uniform_(module.weight)
        elif initialization == 'glorot_normal':
            t.nn.init.xavier_normal_(module.weight)
        elif initialization == 'lecun_normal':
            pass
        else:
            assert 1<0, f'Initialization {initialization} not found'

How come the weights are initialized as lecun_normal simply by passing? On my machine, PyTorch's default nn.Linear initialization draws weights from a uniform distribution, not a normal one.
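For reference, here is a minimal sketch (my own, not from the repo; the helper name lecun_normal_ is hypothetical) of what an explicit LeCun-normal initialization would look like, i.e. a zero-mean Gaussian with std = 1/sqrt(fan_in):

import math
import torch as t

def lecun_normal_(linear):
    # LeCun normal: zero-mean Gaussian with std = 1/sqrt(fan_in),
    # the initialization assumed by the self-normalizing (SELU) derivation.
    fan_in = linear.weight.shape[1]  # nn.Linear weight is (out_features, in_features)
    t.nn.init.normal_(linear.weight, mean=0.0, std=1.0 / math.sqrt(fan_in))
    if linear.bias is not None:
        t.nn.init.zeros_(linear.bias)

By contrast, the weights left untouched by the pass branch keep PyTorch's default uniform initialization, as noted above.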

Dropout on SELU

I believe that in order to make SELU useful, you need to use AlphaDropout() instead of regular Dropout() layers (see the PyTorch docs).

I can't find anything wrapping AlphaDropout() in your code. Could you point me in the right direction or explain the rationale behind leaving it out?
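For context, a minimal sketch (my own illustration, not code from this repo) of how AlphaDropout is usually paired with SELU in PyTorch, with layer sizes chosen arbitrarily:

import torch as t

# AlphaDropout keeps the activations' zero mean / unit variance,
# which SELU's self-normalization relies on; standard Dropout breaks it.
selu_block = t.nn.Sequential(
    t.nn.Linear(64, 64),
    t.nn.SELU(),
    t.nn.AlphaDropout(p=0.1),
)

x = t.randn(32, 64)
y = selu_block(x)  # shape (32, 64)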

Cheers and keep up the good work!

kdgutier commented 3 years ago

Dropout and AlphaDropout on SELU

Thanks for the comments. As you mention, the scaled exponential linear units paper (https://arxiv.org/abs/1706.02515) recommends on page 6 not to use standard dropout, since the extra variance hinders convergence when relying on self-normalization. We did observe some convergence issues when exploring the hyperparameter space, although with optimal model configurations the training procedure was stable.

One thing to keep in mind is that the two best regularization techniques we found in our experiments are early stopping and ensembling. Since ensembling boosts accuracy through the diversity and variance of the models, the interaction of AlphaDropout with the ensemble might be interesting to explore. Still, we will try AlphaDropout regularization to test the SELU paper's recommendation in this regression setting.
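For illustration only (not the repo's actual ensembling code), combining forecasts from independently trained models could look like:

import torch as t

def ensemble_forecast(models, x):
    # Median-combine the point forecasts of independently trained models;
    # a simple mean is a common alternative.
    with t.no_grad():
        forecasts = t.stack([m(x) for m in models], dim=0)
    return forecasts.median(dim=0).values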