adambielski / siamese-triplet

Siamese and triplet networks with online pair/triplet mining in PyTorch
BSD 3-Clause "New" or "Revised" License

Why vanilla ReLU cannot train at all. #17

Closed: w-hc closed this issue 6 years ago

w-hc commented 6 years ago

Hi, I noticed that you used parametric ReLU (PReLU) for these experiments, and I tried replacing it with vanilla ReLU. It turns out that even for the simple MNIST classification network, training cannot progress at all. It is true that ReLU imposes the additional constraint that only the positive quadrant of the embedding space can be used, but it still surprises me that the training loss stays at 2.3, which means the network is learning nothing.
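For reference, 2.3 is roughly ln 10 ≈ 2.303, the cross-entropy of a network that assigns equal probability to all 10 MNIST classes, i.e. random guessing. A quick sanity check (plain PyTorch, not code from this repo):

import math
import torch
import torch.nn.functional as F

# Uniform prediction over 10 classes: identical logits for every class
logits = torch.zeros(1, 10)
target = torch.tensor([3])  # any label gives the same loss
print(F.cross_entropy(logits, target).item())  # ~2.3026
print(math.log(10))                            # ~2.3026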

Did you adopt PReLU as a conscious choice, and what was the rationale? Thank you so much.

adambielski commented 6 years ago

ReLU should work just as well. I chose PReLU because its outputs make for better visualizations, and personally I like the idea of a learnable slope in the activation function (although various papers show it's not always better).
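For reference, PReLU computes f(x) = max(0, x) + a * min(0, x) with a learnable slope a, so negative inputs are scaled rather than zeroed. A minimal illustration with PyTorch's nn.PReLU (not code from this repo):

import torch
import torch.nn as nn

# One learnable slope shared across all channels, initialized to 0.25
prelu = nn.PReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(prelu(x))                  # negative values are scaled by the slope, not zeroed
print(list(prelu.parameters()))  # the learnable slope parameter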

w-hc commented 6 years ago

But I did try replacing PReLU with ReLU while keeping the rest of the configuration unchanged, and training does not converge. The default setup otherwise seems reasonable, so this is a little surprising.

adambielski commented 6 years ago

For training with ReLU you need to initialize the convolutional layers more carefully, e.g. with Kaiming initialization using the gain for the ReLU nonlinearity. You can do it with these lines in the initialization of EmbeddingNet:

# He/Kaiming-normal initialization with the gain for ReLU, applied to every conv layer
for m in self.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

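For context, here is a minimal sketch of how this loop could sit at the end of EmbeddingNet.__init__ with ReLU activations (the layer sizes are illustrative, not copied verbatim from the repo):

import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convnet = nn.Sequential(
            nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2, stride=2),
            nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2),  # 2D embedding for visualization
        )
        # He/Kaiming init for the conv layers, matched to the ReLU nonlinearity
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

    def forward(self, x):
        output = self.convnet(x)
        output = output.view(output.size(0), -1)
        return self.fc(output)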
adambielski commented 6 years ago

@w-hc did you try training with this change?

w-hc commented 6 years ago

I will try it in a minute, sorry.

w-hc commented 6 years ago

No, it still does not work. The basic classification training loss gets stuck at 2.3, i.e. random guessing. I don't think initialization alone can play such a huge role. It might really come down to the fact that the final output is 2-dimensional, and in that case throwing away 3/4 of the space (ReLU confines the embedding to the positive quadrant) is too difficult for the network.
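To make the geometric point concrete: if a ReLU acts on the 2-dimensional embedding, both coordinates are clamped to be non-negative, so every embedded point lands in the first quadrant. A toy check (a made-up 2D head, not the repo's exact network):

import torch
import torch.nn as nn

# A ReLU after the final 2D projection clamps both coordinates to >= 0,
# so the embedding can only occupy 1/4 of the plane.
head = nn.Sequential(nn.Linear(256, 2), nn.ReLU())
z = head(torch.randn(1000, 256))
print((z >= 0).all().item())  # True: the other three quadrants are unreachable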

Update:

  1. Lowering the starting learning rate to 1e-3 makes a pure-ReLU network trainable, although it still does not converge as well as with PReLU; I should have spent a little more time on hyper-parameter search. The embedding looks as expected: 10 beams squeezed into the first quadrant.

  2. Using vanilla ReLU for all layers and only changing the nonlinearity in the classification net to a PReLU, with default initialization, makes the network converge faster and produces good-looking embeddings (see the sketch after this list). Constraining the embedding to a single quadrant really is too difficult.
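A minimal sketch of point 2, assuming a classification head shaped like the repo's ClassificationNet (the exact sizes are illustrative): the embedding net keeps ReLU everywhere and only the head's nonlinearity is a PReLU.

import torch.nn as nn
import torch.nn.functional as F

class ClassificationNet(nn.Module):
    def __init__(self, embedding_net, n_classes=10):
        super().__init__()
        self.embedding_net = embedding_net  # the all-ReLU EmbeddingNet
        self.nonlinear = nn.PReLU()         # the single PReLU that restores fast convergence
        self.fc1 = nn.Linear(2, n_classes)  # classifies directly from the 2D embedding

    def forward(self, x):
        output = self.embedding_net(x)
        output = self.nonlinear(output)
        return F.log_softmax(self.fc1(output), dim=-1)

For point 1, the only change on top of this is the optimizer's starting learning rate, e.g. optim.Adam(model.parameters(), lr=1e-3) instead of the higher default starting value.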