Closed: w-hc closed this issue 6 years ago
ReLU should work just as well. I chose PReLU because its outputs serve better for visualization. And personally, I like the idea of a learnable slope in the activation function (although various papers show it is not always better).
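To make the "learnable slope" concrete: PReLU computes `f(x) = x` for `x >= 0` and `f(x) = a * x` for `x < 0`, where `a` is a parameter trained by backprop. A minimal demonstration (values below follow from PyTorch's default slope init of 0.25):

```python
import torch
import torch.nn as nn

# PReLU: identity for x >= 0, learnable slope `a` for x < 0.
prelu = nn.PReLU()  # default: one shared slope, initialized to 0.25

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
out = prelu(x)  # negatives scaled by a=0.25 -> [-0.5, -0.125, 0.0, 1.0, 3.0]

# The slope is a regular parameter, so the optimizer updates it
# along with the rest of the network.
print(prelu.weight.requires_grad)  # True
```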
But I tried replacing PReLU with ReLU, keeping the rest of the config unchanged, and training does not converge. The default setup seems reasonable otherwise, so this is a little surprising.
For training with ReLU, you need to initialize the convolutional layers more carefully. For example, you can use Kaiming initialization with the gain for the ReLU nonlinearity. You can do it with these lines in the initialization of EmbeddingNet:
```python
for m in self.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
```
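For context, here is a sketch of where that loop could live, assuming a simple MNIST EmbeddingNet with a 2-d output; the layer sizes are illustrative, not necessarily the repo's exact architecture:

```python
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """MNIST embedding net with an assumed architecture, for illustration."""

    def __init__(self):
        super().__init__()
        self.convnet = nn.Sequential(
            nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2, stride=2),
            nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2, stride=2))
        self.fc = nn.Sequential(
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 2))  # 2-d embedding for visualization
        # Kaiming (He) normal init with gain scaled for ReLU,
        # as suggested above.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out',
                                        nonlinearity='relu')

    def forward(self, x):
        out = self.convnet(x)
        out = out.view(out.size(0), -1)  # flatten to (batch, 64*4*4)
        return self.fc(out)
```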
@w-hc did you try training with this change?
I will try in a minute. Sorry.
No, it still does not work. The basic classification training loss gets stuck at 2.3, i.e. random guessing (cross-entropy over 10 classes is ln 10 ≈ 2.3 for a uniform prediction). I don't think initialization can play such a huge role. It might really have to do with the fact that the final output is 2-dimensional, and in this case throwing away 3/4 of the space is too difficult for the network.
Update:
Lowering the starting learning rate to 1e-3 helps make a pure-ReLU network trainable. It still converges to a noticeably worse loss than the PReLU version. I should have spent a little more time on hyper-parameter search. The embedding looks as expected: 10 beams squeezed into the first quadrant.
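For reference, a sketch of the lowered learning-rate setup; the optimizer choice, scheduler, and stand-in model here are assumptions, not the repo's exact config:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 10)  # stand-in for the actual network

# Starting lr lowered to 1e-3, which was enough to get the
# all-ReLU network to train at all.
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.1)
```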
Using vanilla ReLU for all layers and simply changing the nonlinearity in the classification net to a PReLU, with default initialization, makes the network converge faster and yields good-looking embeddings. Constraining them to a single quadrant is really too difficult.
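That hybrid setup can be sketched as follows; the layer sizes are assumed for illustration, and the key point is that only the head's nonlinearity is PReLU, so the 2-d embedding is not forced into the first quadrant:

```python
import torch.nn as nn

# Vanilla ReLU throughout the embedding net (assumed architecture).
embedding_net = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
    nn.Linear(256, 2),  # 2-d embedding for visualization
)

# Classification head: PReLU's learnable slope lets negative
# embedding coordinates carry information instead of being zeroed.
classifier = nn.Sequential(
    embedding_net,
    nn.PReLU(),
    nn.Linear(2, 10),  # 10 MNIST classes
)
```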
Hi, I noticed that you used parametric ReLU (PReLU) for these experiments, and I tried replacing it with vanilla ReLU. It turns out that even for the simple MNIST classification network, training cannot progress at all. It is true that ReLU imposes the additional constraint that only the positive quadrant of the embedding space can be used, but it still surprises me that the training loss stays at 2.3, which means the network is learning nothing.
Did you adopt PReLU as a conscious choice, and if so, what was the rationale? Thank you so much.