23Uday opened this issue 1 year ago
Hi, I am using your PyTorch implementation to train a rational ResNet-164 on CIFAR-10. While I can get the model to train well for ResNets with 18-38 layers, I cannot get it to train for very deep ResNets without dramatically lowering the learning rate. Here is one example with --lr 1e-6 --wd 1e-5:

Train Epoch: 0 [0/47500 (0%)] Loss: 2.517
Train Epoch: 0 [1920/47500 (4%)] Loss: nan

While I understand that the model with rational activations is supposed to represent a rational function of degree 3^(number of layers), the training process for deeper models isn't clear to me. Could you provide me with some help?

Thanks for your interest in our work. We haven't tried training very deep rational networks, so my intuition is limited here. There is a possibility that the weight initialization has a bad effect on the rational layers as the depth increases. One potential remedy would be to fine-tune a pretrained ReLU ResNet by replacing the activation functions with rationals and training only the rational functions.

I'm curious to see why the loss becomes NaN in your example. Perhaps you could plot the different rational functions (there should be approximately one function per layer) to see whether one of them becomes singular (i.e. develops a pole) and which layer is affected.

Finally, and depending on the result of the above suggestion, there could be some numerical instabilities due to having an overall rational network of very large degree (3^164). One could use rational functions for the first few layers only (like the 18-38 layers in your experiments, to benefit from the extra approximation power) and then use ReLU for the rest of the network.
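To make the fine-tuning suggestion concrete, here is a minimal PyTorch sketch, not the repo's actual API: it defines a stand-in type (3, 2) `Rational` activation, swaps it in for every `nn.ReLU` of a pretrained ResNet, and freezes everything except the rational coefficients. The `Rational` module, its placeholder initialization, and the use of `torchvision.models.resnet18` in place of ResNet-164 are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision


class Rational(nn.Module):
    """Stand-in type (3, 2) rational activation P(x) / Q(x) with learnable coefficients."""

    def __init__(self):
        super().__init__()
        # Placeholder initialization (identity function); in practice the
        # coefficients should be initialized so that P/Q approximates ReLU.
        self.p = nn.Parameter(torch.tensor([0.0, 1.0, 0.0, 0.0]))  # numerator, degree 3
        self.q = nn.Parameter(torch.tensor([0.0, 0.0]))            # denominator, degree 2

    def forward(self, x):
        num = self.p[0] + self.p[1] * x + self.p[2] * x ** 2 + self.p[3] * x ** 3
        den = 1.0 + self.q[0] * x + self.q[1] * x ** 2
        # Q(x) can vanish during training, which is exactly the kind of pole
        # that would produce a NaN loss; "safe" variants use 1 + |...| instead.
        return num / den


def replace_relu_with_rational(module: nn.Module) -> None:
    """Recursively swap every nn.ReLU submodule for a fresh Rational."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, Rational())
        else:
            replace_relu_with_rational(child)


# Stand-in for the pretrained ReLU ResNet-164; any pretrained ResNet works the same way.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
replace_relu_with_rational(model)

# Freeze the pretrained weights and train only the rational coefficients.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith((".p", ".q"))

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9
)
```

The same `replace_relu_with_rational` helper can also be applied to only the first few blocks (e.g. just `model.layer1`) to get the hybrid rational-then-ReLU network mentioned at the end of the reply.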
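For the pole-hunting suggestion, a small diagnostic sketch (again assuming the `Rational` stand-in and `model` from the snippet above, plus matplotlib): it sweeps each learned rational function over an input grid, plots them, and flags any layer whose denominator approaches zero or whose output is non-finite.

```python
import matplotlib.pyplot as plt
import torch

xs = torch.linspace(-5.0, 5.0, 1000)

with torch.no_grad():
    for name, module in model.named_modules():
        if isinstance(module, Rational):
            ys = module(xs)
            den = 1.0 + module.q[0] * xs + module.q[1] * xs ** 2
            # A denominator close to zero on the sweep range, or a non-finite
            # output, points at the layer that is blowing up.
            if den.abs().min().item() < 1e-3 or not torch.isfinite(ys).all():
                print(f"{name}: possible pole (min |Q(x)| = {den.abs().min().item():.2e})")
            plt.plot(xs.numpy(), ys.numpy(), label=name, linewidth=0.8)

plt.legend(fontsize=5)
plt.title("Learned rational activation per layer")
plt.savefig("rational_activations.png", dpi=200)
```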