Closed (joeyhng closed 6 years ago)
There are a lot of Batch Normalization layers in the models, but the code only uses a batch size of 1. What role does BN play in the model? Have you tried training without BN, and how did it perform?

You are right that batch norm acts as Instance Normalization here because the batch size is 1. We used it because it allowed us to use larger learning rates. It also makes gradients flow equally into all branches of the network.
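A minimal sketch of this equivalence (assuming PyTorch, since the framework isn't stated in this thread): with a single sample, BatchNorm's per-channel statistics over (N, H, W) collapse to per-sample, per-channel statistics, which is exactly what InstanceNorm computes.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)  # batch size 1, 8 channels

bn = nn.BatchNorm2d(8)                      # default affine: gamma=1, beta=0
inorm = nn.InstanceNorm2d(8, affine=False)  # no affine, equivalent to gamma=1, beta=0

bn.train()  # in train mode, BN normalizes over (N, H, W) per channel;
            # with N=1 that reduces to InstanceNorm's per-sample statistics
print(torch.allclose(bn(x), inorm(x), atol=1e-5))  # True

# In eval mode the two diverge: BN switches to its running statistics,
# while InstanceNorm keeps normalizing with the current sample's stats.
```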
Thanks for your clarification!