Closed: qinziqiao closed this issue 5 years ago
@qinziqiao Hi, you can find the reason in Section 4.1 of the paper, which says:
The scale parameter of this BN layer is initialized as zero, following [17]. This ensures that the initial state of the entire non-local block is an identity mapping, so it can be inserted into any pre-trained networks while maintaining its initial behavior.
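To make the quoted point concrete, here is a minimal pure-Python sketch. It is not the repository's code: `batch_norm`, `non_local_block`, and the linear stand-in for the non-local branch are hypothetical names made up for illustration. It assumes the block has the residual form y = x + BN(f(x)) and that the BN bias is also zero. With the BN scale at zero, the branch contributes nothing and the block returns its input unchanged. In PyTorch the equivalent initialization would be something like `nn.init.constant_(bn.weight, 0)`.

```python
import math

def batch_norm(values, weight, bias, eps=1e-5):
    """Normalize a 1-D batch, then apply the learnable scale (weight) and shift (bias)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [weight * (v - mean) / math.sqrt(var + eps) + bias for v in values]

def non_local_block(x, bn_weight, bn_bias=0.0):
    """Toy residual block: y = x + BN(f(x))."""
    # Hypothetical stand-in for the non-local branch; any function of x would do here.
    branch = [3.0 * v - 1.0 for v in x]
    z = batch_norm(branch, bn_weight, bn_bias)
    return [xi + zi for xi, zi in zip(x, z)]

x = [0.5, -1.0, 2.0, 4.0]
y_zero = non_local_block(x, bn_weight=0.0)  # scale initialized to zero
y_one = non_local_block(x, bn_weight=1.0)   # a nonzero scale, for comparison

print(y_zero == x)  # the block is an identity mapping when the scale is zero
print(y_one == x)   # with a nonzero scale the branch changes the output
```

Because the output equals the input at initialization, inserting such a block into a pre-trained network leaves its predictions unchanged until training moves the scale away from zero.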
Thanks for your reply. I was careless and overlooked this line.
Hi. I have a question: why is bn.weight initialized to zero?