Closed PkuRainBow closed 6 years ago
Because it makes the output of the block always zero (on the first batch, before the parameters are updated). In this way, it can be inserted into any existing architecture without affecting the architecture's original outputs. You can find this in Section 3.3 of the paper, which says:
The residual connection allows us to insert a new non-local block into any pre-trained model, without breaking its initial behavior (e.g., if Wz is initialized as zero).
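A minimal sketch of the idea, assuming a simplified PyTorch non-local block (layer names like `theta`, `phi`, `g`, and `W` are illustrative, not the exact code from this repo): zero-initializing the final projection `W` makes the residual branch output zero, so the block is an identity mapping at initialization.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Simplified non-local block with a residual connection z = x + W(y)."""
    def __init__(self, channels):
        super().__init__()
        # Embedding transforms (1x1 convolutions)
        self.theta = nn.Conv2d(channels, channels // 2, 1)
        self.phi = nn.Conv2d(channels, channels // 2, 1)
        self.g = nn.Conv2d(channels, channels // 2, 1)
        # Final projection back to the input channel count
        self.W = nn.Conv2d(channels // 2, channels, 1)
        # Zero-init weight AND bias, so W(y) == 0 before any training step
        nn.init.zeros_(self.W.weight)
        nn.init.zeros_(self.W.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)  # (b, hw, c/2)
        phi = self.phi(x).flatten(2)                      # (b, c/2, hw)
        g = self.g(x).flatten(2).transpose(1, 2)          # (b, hw, c/2)
        attn = torch.softmax(theta @ phi, dim=-1)         # (b, hw, hw)
        y = (attn @ g).transpose(1, 2).reshape(b, c // 2, h, w)
        # At initialization self.W(y) is all zeros, so the block returns x
        return x + self.W(y)
```

You can check the identity behavior directly: `block(x)` equals `x` before any optimizer step, so dropping the block into a pre-trained network leaves its predictions unchanged until training starts adapting `W`.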
I just cannot figure out why the weights and biases within self.W are initialized to zero.