facebookresearch / DiT

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Clarification on Zero Initialization in FinalLayer of DiT Model #82

Open denemmy opened 6 months ago

denemmy commented 6 months ago

Hello Facebook Research Team,

I am exploring DiT as implemented in your repository and came across the weight initialization strategy for the FinalLayer, specifically in this section of the code.

Both the weight and the bias of the linear layer in the FinalLayer are initialized to zero:

nn.init.constant_(self.final_layer.linear.weight, 0)
nn.init.constant_(self.final_layer.linear.bias, 0)
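For context, the surrounding module looks roughly like this (paraphrased from models.py, with the modulate helper inlined so the snippet is self-contained):

import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # adaLN modulation: per-sample scale and shift from the conditioning vector
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class FinalLayer(nn.Module):
    """The final layer of DiT: adaLN modulation followed by a linear projection."""
    def __init__(self, hidden_size, patch_size, out_channels):
        super().__init__()
        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
        self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_size, 2 * hidden_size, bias=True),
        )

    def forward(self, x, c):
        shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
        x = modulate(self.norm_final(x), shift, scale)
        return self.linear(x)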

Typically, neural network weights are initialized with non-zero values to break symmetry and ensure diverse feature learning. While I understand the rationale behind zero initialization of modulation weights in other parts of the model, the zero initialization in this linear layer caught my attention.

Is the zero initialization of weights in this non-modulation linear layer intentional, and could you provide any insights into this choice?

Thank you for any information or insights you can provide!

Best regards, Danil.

tanghengjian commented 5 months ago

Maybe zero initialization helps with the model's stability and reproducibility?

shy19960518 commented 4 months ago

Same confusion here. The most surprising part is that the model still learns well in my experiments. Can someone explain? ^ ^
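For what it's worth, a minimal sanity check (my own sketch, not code from the repo) suggests why learning still works: the gradient of a zero-initialized linear layer with respect to its own weights, dL/dW = grad_out^T @ x, depends on the nonzero input activations, so the weights become nonzero, and distinct per output unit, after the very first optimizer step. The classic symmetry-breaking argument only bites for hidden layers, where zero weights would also zero out the gradient flowing back to earlier layers.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the zero-initialized final projection (hypothetical sizes).
linear = nn.Linear(8, 4)
nn.init.constant_(linear.weight, 0)
nn.init.constant_(linear.bias, 0)

x = torch.randn(2, 8)        # pretend these are final-block features
target = torch.randn(2, 4)   # pretend noise-prediction target
loss = ((linear(x) - target) ** 2).mean()
loss.backward()

# dL/dW = grad_out^T @ x is nonzero, and its rows differ, so each
# output unit learns a distinct filter starting from the first step.
print(linear.weight.grad.abs().sum())                      # > 0
print(linear.weight.grad[0].equal(linear.weight.grad[1]))  # False

Presumably the design intent is the same as for the adaLN-Zero blocks: the network outputs exactly zero at initialization, and since the noise target has zero mean, that is a benign starting prediction that keeps early training stable.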