Hello Facebook Research Team,
I am exploring DiT as implemented in your repository and came across the weight initialization strategy for the `FinalLayer`, in particular this section of the code.
The weights (and bias) of the linear layer in `FinalLayer` are initialized to zero:
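For context, a minimal sketch of the initialization I mean (a hypothetical stand-in, not the repo code verbatim; the layer dimensions here are illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the linear layer inside DiT's FinalLayer.
# Dimensions are made up for illustration.
final_linear = nn.Linear(1152, 8 * 8 * 4)

# The initialization in question: both weight and bias set to zero.
nn.init.constant_(final_linear.weight, 0)
nn.init.constant_(final_linear.bias, 0)

x = torch.randn(4, 1152)
out = final_linear(x)
print(out.abs().max().item())  # 0.0 — the layer's initial output is exactly zero
```

As a consequence, the model's prediction at initialization is identically zero regardless of the input.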
Typically, neural network weights are initialized with non-zero random values to break symmetry and ensure diverse feature learning. While I understand the rationale behind zero-initializing the modulation weights elsewhere in the model, the zero initialization of this linear layer caught my attention.
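For what it's worth, I did check that the classic symmetry problem does not arise here: since the layers feeding into the head have standard non-zero initialization, their activations differ across inputs and output units, so the zero-initialized head still receives non-degenerate gradients. A small sketch (hypothetical layer names and sizes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

backbone = nn.Linear(8, 8)   # earlier layer, standard (non-zero) init
head = nn.Linear(8, 4)       # zero-initialized output layer
nn.init.zeros_(head.weight)
nn.init.zeros_(head.bias)

x = torch.randn(16, 8)
target = torch.randn(16, 4)
loss = (head(torch.relu(backbone(x))) - target).pow(2).mean()
loss.backward()

# The head's gradient is the outer product of the per-sample error and the
# backbone features; both vary, so the gradient is non-zero and its rows
# decouple after the first update step.
print(head.weight.grad.abs().sum().item() > 0)  # True
```

So zero initialization of the final layer alone does not freeze training; my question is purely about the motivation for the choice.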
Is the zero initialization of the weights in this non-modulation linear layer intentional, and could you share any insight into this choice?
Thank you for any information or insights you can provide!
Best regards, Danil.