This PR adds multiple standard model weight initialization options, including defaults such as normal and Kaiming, plus fan-in, Megatron, and Mitchell. Mitchell init is used by OLMo, and Megatron init is used by models like Llama 2 and (I believe) Pythia.
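As a rough sketch of what these options look like in practice, here is one way the schemes could dispatch on a linear weight. The constants (the 0.02 base std, the 1/sqrt(2·num_layers) residual scaling for Megatron, the 1/sqrt(2·d_model) width scaling for Mitchell) are the commonly cited values, not necessarily the exact ones this PR uses, and the function name `init_linear` is hypothetical:

```python
import math

import torch
import torch.nn as nn


def init_linear(weight: torch.Tensor, scheme: str, d_model: int, num_layers: int = 1) -> None:
    """Initialize a linear weight in place under a named scheme (illustrative constants)."""
    fan_in = weight.shape[1]
    if scheme == "normal":
        # Plain normal init with a fixed std, a common transformer default.
        nn.init.normal_(weight, mean=0.0, std=0.02)
    elif scheme == "kaiming":
        nn.init.kaiming_normal_(weight, nonlinearity="relu")
    elif scheme == "fan_in":
        # Scale std by the inverse square root of the input dimension.
        nn.init.normal_(weight, mean=0.0, std=1.0 / math.sqrt(fan_in))
    elif scheme == "megatron":
        # Megatron-style: base std rescaled by 1/sqrt(2 * num_layers),
        # typically applied to the residual-output projections.
        nn.init.normal_(weight, mean=0.0, std=0.02 / math.sqrt(2 * num_layers))
    elif scheme == "mitchell":
        # Mitchell-style (as used by OLMo): std tied to model width.
        nn.init.normal_(weight, mean=0.0, std=1.0 / math.sqrt(2 * d_model))
    else:
        raise ValueError(f"unknown init scheme: {scheme}")
```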
I also added support for RWKV's small embedding initialization, which looks like a useful way to speed up the initial training of the embedding weights. (It is not compatible with Megatron init.)
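The RWKV trick is to start the embedding table near zero and follow the lookup with a LayerNorm, so early gradients reshape the embeddings quickly. A minimal sketch, assuming the commonly cited uniform(±1e-4) range rather than this PR's exact constant, with `SmallInitEmbedding` as a hypothetical name:

```python
import torch
import torch.nn as nn


class SmallInitEmbedding(nn.Module):
    """RWKV-style small embedding init: tiny uniform weights plus LayerNorm after lookup."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        # Start the table near zero; the range is the commonly cited RWKV value.
        nn.init.uniform_(self.emb.weight, a=-1e-4, b=1e-4)
        # The LayerNorm rescales the tiny embeddings to a usable magnitude.
        self.ln = nn.LayerNorm(d_model)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.ln(self.emb(idx))
```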
Also, model.py apparently wasn't ruff formatted before, but it is now.