What does this PR do?
This PR implements the following weight initializations (see https://arxiv.org/abs/2312.16903):
plain
scaled (= same as plain, but narrower distribution for projection weights W0 & W2)
scaled embed (= same as scaled, but wider distribution for embedding)
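As a rough sketch of how the three styles relate (the function and key names below are illustrative assumptions, not this repo's actual component API):

```python
import math

def make_init_stds(hidden_dim: int, num_layers: int, style: str = "plain"):
    """Illustrative sketch: per-parameter-group standard deviations
    for the plain / scaled / scaled_embed initialization styles."""
    base = math.sqrt(2 / (5 * hidden_dim))  # "small init" std, arXiv:2312.16903
    stds = {"default": base, "projection": base, "embedding": base}
    if style in ("scaled", "scaled_embed"):
        # scaled: narrower distribution for the projection weights W0 & W2
        stds["projection"] = base / math.sqrt(2 * num_layers)
    if style == "scaled_embed":
        # scaled_embed: wider distribution for the embedding
        # (the exact widening factor here is an assumption for illustration)
        stds["embedding"] = base * math.sqrt(hidden_dim)
    return stds
```

For example, with `style="scaled"` the projection std shrinks by `sqrt(2 * num_layers)` relative to the default, while all other weights keep the plain std.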
A weight initialization component is introduced that modifies the model weights in place (see #168 for more details)
General Changes
Components and factories for plain, scaled and scaled_embed initialization.
In GPT2 model training configs, the standard deviation `std` can now be set to the string `auto`, in which case it will equal `sqrt(2/(5*hidden_dim))` (see e.g. https://arxiv.org/abs/2312.16903)
The CoCa model, which previously used a hardcoded (and probably not entirely correct) scaled initialization (see #165), can now only use plain initialization
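To make the `auto` option concrete, a minimal sketch of how such a config value could be resolved (the helper name is an assumption, not the repo's actual code):

```python
import math

def resolve_std(std, hidden_dim: int) -> float:
    """Sketch: resolve the config's std field; the string 'auto'
    maps to sqrt(2/(5*hidden_dim)), any other value is used as-is."""
    if std == "auto":
        return math.sqrt(2 / (5 * hidden_dim))
    return float(std)

# For a GPT2-small sized model (hidden_dim=768) this lands close to
# the classic 0.02 used by GPT-2:
print(round(resolve_std("auto", 768), 4))  # → 0.0228
```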
Breaking Changes
All training configs require an additional component for initialization of the raw model (i.e. the model with random weights), as shown here.
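A purely hypothetical sketch of what such a config entry might look like (the key names are illustrative only; see the linked example config for the real schema):

```yaml
# Hypothetical shape, not the actual schema:
model_initialization:
  style: scaled   # one of: plain, scaled, scaled_embed
  std: auto       # resolves to sqrt(2/(5*hidden_dim))
```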
Checklist before submitting final PR
[x] My PR is minimal and addresses one issue / enhancement in isolation
[x] I have merged the latest version of the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[x] I have run a sample config for model training
[x] I have fixed all failing tests (python tests/tests.py)