What does this PR do?
This PR implements the following weight initializations (see https://arxiv.org/abs/2312.16903):
plain
scaled (= same as plain, but narrower distribution for projection weights W0 & W2)
scaled embed (= same as scaled, but wider distribution for embedding)
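As a rough sketch of how the three styles relate (the function and key names below are illustrative assumptions, not this repo's actual component API):

```python
import math

def make_init_stds(hidden_dim: int, num_layers: int, style: str = "plain"):
    """Illustrative sketch: per-parameter-group standard deviations
    for the plain / scaled / scaled_embed initialization styles."""
    base = math.sqrt(2 / (5 * hidden_dim))  # "small init" std, arXiv:2312.16903
    stds = {"default": base, "projection": base, "embedding": base}
    if style in ("scaled", "scaled_embed"):
        # scaled: narrower distribution for the projection weights W0 & W2
        stds["projection"] = base / math.sqrt(2 * num_layers)
    if style == "scaled_embed":
        # scaled_embed: wider distribution for the embedding
        # (the exact widening factor here is an assumption for illustration)
        stds["embedding"] = base * math.sqrt(hidden_dim)
    return stds
```

For example, with `style="scaled"` the projection std shrinks by `sqrt(2 * num_layers)` relative to the default, while all other weights keep the plain std.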
A weight initialization component is introduced that modifies the model weights in place (see #168 for more details)
General Changes
Components and factories for plain, scaled and scaled_embed initialization.
In GPT2 model training configs, the standard deviation `std` can now be set to the string `auto`, in which case it will equal `sqrt(2/(5*hidden_dim))` (see e.g. https://arxiv.org/abs/2312.16903)
The CoCa model, which previously used a hardcoded (and probably not entirely correct) scaled initialization (see #165), can now only use plain initialization
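To make the `auto` option concrete, a minimal sketch of how such a config value could be resolved (the helper name is an assumption, not the repo's actual code):

```python
import math

def resolve_std(std, hidden_dim: int) -> float:
    """Sketch: resolve the config's std field; the string 'auto'
    maps to sqrt(2/(5*hidden_dim)), any other value is used as-is."""
    if std == "auto":
        return math.sqrt(2 / (5 * hidden_dim))
    return float(std)

# For a GPT2-small sized model (hidden_dim=768) this lands close to
# the classic 0.02 used by GPT-2:
print(round(resolve_std("auto", 768), 4))  # → 0.0228
```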
Breaking Changes
All training configs require an additional component for initialization of the raw model (i.e. the model with random weights), as shown here.
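A purely hypothetical sketch of what such a config entry might look like (the key names are illustrative only; see the linked example config for the real schema):

```yaml
# Hypothetical shape, not the actual schema:
model_initialization:
  style: scaled   # one of: plain, scaled, scaled_embed
  std: auto       # resolves to sqrt(2/(5*hidden_dim))
```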
Checklist before submitting final PR
[x] My PR is minimal and addresses one issue / enhancement in isolation
[x] I have merged the latest version of the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[x] I have run a sample config for model training
[x] I have fixed all failing tests (python tests/tests.py)