What does this PR do?

This PR introduces the components for weight initialisation and builds on PR #161. PR #161 implemented the initialisation methods plain, scaled and scaled_embed (see https://arxiv.org/abs/2312.16903) and added them to the abstract NNModel class. Because of design concerns (e.g., the parent class called some GPT2 internals), we decided to introduce a dedicated weight initialisation component that modifies the model weights in place.

General changes

Components and factories for plain, scaled and scaled_embed initialisation.
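
To illustrate the idea of an initialiser that lives outside the model and modifies weights in place, here is a minimal, framework-free sketch. It is not the code in this PR: the function name `initialise_in_place`, the dict-of-lists stand-in for model parameters, the GPT2-style parameter names in the regexes, and all standard deviations are assumptions made for illustration. The sketch also assumes that scaled re-draws the output projections with `std / sqrt(2 * num_layers)` and that scaled_embed additionally re-draws the embeddings with a separate standard deviation.

```python
import math
import random
import re
from typing import Dict, List, Optional

# Hypothetical stand-in for a model: parameter name -> flat list of weights.
Params = Dict[str, List[float]]

# Assumed GPT2-style parameter names; the real names depend on the model.
PROJECTION_REGEX = r".*\.c_proj\.weight"
EMBEDDING_REGEX = r".*\.(wte|wpe)\.weight"


def _fill_normal(values: List[float], std: float, rng: random.Random) -> None:
    """Overwrite the list in place with samples from N(0, std^2)."""
    for i in range(len(values)):
        values[i] = rng.gauss(0.0, std)


def initialise_in_place(
    params: Params,
    mode: str,
    std: float,
    num_layers: int,
    embed_std: Optional[float] = None,
    seed: int = 0,
) -> None:
    """Initialise the weights in place; mode is plain, scaled or scaled_embed."""
    if mode not in ("plain", "scaled", "scaled_embed"):
        raise ValueError(f"unknown initialisation mode: {mode}")
    if mode == "scaled_embed" and embed_std is None:
        raise ValueError("scaled_embed requires embed_std")

    rng = random.Random(seed)

    # plain: every weight ~ N(0, std^2), every bias set to zero.
    for name, values in params.items():
        if name.endswith(".bias"):
            for i in range(len(values)):
                values[i] = 0.0
        else:
            _fill_normal(values, std, rng)

    # scaled: re-draw output projections with a depth-dependent, smaller std.
    if mode in ("scaled", "scaled_embed"):
        scaled_std = std / math.sqrt(2 * num_layers)
        for name, values in params.items():
            if re.fullmatch(PROJECTION_REGEX, name):
                _fill_normal(values, scaled_std, rng)

    # scaled_embed: additionally re-draw the embeddings with their own std.
    if mode == "scaled_embed":
        for name, values in params.items():
            if re.fullmatch(EMBEDDING_REGEX, name):
                _fill_normal(values, embed_std, rng)


# Usage: the "raw model" is passed to the initialiser, which mutates it.
toy_params: Params = {
    "transformer.wte.weight": [0.0] * 8,
    "transformer.h.0.attn.c_proj.weight": [0.0] * 8,
    "transformer.h.0.attn.c_proj.bias": [1.0] * 4,
}
initialise_in_place(toy_params, mode="scaled", std=0.02, num_layers=2)
```

Because the initialiser selects parameters by name pattern rather than through methods on the model, the model class does not need to know anything about the initialisation scheme, which is the design concern this PR addresses.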

Breaking Changes

The raw model (i.e., the model with random weights) must be initialised with a weight initialiser, as shown here.

Checklist before submitting final PR

[x] My PR is minimal and addresses one issue / enhancement in isolation
[x] I have merged the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[x] I have run a sample config for model training
[x] I have fixed all failing tests (python tests/tests.py)

Draft: Feat/initialization component #168