This PR further stabilises the codebase and makes training more robust, in particular against loss spikes, which we fixed via scaled weight initialisation and an increased batch size in our experiments.
The PR also fixes all failing tests and adds a simple entrypoint for running cpu, single-gpu and multi-gpu tests. The PR contains multiple sub-PRs.
General changes:
- Bug fix: The model's evaluation mode is now properly deactivated after evaluation (see PR #131)
- Bug fix: Fixed the implementation of Pre-LN for the GPT2 model (see PR #136)
- Enhancement: Further mixed-precision strategies; also added one matching MegatronLM's.
- Enhancement: Single, unified entrypoint for running cpu, single-gpu and multi-gpu tests. All tests fixed. (PR #155)
- Enhancement: Previously, we chunked the dataset into `block_size`-long chunks, and each chunk was then used for training individually. As a result, the last token of a block was only ever used as a target, never as an input. We changed this so that the last token of a batch is reused as the first token of the subsequent batch. (PR #158)
- Bug fix: Indexing of the original samples in the dataset pbin files had multiple bugs. The index tuples are now always in bytes, and the first sample in the data section now starts at byte 0 (previously there was a wrong offset). (PR #164)
- Enhancement: Improvements to the current pull request template and addition of several issue templates (bug report, documentation, feature request, blank) (PR #172)
- Components and factories for plain, scaled and scaled_embed initialisation. (PR #161)
- In GPT2 model training configs, the standard deviation `std` can now be set to the string `auto` (in which case it will equal `sqrt(2/(5*hidden_dim))`, see e.g. https://arxiv.org/abs/2312.16903) (PR #161)
- The CoCa model, which previously used a hardcoded (and probably not entirely correct) scaled initialization (see #165), can now only use plain initialization (PR #161)
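The last-token reuse from the chunking change above (PR #158) can be sketched as follows; the function and variable names here are illustrative, not the actual implementation:

```python
def chunk_with_overlap(token_ids, block_size):
    """Split a token stream into blocks of length block_size,
    reusing the last token of each block as the first token of
    the next block, so every token can serve as an input."""
    chunks = []
    start = 0
    # advance by block_size - 1 so the last token overlaps
    while start + block_size <= len(token_ids):
        chunks.append(token_ids[start:start + block_size])
        start += block_size - 1
    return chunks

tokens = list(range(10))
print(chunk_with_overlap(tokens, 4))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk's last token reappears as the next chunk's first token, so it is used both as a target and as an input.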
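The `auto` standard deviation from PR #161 follows the formula quoted above; a minimal sketch:

```python
import math

def auto_std(hidden_dim: int) -> float:
    """std = sqrt(2 / (5 * hidden_dim)), the value used when the
    config sets std to the string "auto"."""
    return math.sqrt(2 / (5 * hidden_dim))

print(round(auto_std(768), 5))  # hidden_dim of GPT2-small
# → 0.02282
```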
Breaking changes:
- Enhancement: Logging is now always based on #training steps and #consumed tokens (PR #137)
  This is a breaking change and the experiment configs need to be adapted as shown here.
- Enhancement: The model parameters are now grouped within the respective model. The optimizer can leverage these groups to, e.g., apply weight decay only to non-layer-norm weights. See here for the necessary config changes. (PR #139)
- Enhancement: We now support different attention implementations (manual, pytorch flash, DAO flash). See here for the respective config changes. (PR #138)
- Enhancement: Replaced `block_size` in Dataset, Model and NumberConversion with `sequence_length` (PR #158)
- Enhancement: `block_size` is now `sequence_length + 1`, and `sequence_length` should always be specified as a power of 2. (PR #158)
- Enhancement: Restricted the codebase to the officially supported Python versions 3.10 and 3.11 (PR #174)
- All training configs require an additional component for initialization of the raw model (i.e. the model with random weights), as shown here. (PR #161)
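The parameter grouping from PR #139 can be sketched as a name-based split into optimizer groups; the matching rule below is illustrative, the actual grouping lives in the respective model:

```python
def group_parameters(named_params, weight_decay=0.1):
    """Split parameters into two optimizer groups so that weight
    decay is applied only to non-layer-norm, non-bias weights."""
    decay, no_decay = [], []
    for name, param in named_params:
        # hypothetical matching rule: exempt norm weights and biases
        if "norm" in name or name.endswith("bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

params = [("attn.weight", 1), ("attn.bias", 2), ("layer_norm.weight", 3)]
groups = group_parameters(params)
```

Such groups can be passed directly to a torch optimizer, which accepts a list of per-group option dicts in place of a flat parameter list.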
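The reason `block_size` is `sequence_length + 1` (PR #158) is the shift-by-one between inputs and targets: a block of `sequence_length + 1` tokens yields exactly `sequence_length` (input, target) pairs. A minimal sketch:

```python
def inputs_and_targets(block):
    """A block of block_size = sequence_length + 1 tokens yields
    sequence_length (input, target) pairs via a shift by one."""
    return block[:-1], block[1:]

sequence_length = 4                       # ideally a power of 2
block = list(range(sequence_length + 1))  # block_size tokens
x, y = inputs_and_targets(block)
# x = [0, 1, 2, 3], y = [1, 2, 3, 4]
```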
Checklist before submitting final PR
- [ ] My PR is minimal and addresses one issue / enhancement in isolation
- [x] I have merged main into this feature branch
- [x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
- [x] I have run a sample config for model training
- [x] I have fixed all failing tests (`python tests/tests.py`)