RMS norm implementation #67

Closed le1nux closed 5 months ago

le1nux commented 6 months ago

Originally, each attention block instantiated its own layer norm internally: https://github.com/Modalities/modalities/blob/dd0db07bbe631e9dc30f35912076d26603f4a6b7/src/modalities/models/gpt2/gpt2_model.py#L193

For every new layer norm type, we would have had to add an if-clause to select which layer norm to instantiate. As a workaround, we now pass the layer norm object into the GPT2 model from outside and copy it in every attention block. Note that we override the copy function in the layer norm implementations, as sketched below.
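
For illustration, here is a minimal sketch of what such a copyable norm and the per-block copying could look like. `RMSNorm`, `AttentionBlock`, and `GPT2Model` below are simplified stand-ins for this issue, not the actual modalities classes:

```python
import copy

import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Minimal RMS norm. Overriding __copy__/__deepcopy__ lets the model
    create a fresh, independently parameterized instance per block."""

    def __init__(self, ndim: int, epsilon: float = 1e-5):
        super().__init__()
        self.ndim = ndim
        self.epsilon = epsilon
        self.weight = nn.Parameter(torch.ones(ndim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / RMS(x), scaled by a learned per-feature weight.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.epsilon)
        return x / rms * self.weight

    def __copy__(self) -> "RMSNorm":
        # Re-instantiate instead of duplicating tensor state, so every copy
        # starts from a fresh initialization and shares no parameters.
        return RMSNorm(ndim=self.ndim, epsilon=self.epsilon)

    def __deepcopy__(self, memo) -> "RMSNorm":
        return self.__copy__()


class AttentionBlock(nn.Module):
    def __init__(self, norm: nn.Module):
        super().__init__()
        # Copy the externally provided norm so blocks do not share parameters.
        self.norm = copy.deepcopy(norm)


class GPT2Model(nn.Module):
    def __init__(self, norm: nn.Module, n_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList(AttentionBlock(norm) for _ in range(n_layers))


# The norm is configured once and copied into each of the 12 blocks.
model = GPT2Model(norm=RMSNorm(ndim=768), n_layers=12)
```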

In the future, it would make sense to support instantiating lists of components. For instance, a GPTModel would have a dependency on a list of attention blocks. We would specify a single attention block and instantiate it n times (see num_instances in the YAML below). Each attention block would then have its own dependency on a layer norm, which would no longer need to be copied internally.

Here is an example configuration:

```yaml
model:
  component_key: model
  variant_key: gpt2
  config:
    [...]
    attention_blocks:
      component_key: attention_block
      variant_key: gpt2_attention_block
      num_instances: 12
      config:
        n_embd: 768
        dropout: 0.0
        scaling_factor: 3
        [...]
```
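
A rough sketch of how a component registry could expand `num_instances` into a list. `REGISTRY` and `build_component` are hypothetical names used for illustration here, not the existing modalities resolver, and the registered builder is a dummy stand-in:

```python
from typing import Any, Callable

# Hypothetical registry mapping (component_key, variant_key) to a builder
# callable; the real modalities component resolver may look different.
REGISTRY: dict[tuple[str, str], Callable[..., Any]] = {
    # Dummy builder that just records its config, standing in for a real
    # attention block constructor.
    ("attention_block", "gpt2_attention_block"): lambda **cfg: dict(cfg),
}


def build_component(spec: dict[str, Any]) -> Any:
    """Instantiate a component spec. If num_instances is given, return a
    list of independently constructed instances instead of a single one."""
    builder = REGISTRY[(spec["component_key"], spec["variant_key"])]
    config = spec.get("config", {})
    if "num_instances" in spec:
        # Build each instance separately, so nested dependencies (e.g. the
        # layer norm) are constructed fresh per block -- no copying needed.
        return [builder(**config) for _ in range(spec["num_instances"])]
    return builder(**config)


blocks = build_component(
    {
        "component_key": "attention_block",
        "variant_key": "gpt2_attention_block",
        "num_instances": 12,
        "config": {"n_embd": 768, "dropout": 0.0, "scaling_factor": 3},
    }
)
assert len(blocks) == 12
```

Because every instance is built from the spec rather than copied from a prototype, the overridden copy functions in the layer norm implementations would become unnecessary.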