Closed le1nux closed 2 days ago
We also need to add the SwiGLU layers to the parameter name filters used for weight initialization: https://github.com/Modalities/modalities/blob/5a2727fe3004c1e0739d23a733254f67c8ffdbd4/src/modalities/nn/model_initialization/parameter_name_filters.py#L36
The current SwiGLU implementation names its projection matrices differently from the original paper (https://arxiv.org/pdf/2002.05202). We should stick to the paper's `W, V, W_2` names. The projection name `c_proj` in SwiGLU is the same as the name of a projection in GeLU, which has already led to side effects for weight initialisation (see the comments in PR #168): https://github.com/Modalities/modalities/blob/f810fcce978e2f4fc577edf337835b6f4afa8aa9/src/modalities/models/model.py#L30C6-L45C10
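
To illustrate the collision, here is a minimal sketch using only Python's `re` module. The parameter paths, the regex patterns, and the `select` helper are all hypothetical and are not taken from the Modalities code base; the sketch only assumes that initialization filters match on parameter names.

```python
import re

# All parameter names below are hypothetical and only illustrate the naming issue.
GELU_PARAMS = [
    "transformer.h.0.mlp.c_fc.weight",
    "transformer.h.0.mlp.c_proj.weight",
]
# SwiGLU with an output projection that reuses the GeLU name `c_proj`.
SWIGLU_PARAMS_COLLIDING = [
    "transformer.h.0.mlp.W.weight",
    "transformer.h.0.mlp.V.weight",
    "transformer.h.0.mlp.c_proj.weight",
]
# SwiGLU with the names used in the paper: W, V, W_2.
SWIGLU_PARAMS_PAPER = [
    "transformer.h.0.mlp.W.weight",
    "transformer.h.0.mlp.V.weight",
    "transformer.h.0.mlp.W_2.weight",
]

# Hypothetical filters: one meant for the GeLU output projection only,
# one meant for the three SwiGLU projections.
GELU_PROJ_FILTER = r"\.c_proj\.weight$"
SWIGLU_FILTER = r"\.(W|V|W_2)\.weight$"


def select(pattern: str, names: list[str]) -> list[str]:
    """Return the parameter names that match the given regex filter."""
    return [name for name in names if re.search(pattern, name)]


# With the shared name, the GeLU filter also picks up a SwiGLU parameter.
colliding_hits = select(GELU_PROJ_FILTER, SWIGLU_PARAMS_COLLIDING)

# With the paper's names, the GeLU filter no longer touches SwiGLU,
# and a dedicated filter selects exactly the three SwiGLU projections.
paper_hits_gelu_filter = select(GELU_PROJ_FILTER, SWIGLU_PARAMS_PAPER)
paper_hits_swiglu_filter = select(SWIGLU_FILTER, SWIGLU_PARAMS_PAPER)
```

With distinct names, each activation type can get its own filter entry, and an initialization rule written for GeLU cannot silently apply to SwiGLU.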