bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

TF-Model Architecture #2

Closed harm-devries closed 1 year ago

harm-devries commented 2 years ago

Since we do not have much time to experiment with all these components, we have to make decisions based on the current literature.

harm-devries commented 2 years ago

Comment by @dpfried on positional encodings and FIM:

The FIM paper used relative attention: it cites Shaw et al. (2018) and Dai et al. (2019) in Appendix A. In Section 8.2, the authors say "Preliminary results, not reported here, indicate that the FIM-for-free property still holds with absolute positional embedding". InCoder (which had a FIM-like loss) used learned absolute positional embeddings.
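For concreteness, here is a minimal sketch of what learned absolute positional embeddings look like (one trained vector per position, added to the token embedding, GPT-2/InCoder style). Module and argument names are illustrative, not taken from this codebase:

```python
import torch
import torch.nn as nn

class EmbeddingWithLearnedPositions(nn.Module):
    """Token embedding + learned absolute positional embedding.

    Illustrative only: names and sizes are assumptions, not the
    Megatron-LM implementation.
    """

    def __init__(self, vocab_size: int, max_seq_len: int, hidden_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        # One learned vector per absolute position 0 .. max_seq_len - 1.
        self.pos_emb = nn.Embedding(max_seq_len, hidden_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len]
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        # Broadcast the positional embeddings over the batch dimension.
        return self.token_emb(token_ids) + self.pos_emb(positions)
```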

Experiment suggested by @dpfried to de-risk FIM:

Run a 350M-parameter model with FIM + learned absolute embeddings on the Python subset and see whether it reproduces the results of previously trained non-FIM models.
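For reference, a rough sketch of the document-level FIM transform such a run would apply to a fraction of training documents, following Bavarian et al.: split each document at two random points and reorder it as prefix/suffix/middle around sentinel tokens. The sentinel strings and the `fim_rate` argument below are placeholders, not the tokens or flags used in this repo:

```python
import random

# Placeholder sentinel strings; the actual special tokens are defined
# by the tokenizer used for training.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def maybe_apply_fim(document: str, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, rewrite a document in PSM
    (prefix-suffix-middle) order so the model learns to infill."""
    if len(document) < 2 or random.random() >= fim_rate:
        return document  # keep the document in plain left-to-right order

    # Pick two character-level split points and cut the document into three spans.
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]

    # PSM ordering: the model sees prefix and suffix, then predicts the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```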

harm-devries commented 2 years ago

On Slack, @dpfried and @Stanislas0 discussed multi-lingual vs. mono-lingual models by interpreting the findings from this AWS paper. Multi-lingual models outperform their mono-lingual counterparts beyond 3B parameters (though the gains vary significantly per programming language). One caveat is that the multi-lingual model was trained for much longer, so it is unclear how these models would compare under a similar compute budget.