bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

TF-Model Architecture #2

Closed harm-devries closed 1 year ago

harm-devries commented 2 years ago

Since we do not have much time to experiment with all these components, we have to make decisions based on the current literature.

harm-devries commented 2 years ago

Comment by @dpfried on positional encodings and FIM:

The FIM paper used relative attention: it cites Shaw et al. (2018) and Dai et al. (2019) in Appendix A. In Section 8.2, the authors say "Preliminary results, not reported here, indicate that the FIM-for-free property still holds with absolute positional embedding". InCoder (which had a FIM-like loss) used learned absolute positional embeddings.
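For concreteness, here is a minimal sketch of what learned absolute positional embeddings look like (one trained vector per position, added to the token embedding, GPT-2/InCoder style). Module and argument names are illustrative, not taken from this codebase:

```python
import torch
import torch.nn as nn

class EmbeddingWithLearnedPositions(nn.Module):
    """Token embedding + learned absolute positional embedding.

    Illustrative only: names and sizes are assumptions, not the
    Megatron-LM implementation.
    """

    def __init__(self, vocab_size: int, max_seq_len: int, hidden_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        # One learned vector per absolute position 0 .. max_seq_len - 1.
        self.pos_emb = nn.Embedding(max_seq_len, hidden_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len]
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        # Broadcast the positional embeddings over the batch dimension.
        return self.token_emb(token_ids) + self.pos_emb(positions)
```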

Experiment suggested by @dpfried to de-risk FIM:

Run a 350M-parameter model with FIM + learned absolute embeddings on the Python subset and see whether it reproduces the results of previously trained non-FIM models.
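For reference, a rough sketch of the document-level FIM transform such a run would apply to a fraction of training documents, following Bavarian et al.: split each document at two random points and reorder it as prefix/suffix/middle around sentinel tokens. The sentinel strings and the `fim_rate` argument below are placeholders, not the tokens or flags used in this repo:

```python
import random

# Placeholder sentinel strings; the actual special tokens are defined
# by the tokenizer used for training.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def maybe_apply_fim(document: str, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, rewrite a document in PSM
    (prefix-suffix-middle) order so the model learns to infill."""
    if len(document) < 2 or random.random() >= fim_rate:
        return document  # keep the document in plain left-to-right order

    # Pick two character-level split points and cut the document into three spans.
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]

    # PSM ordering: the model sees prefix and suffix, then predicts the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```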

harm-devries commented 2 years ago

On Slack, @dpfried and @Stanislas0 discussed multi-lingual vs. mono-lingual models by interpreting the findings from this AWS paper. Multi-lingual models outperform their mono-lingual counterparts beyond 3B parameters (though the gains vary significantly per programming language). One caveat is that the multi-lingual model was trained for much longer, so it is unclear how these models would compare under a similar compute budget.