Comment by @dpfried on positional encodings and FIM:
The FIM paper used relative attention; they cite Shaw et al. (2018) and Dai et al. (2019) in their Appendix A. In Section 8.2, they say: "Preliminary results, not reported here, indicate that the FIM-for-free property still holds with absolute positional embedding." InCoder (which had a FIM-like loss) used learned absolute positional embeddings.
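For reference, a minimal sketch of what "learned absolute positional embeddings" means here (GPT-2/InCoder-style), in PyTorch. The class and argument names are illustrative, not taken from any of the cited codebases:

```python
import torch
import torch.nn as nn

class LearnedAbsolutePositions(nn.Module):
    """Learned absolute positional embeddings: one trainable vector per
    position, added to the token embeddings (GPT-2/InCoder-style)."""

    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_seq_len, d_model)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)
```

This contrasts with the relative-attention schemes of Shaw et al. and Dai et al., where position information enters inside the attention computation rather than being added to the input embeddings.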
Experiment suggested by @dpfried to de-risk FIM:
Run a 350M-parameter model with FIM + learned absolute embeddings on the Python subset and check whether it reproduces the results of previously trained non-FIM models.
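To make the proposed run concrete, here is a minimal sketch of the character-level FIM data transform from the paper, where documents are randomly rewritten into prefix-suffix-middle (PSM) order so a left-to-right model learns to infill. The sentinel strings are illustrative placeholders (the real ones would be special tokens added to the tokenizer), and this operates on raw strings rather than token IDs:

```python
import random

# Illustrative sentinel placeholders, not actual vocabulary entries.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def fim_transform(document: str, fim_rate: float = 0.5) -> str:
    """With probability `fim_rate`, split the document at two random
    character positions and move the middle span to the end (PSM order).
    Otherwise return the document unchanged; mixing in unchanged
    documents is what preserves ordinary left-to-right performance."""
    if len(document) < 2 or random.random() >= fim_rate:
        return document
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```

The de-risking question is then whether training on this transformed stream with learned absolute embeddings, instead of the relative attention used in the paper, still yields the "FIM-for-free" result on the non-FIM evaluation.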
On Slack, @dpfried and @Stanislas0 discussed multi-lingual vs. mono-lingual models, interpreting the findings from this AWS paper. Multi-lingual models perform better than their mono-lingual counterparts beyond 3B parameters (though the gains vary significantly per programming language). One caveat is that the multi-lingual model was trained for much longer, so it is unclear how these models would compare under similar compute budgets.
Since we do not have much time to experiment with all these components, we have to make decisions based on the current literature.