HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training
BSD 2-Clause "Simplified" License

Scaling #13

Closed ClashLuke closed 2 years ago

ClashLuke commented 2 years ago

Most transformers improve drastically in performance when they are scaled up. ViT-G showed this for vision transformers and Chinchilla for language models. However, as we're not using a transformer, it's uncertain whether we'll see similar improvements.

This issue is about "scaling up" and tracks the progress of large-scale models. Once it's finished, we should be able to run our models on v3-32s, v3-256s and bigger slices. Using those 32x bigger TPUs, we aim for steps that are at least 16x faster, measured in throughput (tokens * parameters per second).
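A minimal sketch of what that target means in numbers, assuming hypothetical parameter counts, batch sizes and step times (none of these are measured values from this repo); it just spells out the tokens * parameters per second metric and the implied scaling efficiency:

```python
def throughput(tokens_per_step: float, parameters: float, step_time_s: float) -> float:
    """Throughput in (tokens * parameters) / second, the metric tracked in this issue."""
    return tokens_per_step * parameters / step_time_s

# Hypothetical illustration only: a v3-8 baseline vs. the v3-256 target.
baseline = throughput(tokens_per_step=2**18, parameters=1e9, step_time_s=1.0)
target = 16 * baseline  # "at least 16x faster steps" on a 32x bigger TPU slice

# 16x speedup on 32x more hardware corresponds to 50% scaling efficiency.
efficiency = 16 / 32
print(f"baseline: {baseline:.3e} tok*params/s")
print(f"v3-256 target: {target:.3e} tok*params/s (scaling efficiency >= {efficiency:.0%})")
```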