Most transformers improve drastically in performance when they are scaled up. ViT-G showed this for vision transformers and Chinchilla for language models. However, as we're not using a transformer, it's uncertain whether we'll see similar improvements.
This issue is about "scaling up" and tracks the progress of large-scale models. Once it's finished, we should be able to run our models on v3-32s, v3-256s, and bigger. Using those 32x bigger TPUs, we aim for at least 16x faster steps (measured in tokens * parameters per second).
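To make the target concrete: 16x faster steps on 32x more hardware corresponds to at least 50% scaling efficiency relative to ideal linear scaling. A minimal sketch of that arithmetic (the function name and the baseline/target figures are just illustrations of the numbers stated above):

```python
def scaling_efficiency(size_factor: float, speedup_factor: float) -> float:
    """Fraction of ideal (linear) scaling actually achieved when the
    hardware grows by `size_factor` and throughput by `speedup_factor`."""
    return speedup_factor / size_factor

# Target from this issue: 32x more TPU cores, at least 16x faster steps
# (throughput measured in tokens * parameters per second).
efficiency = scaling_efficiency(size_factor=32, speedup_factor=16)
print(f"{efficiency:.0%}")  # 50% -> at least half of ideal linear scaling
```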