At the moment, the language-modelling loss is our only signal when experimenting with different architectures. Unfortunately, many changes, such as extra-gradient methods, different loss functions, different tokenisers, or even different datasets, shift these loss values dramatically, making direct comparison almost impossible. Integrating a dedicated evaluation pipeline such as EleutherAI's eval-harness would give us greater certainty that one model is better than another and allow us to compare our models against existing ones such as GPT-J and GPT-3.