Background
Scaling laws show that loss follows a smooth curve given compute, data, etc., but it is an open question whether loss alone accurately captures qualitative performance.
What to plot?
Take several pretrained language models (GPT-Neo, GPT-J, GPT-2, etc.) and create a scatterplot of human evaluation scores vs. loss, as in the sketch below.
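A minimal sketch of how this could be set up, assuming the Hugging Face `transformers` API; the held-out corpus and the `human_scores` values are placeholders that would come from the actual evaluation:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["gpt2", "EleutherAI/gpt-neo-1.3B",
          "EleutherAI/gpt-neo-2.7B", "EleutherAI/gpt-j-6B"]

@torch.no_grad()
def mean_loss(model_name: str, texts: list[str], device: str = "cuda") -> float:
    """Average next-token cross-entropy of a pretrained model on held-out text."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    losses = []
    for text in texts:
        ids = tok(text, return_tensors="pt",
                  truncation=True, max_length=1024).input_ids.to(device)
        losses.append(model(ids, labels=ids).loss.item())  # HF shifts labels internally
    return sum(losses) / len(losses)

texts = ["The quick brown fox jumps over the lazy dog."]  # replace with a real held-out corpus
losses = {name: mean_loss(name, texts) for name in MODELS}

# Placeholder only -- fill in with scores from the real human evaluation study.
human_scores = {name: 0.0 for name in MODELS}

for name in MODELS:
    plt.scatter(losses[name], human_scores[name])
    plt.annotate(name, (losses[name], human_scores[name]))
plt.xlabel("Validation loss (nats/token)")
plt.ylabel("Human evaluation score")
plt.show()
```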
Additionally, mix the logits of different models during generation to match the loss of other models. E.g., mix the logits of GPT-J with GPT-Neo 1.3B in the ratio that yields the same loss as GPT-Neo 2.7B, and compare the human evaluation scores of this mixed model against the original iso-loss model (see the sketch below).
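A hedged sketch of the logit-mixing idea, again assuming the `transformers` API. GPT-J and GPT-Neo share the GPT-2 tokenizer, but GPT-J pads its output vocabulary, so logits are sliced to the shared tokenizer vocabulary before mixing. The calibration assumes the mixture loss is roughly monotone in the mixing weight; a grid sweep over alpha is a robust alternative if that assumption fails:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tok = AutoTokenizer.from_pretrained("gpt2")  # shared BPE tokenizer
vocab = tok.vocab_size                       # 50257; GPT-J's head is padded beyond this
model_a = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B").to(device).eval()
model_b = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").to(device).eval()

@torch.no_grad()
def mixture_loss(alpha: float, ids: torch.Tensor) -> float:
    """Cross-entropy of alpha * GPT-J logits + (1 - alpha) * GPT-Neo logits."""
    la = model_a(ids).logits[..., :vocab]
    lb = model_b(ids).logits[..., :vocab]
    mixed = alpha * la + (1 - alpha) * lb
    # Shift so position t predicts token t+1.
    return F.cross_entropy(mixed[:, :-1].reshape(-1, vocab),
                           ids[:, 1:].reshape(-1)).item()

def calibrate_alpha(target_loss: float, ids: torch.Tensor, steps: int = 20) -> float:
    """Binary-search the mixing weight so the mixture matches a target loss
    (e.g. GPT-Neo 2.7B's measured loss). Assumes loss decreases as alpha -> 1."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if mixture_loss(mid, ids) > target_loss:
            lo = mid  # mixture too lossy: weight the stronger model more
        else:
            hi = mid
    return (lo + hi) / 2

@torch.no_grad()
def generate_mixed(alpha: float, prompt: str, max_new_tokens: int = 50) -> str:
    """Sample from the mixed-logit distribution for the human-eval comparison."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        la = model_a(ids).logits[:, -1, :vocab]
        lb = model_b(ids).logits[:, -1, :vocab]
        probs = torch.softmax(alpha * la + (1 - alpha) * lb, dim=-1)
        ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=-1)
    return tok.decode(ids[0])
```

The intended usage would be to calibrate alpha once on held-out text against GPT-Neo 2.7B's measured loss, then hold it fixed while generating samples for the human evaluation.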
Related Papers/Frameworks
N/A