
[RFP] Qualitative evaluations of mixing language models #24

Closed: kingoflolz closed this issue 1 year ago

kingoflolz commented 3 years ago

Background

Scaling laws show that loss follows a smooth curve as a function of compute, data, etc., but it is an open question whether loss alone accurately captures qualitative performance.

What to plot?

Take several pretrained language models (GPT-Neo, GPT-J, GPT-2, etc.) and create a scatterplot of human evaluation scores vs. loss.
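
A minimal sketch of the loss axis of that scatterplot, assuming Hugging Face checkpoints (`gpt2`, `EleutherAI/gpt-neo-1.3B`, etc.) and a held-out text set; the human evaluation scores would come from a separate annotation study and are only indicated here as a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = [
    "gpt2",
    "EleutherAI/gpt-neo-1.3B",
    "EleutherAI/gpt-neo-2.7B",
    "EleutherAI/gpt-j-6B",
]

@torch.no_grad()
def eval_loss(model_name, texts, device="cuda"):
    """Mean next-token cross-entropy of a checkpoint on held-out texts."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    losses = []
    for text in texts:
        ids = tok(text, return_tensors="pt",
                  truncation=True, max_length=1024).input_ids.to(device)
        # Passing labels makes the model return the shifted LM loss directly.
        losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

heldout = ["..."]  # stand-in; use a real held-out set (e.g. a Pile validation slice)
losses = {name: eval_loss(name, heldout) for name in MODELS}
# Scatter these against human evaluation scores once annotations are collected:
# plt.scatter(list(losses.values()), human_scores)
```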

Additionally, mix the logits of different models during generation so as to match the loss of another model. E.g., mix the logits of GPT-J with those of GPT-Neo 1.3B in the ratio that yields the same loss as GPT-Neo 2.7B, then compare the human evaluation scores of this mixed model with those of the original iso-loss model.
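
A sketch of the mixing step, assuming two Hugging Face causal LMs that share the GPT-2 BPE vocabulary (GPT-J pads its output layer to 50400 entries, so logits are sliced to the shared range). The linear interpolation weight `alpha`, the bisection calibration, and all function names here are illustrative assumptions, not a prescribed method; the calibration additionally assumes the mixed loss varies monotonically with `alpha`:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model_a = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B").to(device).eval()
model_b = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").to(device).eval()

@torch.no_grad()
def mixed_loss(alpha, texts):
    """Next-token cross-entropy of alpha * logits_A + (1 - alpha) * logits_B."""
    total, count = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt",
                  truncation=True, max_length=512).input_ids.to(device)
        la, lb = model_a(ids).logits, model_b(ids).logits
        v = min(la.shape[-1], lb.shape[-1])  # GPT-J pads its vocab; keep the shared slice
        logits = alpha * la[..., :v] + (1 - alpha) * lb[..., :v]
        total += F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="sum").item()
        count += ids.shape[-1] - 1
    return total / count

def calibrate_alpha(target_loss, texts, iters=20):
    """Bisect for the alpha whose mixed loss matches, e.g., GPT-Neo 2.7B's loss.

    Assumes mixed loss decreases monotonically as alpha shifts weight
    toward the stronger model (model_a)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mixed_loss(mid, texts) > target_loss:
            lo = mid  # still above target: give model_a more weight
        else:
            hi = mid
    return (lo + hi) / 2

@torch.no_grad()
def mixed_generate(prompt, alpha, max_new_tokens=50):
    """Sample from the interpolated logits one token at a time (no KV cache)."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        la = model_a(ids).logits[:, -1, :]
        lb = model_b(ids).logits[:, -1, :]
        v = min(la.shape[-1], lb.shape[-1])
        probs = F.softmax(alpha * la[:, :v] + (1 - alpha) * lb[:, :v], dim=-1)
        ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=-1)
    return tok.decode(ids[0])
```

One design note: interpolating logits is not the same as interpolating output probabilities (a mixture-of-experts-style ensemble). The proposal asks for logit mixing, but both are reasonable iso-loss constructions and could be compared in the same evaluation.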

Related Papers/Frameworks

N/A