EleutherAI / project-menu


[RFP] How strong is the inductive bias of model size for higher order coherence? #31

Closed · leogao2 closed this issue 1 year ago

leogao2 commented 3 years ago

Background

Larger models seem to reach the same loss faster, which seems to imply they are more sample efficient. However, that inference assumes all models at the same loss have roughly the same subjective quality. This is almost certainly not true in general: you can construct pathological examples where the assumption fails (see the toy sketch below). What I'm more interested in is whether it holds in practice. For instance, larger models might be better at memorizing low-order coherence (local statistics such as spelling and common n-grams), so they would get better loss but have worse higher-order coherence. Or maybe it's the other way around, and there is some inductive bias to having a lot of parameters that makes a model more likely to learn high-order coherence (my prior leans towards this case, since anecdotally GPT-3 also seems more high-order coherent). This is very alignment-relevant because it tells us how future big models might behave.
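As a concrete illustration of why matched loss need not mean matched quality, here is a toy sketch (all numbers are made up for illustration, not taken from any experiment): two hypothetical models with identical mean token-level loss, where one concentrates its errors on the rarer tokens that require long-range context.

```python
# Toy illustration with made-up numbers: two "models" with identical mean
# token loss but different error structure. Model A spreads its loss evenly;
# model B is sharper on local/low-order tokens and pays for it on the rarer
# tokens that need long-range context. Equal val loss, plausibly very
# different higher-order coherence.
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 10_000
needs_long_range = rng.random(n_tokens) < 0.1  # ~10% of tokens need long-range context

loss_a = np.full(n_tokens, 2.0)                # uniform errors everywhere
loss_b = np.where(needs_long_range, 6.5, 1.5)  # strong locally, weak at long range

print(loss_a.mean(), loss_b.mean())            # both ~2.0: matched val loss
print(loss_b[needs_long_range].mean())         # 6.5: much worse exactly where coherence lives
```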

Does a larger model at the same loss as a smaller model have exactly the same subjective quality and downstream task performance?

What to plot?

Train a series of models of different sizes to the same validation loss. Compare eval harness scores and subjective evaluations of higher-order coherence (i.e., grammar and logical coherence) in particular; a rough sketch of this setup is given below.
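A minimal sketch of how the experiment could be wired up, assuming hypothetical `train_step`, `val_loss`, and `run_eval_harness` callables (placeholders, not a real API), with the model sizes and target loss also placeholders:

```python
# Hypothetical sketch of the proposed experiment. train_step, val_loss, and
# run_eval_harness are stand-ins for a real training loop and for scoring a
# checkpoint with EleutherAI's lm-evaluation-harness plus human coherence
# ratings. Each model size trains until it first reaches a shared target
# validation loss, so any remaining quality gap is attributable to size.
from typing import Callable, Dict

TARGET_VAL_LOSS = 2.8                    # placeholder; pick a loss every size can reach
MODEL_SIZES = ["125M", "350M", "1.3B"]   # placeholder size sweep

def train_to_target_loss(
    size: str,
    train_step: Callable[[str], None],
    val_loss: Callable[[str], float],
    max_steps: int = 1_000_000,
) -> int:
    """Run training steps until validation loss first crosses the target."""
    for step in range(max_steps):
        train_step(size)
        if step % 1000 == 0 and val_loss(size) <= TARGET_VAL_LOSS:
            return step  # this checkpoint sits at (roughly) the shared loss
    raise RuntimeError(f"{size} never reached the target loss; lower the target")

def compare_at_matched_loss(train_step, val_loss, run_eval_harness) -> Dict[str, dict]:
    """Collect scores for every model size at the shared loss level."""
    results = {}
    for size in MODEL_SIZES:
        steps = train_to_target_loss(size, train_step, val_loss)
        results[size] = {"steps_to_target": steps, "scores": run_eval_harness(size)}
    return results
```

Stopping at the first crossing of the target loss is the simplest matching rule; interpolating between saved checkpoints around the crossing would give a tighter loss match if the comparison turns out to be sensitive to it.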

Related Papers/Frameworks

Tangentially related to #20. Very similar to #24.