EleutherAI / project-menu

See the issue board for the current status of active and prospective projects!

[RFP] Low and high order coherence as model loss improves #20

Closed · leogao2 closed this issue 1 year ago

leogao2 commented 3 years ago

Background

We predict that LMs will learn low-order coherence (i.e., grammar) first, and only later learn higher-order coherence (i.e., logical soundness), where by "later" I mean at lower loss. The question is: even if this seems intuitively correct, can we actually observe it in real models? If we can, it would provide important insight into learning dynamics and let us infer whether future improvements in capabilities will be more discontinuous (or more continuous, who knows).

What to plot?

Take either one model at a ton of different checkpoints, or a series of comparable models each trained on x tokens. Figure out some way to measure the grammaticality, logical coherence, etc. of outputs (you probably want prompts from a genre where a lack of logical coherence is obvious), using human feedback, automatic metrics, or a mix. Then plot each measure against loss/compute/params/etc. and see whether the higher-order measures only start improving after the lower-order ones saturate. A rough sketch of the pipeline is below.
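As a concrete starting point, here is a minimal sketch of the checkpoint-sweep half of the experiment. Everything specific in it is an assumption on my part: the Pythia model name and `stepN` checkpoint revisions, the prompt, and the use of `language_tool_python` as a cheap automatic proxy for grammaticality. The higher-order coherence measure would still need human raters or a separate metric.

```python
# Minimal sketch: sample from a series of training checkpoints and score the
# samples for low-order coherence. Model name, checkpoint revisions, prompt,
# and the grammar-checker proxy are all illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import language_tool_python  # off-the-shelf grammar checker

CHECKPOINTS = ["step1000", "step10000", "step100000"]  # assumed revision names
PROMPT = "The detective explained exactly how the crime had been committed:"

tool = language_tool_python.LanguageTool("en-US")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

def grammar_errors_per_token(text: str) -> float:
    """Low-order coherence proxy: grammar-checker flags per generated token."""
    n_tokens = max(len(tokenizer.encode(text)), 1)
    return len(tool.check(text)) / n_tokens

for rev in CHECKPOINTS:
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/pythia-160m", revision=rev
    )
    inputs = tokenizer(PROMPT, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=100, do_sample=True, top_p=0.9
        )
    completion = tokenizer.decode(
        out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    print(rev, grammar_errors_per_token(completion))

# Higher-order coherence (logical soundness) still needs human raters or an
# automatic consistency metric; plot both curves against the training loss.
```

In practice you would average over many prompts and samples per checkpoint; the point is just that each checkpoint yields one point per metric, which then gets plotted against that checkpoint's loss.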

Related Papers/Frameworks

See https://www.gwern.net/Scaling-hypothesis#why-does-pretraining-work and https://www.alignmentforum.org/posts/EmxfgPGvaKqhttPM8/thoughts-on-the-alignment-implications-of-scaling-language for some relevant thoughts.

Jeevesh8 commented 2 years ago

A primitive approach would be to just run SentEval at various stages of pre-training, to see the level of coherence captured by the model at each point during pre-training.
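For reference, here is a hedged sketch of what that could look like, using SentEval's probing tasks (which range from surface properties up to syntax and semantics, loosely mirroring the low-/high-order coherence distinction). The mean-pooling batcher, model/checkpoint names, data path, and task selection are my own illustrative choices, not something specified above.

```python
# Hedged sketch of the SentEval idea: probe sentence embeddings from one
# pre-training checkpoint; repeat over checkpoints to get the trajectory.
# The mean-pooling batcher, model/revision, and task list are assumptions.
import torch
import senteval  # github.com/facebookresearch/SentEval
from transformers import AutoModel, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"  # assumed model with step-indexed revisions
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers lack a pad token
model = AutoModel.from_pretrained(MODEL, revision="step10000")

def prepare(params, samples):
    pass  # no task-specific setup needed for this simple batcher

def batcher(params, batch):
    # SentEval hands over tokenized sentences; mean-pool the hidden states
    # into one fixed-size vector per sentence.
    sents = [" ".join(tokens) if tokens else "." for tokens in batch]
    enc = tokenizer(sents, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

params = {"task_path": "SentEval/data", "usepytorch": True, "kfold": 5}
se = senteval.engine.SE(params, batcher, prepare)
# Probing tasks span surface properties (Length) through syntax (Depth,
# BigramShift) to semantics (CoordinationInversion), loosely matching the
# low-/high-order coherence axis.
results = se.eval(["Length", "Depth", "BigramShift", "CoordinationInversion"])
print(results)
```

Running this at each checkpoint and plotting per-task accuracy against loss would show whether the syntax-level probes saturate before the more semantic ones start moving.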