[Idea] Are big LMs mesaoptimizing?

Motivation

Mesaoptimization in big LMs would be concerning, and would make any LM+RL setup really unsafe.

The main problems to figure out are:

How do we want to define mesaoptimization?
What would count as evidence of mesaoptimization in an LM?
What implications does this have for larger models?

I don't have satisfying answers to these yet.

Hypothesis/Conjecture

Big LMs might be mesaoptimizing -- this seems plausible given that LMs can model agenty things.

Proposed Experiments (or series of experiments)

Let me know what you think about the hypothesis and the design of experiments in the comments below! Also, feel free to propose new or better experiments.

EleutherAI / project-menu

[Idea] Are big LMs mesaoptimizing? #23
