Project: Human Feedback Without Privileged Information Channels

Elevator Pitch

Many alignment schemes rely on a "privileged information" channel for value learning. For instance, learning from human feedback uses ranking or voting information from users. This works well when there are solid boundaries between humans, agents, and environments. As agent capabilities increase, however, those boundaries blur and the privileged channel can become a target for manipulation by the agent, at which point the agent's behavior can become unpredictable. We should therefore aim to develop alignment schemes that eventually do not need to rely on privileged channels.
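To make the idea of a privileged channel concrete, here is a minimal sketch of ranking-based feedback, assuming a Bradley-Terry preference model (a common choice for pairwise rankings, not something this proposal prescribes). The function name `fit_preference_scores`, the `human_rankings` data, and the hyperparameters are all hypothetical and chosen only for illustration. The list of comparisons is the only place where human judgment enters the learner, so whoever controls that list controls what is learned.

```python
import math

def fit_preference_scores(num_items, comparisons, lr=0.1, steps=2000):
    """Fit one scalar score per behaviour from pairwise (winner, loser) labels.

    Assumes a Bradley-Terry model: P(i beats j) = sigmoid(score[i] - score[j]).
    `comparisons` plays the role of the privileged channel.
    """
    scores = [0.0] * num_items
    for _ in range(steps):
        grads = [0.0] * num_items
        for winner, loser in comparisons:
            diff = scores[winner] - scores[loser]
            p_win = 1.0 / (1.0 + math.exp(-diff))
            # Gradient of the log-likelihood of the observed preference.
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        # Plain gradient ascent on the log-likelihood.
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores

# Hypothetical feedback: behaviour 2 is preferred to 1, and 1 to 0.
human_rankings = [(2, 1), (1, 0), (2, 0), (2, 1)]
scores = fit_preference_scores(3, human_rankings)

# If the channel is tampered with (here, every label is flipped),
# the learner recovers the opposite ordering without noticing.
tampered_rankings = [(loser, winner) for winner, loser in human_rankings]
tampered_scores = fit_preference_scores(3, tampered_rankings)
```

In this toy setup the learner has no way to distinguish honest labels from tampered ones, which is the failure mode the pitch is concerned with once an agent is capable enough to influence the channel itself.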

Goal Outputs

The goal of the project is to demonstrate eventual value learning without a privileged channel. That is, we want to construct an agent that picks up on a human's values solely through observations of and interactions with the human. To make the task easier, human feedback is allowed during the first phase, while the agent is being built.

Milestones

Construct a value learning system
Create a collection of tasks to use as value learning targets
Develop a model with a sufficiently long context that it can meta-learn value learning from a few examples

How to Help

Contribute to building either a UI or Mechanical Turk tasks for value learning
Help with Eleuther's value learning efforts/EEGI

Desired Support

Tacit knowledge from the EEGI project
Money for Mechanical Turk expenses
Compute to train/run a model with decent meta-learning capabilities

EleutherAI / project-menu

Restricted Privilege Value Learning #19
