EleutherAI / project-menu

See the issue board for the current status of active and prospective projects!

Restricted Privilege Value Learning #19

Closed AI-WAIFU closed 1 year ago

AI-WAIFU commented 3 years ago

Project: Human Feedback Without Privileged Information Channels

Elevator Pitch

Many alignment schemes rely on a "privileged information" channel for value learning. For instance, learning from human feedback uses ranking or voting information from users. This works fine when there are solid boundaries between humans, agents, and environments. As agent capabilities increase, however, those lines get blurry, and a privileged channel can become a target for manipulation by the agent, at which point the agent's behavior can become unpredictable. We should therefore aim to develop alignment schemes that eventually do not need to rely on privileged channels.

Goal Outputs

The goal of the project is to demonstrate eventual value learning without a privileged channel. That is, an agent should be constructed that picks up on a human's values solely through observations of and interactions with the human. To make the task easier, human feedback is allowed during the first phase, while the agent is being built.

Milestones