The premise of the Multi-Disciplinary Conference on Reinforcement Learning and Decision Making (RLDM) is that multiple disciplines share an interest in goal-directed decision making over time. Here, Sutton deepens this idea by proposing a perspective on the decision maker that is substantive and widely used across disciplines, which he calls the common model of the intelligent agent. The common model includes:
Sutton is trying to devise a neutral terminology that can be used across disciplines and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.
The natural sciences of psychology, neuroscience, and ethology, the engineering sciences of artificial intelligence, optimal control theory, and operations research, and the social sciences of economics and anthropology all focus in part on intelligent decision makers. The perspectives of the various disciplines differ, but they have common elements. One cross-disciplinary goal is to identify the common core, those aspects of the decision maker that are shared by all or many of the disciplines. Many scientific insights have been gained from cross-disciplinary interactions, such as the now-widespread use of Bayesian methods in psychology, the reward-prediction-error interpretation of dopamine in neuroscience, and the longstanding use of the neural-network metaphor in machine learning. In this short paper, Sutton hopes to advance the quest in the following small ways:
The decision-maker makes its decisions over time, which may be divided into discrete steps at each of which:
What terminology shall we use for the signals and for the entities exchanging them?
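Whatever the terminology, the interface itself is simple. Below is a minimal sketch of the exchange (the class and method names are placeholders of my own, not terms from the paper): at each discrete step the decision maker emits an action, and the world returns an observation together with a scalar reward.

```python
import random

class World:
    """Toy stand-in for the world/environment: returns an observation and a reward."""
    def reset(self):
        return 0.0  # initial observation

    def step(self, action):
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    """Toy stand-in for the agent/mind/controller: maps the latest signals to an action."""
    def step(self, observation, reward):
        return random.choice([0, 1])

world, agent = World(), Agent()
observation, reward = world.reset(), 0.0
for t in range(10):
    action = agent.step(observation, reward)   # agent emits an action
    observation, reward = world.step(action)   # world answers with observation and reward
```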
The components of the system:
Most disciplines formulate the agent's goal in terms of a scalar signal generated outside the agent's direct control, and thus we place its generation, formally, in the world. In the general case, this signal arrives on every step and the goal is to maximize its sum. Such additive rewards may be used to formulate the goal as:
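As one concrete instance (a sketch of my own, not a quotation of the paper's formulations), the discounted-return objective sums future rewards weighted by a discount factor gamma:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..., computed by a backward sweep."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
```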
A simpler but still popular notion of goal is a state of the world to be reached. This allows much more concrete inference, but it is less general than additive rewards. For example, it cannot:
Maybe we can use the successor representation #11 to fulfill this goal, e.g. use SR-based mental simulation to find a possible path to the goal state. And we can integrate reward as part of the state, e.g. learn a latent state space rather than receive an identified state space as in the control-theory setting.
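A minimal sketch of that idea for a tiny tabular case (the transition matrix and discount factor below are assumptions of mine for illustration): under a fixed policy the successor representation is M = (I − γP)⁻¹, and reading out the column of a goal state gives a discounted-occupancy signal that can guide behaviour toward that state.

```python
import numpy as np

gamma = 0.95
# Policy-induced transition matrix of a 3-state chain (assumed);
# state 2 is absorbing and plays the role of the goal state.
P = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Successor representation: M[s, s'] = expected discounted future visits to s' from s.
M = np.linalg.inv(np.eye(3) - gamma * P)

goal = 2
print(M[:, goal])  # higher values = the goal is "closer" under the current policy
```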
Additive rewards have a long inter-disciplinary history:
It seems that I need to understand the stability of grid cells, which are rarely coupled with object-vector cells. It seems that the entorhinal cortex encodes reward and state independently.
Here, Sutton has opted to include in the agent only the most essential elements for which there is widespread (albeit not universal) agreement within and across disciplines, and to describe them only in general terms. The proposed common model of the internal structure of the agent has four principal components, which are interconnected by a central signal, the subjective state (which seems natural from a Bayesian perspective):
Perception: processes the stream of observations and actions to produce the subjective state, a summary of the agent-world interaction so far (so it does take the history into account) that is useful for selecting action (the reactive policy), for predicting future reward (the value function, itself a prediction problem), and for predicting future subjective states (the transition model). The state is subjective in that it is relative to the agent's observations and actions and may not correspond to the actual internal workings of the world. The subjective state can be constructed in two ways:
In general, the perception component should have a recursive form, allowing the subjective state to be computed efficiently from the preceding subjective state, the most recent observation, and the most recent action, without revisiting the lengthy history of prior observations and actions. The perception component is not just about short-term memory, but also about representation, e.g. it includes any domain-dependent feature-construction process, as when a person observes a complex visual image and re-represents it in terms of objects and relationships, or observes a chess position and re-represents it in terms of threats and pawn structure. Does this mean that different cortices hold different cognitive maps, and that we need the hippocampus to organize them together?
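A minimal sketch of this internal structure (all function forms, dimensions, and names below are my own illustrative assumptions, not Sutton's): perception is a recursive update s_t = u(s_{t-1}, a_{t-1}, o_t), and the reactive policy, value function, and transition model all read from the resulting subjective state.

```python
import numpy as np

class CommonModelAgent:
    """Sketch of the common model's four components built around a subjective state."""

    def __init__(self, state_dim=8, obs_dim=4, num_actions=2, seed=0):
        rng = np.random.default_rng(seed)
        self.state = np.zeros(state_dim)                       # subjective state s
        self.W_s = rng.normal(size=(state_dim, state_dim))     # recurrent weights
        self.W_o = rng.normal(size=(state_dim, obs_dim))       # observation weights
        self.W_a = rng.normal(size=(state_dim, num_actions))   # action weights
        self.W_pi = rng.normal(size=(num_actions, state_dim))  # reactive policy
        self.w_v = rng.normal(size=state_dim)                  # value function
        self.W_m = rng.normal(size=(state_dim, state_dim))     # transition model
        self.num_actions = num_actions

    def perceive(self, prev_action, observation):
        """Perception: recursive update s_t = u(s_{t-1}, a_{t-1}, o_t),
        with no need to revisit the full history."""
        a = np.eye(self.num_actions)[prev_action]              # one-hot previous action
        self.state = np.tanh(self.W_s @ self.state + self.W_o @ observation + self.W_a @ a)
        return self.state

    def policy(self):
        """Reactive policy: map the subjective state to an action."""
        return int(np.argmax(self.W_pi @ self.state))

    def value(self):
        """Value function: predict future (cumulative) reward from the state."""
        return float(self.w_v @ self.state)

    def model(self):
        """Transition model: predict the next subjective state (not the next observation)."""
        return np.tanh(self.W_m @ self.state)

agent = CommonModelAgent()
s = agent.perceive(prev_action=0, observation=np.ones(4))
print(agent.policy(), agent.value())
```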
Note that there is no explicit role in the common model for predicting observations other than reward: the transition model does not actively predict the raw observations, and the agent is explicitly trained to predict future reward rather than the next observation. Still, this has no direct relationship with the successor representation.
Richard S. Sutton. The Quest for a Common Model of the Intelligent Decision Maker.