The premise of the Multi-Disciplinary Conference on Reinforcement Learning and Decision Making (RLDM) is that multiple disciplines share an interest in goal-directed decision making over time. Here, Sutton deepens this idea by proposing a perspective on the decision maker that is substantive and widely used across disciplines, which he calls the common model of the intelligent agent. The common model includes:
Sutton is trying to devise a neutral terminology that can be used across disciplines and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.
The natural sciences of psychology, neuroscience, and ethology, the engineering sciences of artificial intelligence, optimal control theory, and operations research, and the social sciences of economics and anthropology all focus in part on intelligent decision makers. The perspectives of the various disciplines differ, but they have common elements. One cross-disciplinary goal is to identify the common core, those aspects of the decision maker that are shared by all or many of the disciplines. Many scientific insights have been gained from cross-disciplinary interactions, such as the now-widespread use of Bayesian methods in psychology, the reward-prediction-error interpretation of dopamine in neuroscience, and the longstanding use of the neural-network metaphor in machine learning. In this short paper, Sutton hopes to advance the quest in the following small ways:
The decision-maker makes its decisions over time, which may be divided into discrete steps at each of which:
What terminology shall we use for the signals and for the entities exchanging them?
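Whatever the terminology, the interface itself is simple. Below is a minimal sketch of the exchange (the class and method names are placeholders of my own, not terms from the paper): at each discrete step the decision maker emits an action, and the world returns an observation together with a scalar reward.

```python
import random

class World:
    """Toy stand-in for the world/environment: returns an observation and a reward."""
    def reset(self):
        return 0.0  # initial observation

    def step(self, action):
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    """Toy stand-in for the agent/mind/controller: maps the latest signals to an action."""
    def step(self, observation, reward):
        return random.choice([0, 1])

world, agent = World(), Agent()
observation, reward = world.reset(), 0.0
for t in range(10):
    action = agent.step(observation, reward)   # agent emits an action
    observation, reward = world.step(action)   # world answers with observation and reward
```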
The components of the system:
Most disciplines formulate the agent's goal in terms of a scalar signal generated outside the agent's direct control, and thus we place its generation, formally, in the world. In the general case, this signal arrives on every step and the goal is to maximize its sum. Such additive rewards may be used to formulate the goal as:
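As one concrete instance (a sketch of my own, not a quotation of the paper's formulations), the discounted-return objective sums future rewards weighted by a discount factor gamma:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..., computed by a backward sweep."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
```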
A simpler but still popular notion of goal is a state of the world to be reached. This allows much more concrete inference, but it is less general than additive rewards. For example, it cannot:
Maybe we can use the successor representation #11 to fulfill this goal, e.g. use SR-based mental simulation to find a possible path to the goal state. And we can integrate reward as part of the state, e.g. learn a latent state space rather than receive an identified state space as in the control-theory setting.
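A minimal sketch of that idea for a tiny tabular case (the transition matrix and discount factor below are assumptions of mine for illustration): under a fixed policy the successor representation is M = (I − γP)⁻¹, and reading out the column of a goal state gives a discounted-occupancy signal that can guide behaviour toward that state.

```python
import numpy as np

gamma = 0.95
# Policy-induced transition matrix of a 3-state chain (assumed);
# state 2 is absorbing and plays the role of the goal state.
P = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Successor representation: M[s, s'] = expected discounted future visits to s' from s.
M = np.linalg.inv(np.eye(3) - gamma * P)

goal = 2
print(M[:, goal])  # higher values = the goal is "closer" under the current policy
```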
Additive rewards have a long inter-disciplinary history:
It seems that I need to understand the stability of grid cells, which are rarely coupled with object-vector cells. It seems that the entorhinal cortex encodes reward and state independently.
Here, Sutton has opted to include in the agent only the most essential elements for which there is widespread (albeit not universal) agreement within and across disciplines, and to describe them only in general terms. The proposed common model of the internal structure of the agent has four principal components, which are interconnected by a central signal, the subjective state (which seems natural from a Bayesian perspective):
Perception: processes the stream of observations and actions to produce the subjective state, a summary of the agent-world interaction so far (so it does take the history into account) that is useful for selecting action (the reactive policy), for predicting future reward (the value function, itself a prediction problem), and for predicting future subjective states (the transition model). The state is subjective in that it is relative to the agent's observations and actions and may not correspond to the actual internal workings of the world. The subjective state can be constructed in two ways:
In general, the perception component should have a recursive form, allowing the subjective state to be computed efficiently from the preceding subjective state, the most recent observation, and the most recent action, without revisiting the lengthy history of prior observations and actions. The perception component is not just about short-term memory, but also about representation, e.g. it includes any domain-dependent feature-construction process, as when a person observes a complex visual image and re-represents it in terms of objects and relationships, or observes a chess position and re-represents it in terms of threats and pawn structure. Does this mean that different cortices hold different cognitive maps, and that we need the hippocampus to organize them together?
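A minimal sketch of this internal structure (all function forms, dimensions, and names below are my own illustrative assumptions, not Sutton's): perception is a recursive update s_t = u(s_{t-1}, a_{t-1}, o_t), and the reactive policy, value function, and transition model all read from the resulting subjective state.

```python
import numpy as np

class CommonModelAgent:
    """Sketch of the common model's four components built around a subjective state."""

    def __init__(self, state_dim=8, obs_dim=4, num_actions=2, seed=0):
        rng = np.random.default_rng(seed)
        self.state = np.zeros(state_dim)                       # subjective state s
        self.W_s = rng.normal(size=(state_dim, state_dim))     # recurrent weights
        self.W_o = rng.normal(size=(state_dim, obs_dim))       # observation weights
        self.W_a = rng.normal(size=(state_dim, num_actions))   # action weights
        self.W_pi = rng.normal(size=(num_actions, state_dim))  # reactive policy
        self.w_v = rng.normal(size=state_dim)                  # value function
        self.W_m = rng.normal(size=(state_dim, state_dim))     # transition model
        self.num_actions = num_actions

    def perceive(self, prev_action, observation):
        """Perception: recursive update s_t = u(s_{t-1}, a_{t-1}, o_t),
        with no need to revisit the full history."""
        a = np.eye(self.num_actions)[prev_action]              # one-hot previous action
        self.state = np.tanh(self.W_s @ self.state + self.W_o @ observation + self.W_a @ a)
        return self.state

    def policy(self):
        """Reactive policy: map the subjective state to an action."""
        return int(np.argmax(self.W_pi @ self.state))

    def value(self):
        """Value function: predict future (cumulative) reward from the state."""
        return float(self.w_v @ self.state)

    def model(self):
        """Transition model: predict the next subjective state (not the next observation)."""
        return np.tanh(self.W_m @ self.state)

agent = CommonModelAgent()
s = agent.perceive(prev_action=0, observation=np.ones(4))
print(agent.policy(), agent.value())
```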
Note that there is no explicit role in the common model for predicting observations other than reward: the transition model does not actively predict the raw observations, and the agent is explicitly trained to predict future reward rather than the next observation. Still, this has no direct relationship with the successor representation.
Richard S. Sutton. The Quest for a Common Model of the Intelligent Decision Maker.