Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.
Reinforcement learning uses a formal framework defining the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals.
The concepts of value and value functions are the key features of most of the reinforcement learning methods that we consider in this book. We take the position that value functions are important for efficient search in the space of policies. The use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by scalar evaluations of entire policies.
The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.
| Evaluative feedback | Instructive feedback |
| --- | --- |
| basis of methods for function optimization | indicates the correct action to take |
| depends entirely on the action taken | independent of the action taken |
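To make the contrast concrete, here is a minimal sketch using a hypothetical 3-armed bandit (the action values below are made up): evaluative feedback is a reward for whichever arm was actually pulled, while instructive feedback names the best arm regardless of what was pulled.

```python
import random

# Hypothetical 3-armed bandit, used only to illustrate the distinction.
true_action_values = [1.0, 2.5, 0.3]   # assumed expected reward of each arm
best_action = max(range(3), key=lambda a: true_action_values[a])

action = random.randrange(3)           # the agent pulls some arm

# Evaluative feedback: a noisy reward for the action actually taken.
# It says how good that action was, not whether it was the best possible,
# so it depends entirely on the chosen action.
reward = random.gauss(true_action_values[action], 1.0)

# Instructive feedback: the correct action, independent of the action taken.
# This is the kind of signal supervised learning would train on.
label = best_action

print(action, reward, label)
```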
The purpose or goal of the agent is formalized in terms of a special reward signal passing from the environment to the agent.
reward hypothesis: That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal.
We have said that the agent’s goal is to maximize the cumulative reward it receives in the long run.
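In the book's standard notation, "cumulative reward in the long run" is made precise as the discounted return, where $\gamma \in [0, 1]$ is the discount rate:

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```

The agent's objective is to choose actions so as to maximize the expected value of $G_t$.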
By “the state” we mean whatever information is available to the agent. Ideally, we would like a state signal that summarizes past sensations compactly, yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP).
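In the book's notation, the Markov property means that the environment's response at time $t+1$ depends only on the state and action at time $t$, not on the rest of the history, so a finite MDP is completely specified by its one-step dynamics:

```latex
\Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_0, A_0, R_1, \ldots, R_t, S_t=s, A_t=a\}
  = \Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t=s, A_t=a\}
  \doteq p(s', r \mid s, a)
```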
1.3 Elements of Reinforcement Learning
Policy: a mapping from perceived states of the environment to actions to be taken when in those states; it defines the learning agent's way of behaving at a given time.
Reward signal
A reward signal defines the goal in a reinforcement learning problem:
The agent’s sole objective is to maximize the total reward it receives over the long run.
On each time step, the environment sends to the reinforcement learning agent a single number, a reward.
The only way the agent can influence the reward signal is through its actions, which can have a direct effect on reward, or an indirect effect through changing the environment's state.
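A minimal sketch of this interaction loop (the `GridEnv` environment and the random placeholder policy below are hypothetical, invented only to illustrate how actions are the agent's sole lever on the reward it accumulates):

```python
import random

class GridEnv:
    """Toy stand-in environment (hypothetical, for illustration only)."""
    def reset(self):
        self.pos = 0
        return self.pos                          # initial state

    def step(self, action):
        # action 1 moves right, anything else moves left, clamped to [0, 4]
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        reward = 1.0 if self.pos == 4 else 0.0   # a single number on each step
        done = self.pos == 4
        return self.pos, reward, done            # next state, reward, terminal flag

def run_episode(env):
    state, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = random.choice([0, 1])           # placeholder policy: act at random
        state, reward, done = env.step(action)   # actions are the only way to influence reward
        total_reward += reward                   # cumulative reward the agent tries to maximize
    return total_reward

print(run_episode(GridEnv()))
```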
Value function
Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
Although rewards are in a sense primary and values, as predictions of rewards, are secondary, it is values with which we are most concerned when making and evaluating decisions.
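In the book's notation, the value of a state $s$ under a policy $\pi$ is the expected return when starting in $s$ and following $\pi$ thereafter:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right]
        = \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \,\right]
```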
Reward and value: a method for efficiently estimating values is the most important component of almost all reinforcement learning algorithms.
A model of the environment: something that mimics the behavior of the environment and allows inferences about how it will behave; models are used for planning, that is, deciding on a course of action by considering possible future situations before they are actually experienced.
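A minimal sketch of what a model provides, assuming a hypothetical tabular model that maps a state-action pair to a predicted next state and reward (the states and rewards below are made up); planning then means rolling candidate action sequences through the model instead of the real environment:

```python
# Hypothetical tabular model: (state, action) -> (predicted next state, predicted reward).
model = {
    ("s0", "right"): ("s1", 0.0),
    ("s1", "right"): ("s2", 1.0),
    ("s0", "left"):  ("s0", 0.0),
}

def simulate(state, plan):
    """Roll a plan through the model instead of the real environment (planning)."""
    total = 0.0
    for action in plan:
        state, reward = model[(state, action)]
        total += reward
    return state, total

print(simulate("s0", ["right", "right"]))   # -> ('s2', 1.0)
```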