Reinforcement learning vs. supervised learning
Supervised learning is learning from a training set of labeled examples provided by a knowledgeable external supervisor. Each example is a description of a situation together with a specification (the label) of the correct action the system should take in that situation, which is often to identify a category to which the situation belongs.
Reinforcement learning: In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In RL, an agent must be able to learn from its own experience.
RL vs. unsupervised learning
Unsupervised learning is typically about finding structure hidden in collections of unlabeled data. The terms supervised learning and unsupervised learning would seem to exhaustively classify machine learning paradigms, but they do not.
RL: Uncovering structure in an agent's experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning problem of maximizing a reward signal.
What is constrained RL? In this part I am not sure where I should focus, because the constraints in RL are so different in each case. For example, the paper https://arxiv.org/pdf/1812.02900.pdf focuses on Batch-Constrained deep Q-Learning (BCQ).
Good summary and usable as a sound starting point for your further work :-) I'll come back to this tomorrow.
What is the difference between Forward view linear TD(Lambda) and Backward view linear TD(Lambda)?
TD(Lambda) vs. TD(0)
Coarse Coding for value function approximation
[ ] DeepTraffic https://github.com/Wunder2dream/deep-traffic-2019: read through and test
[ ] https://github.com/Wunder2dream/highway-env: test with a Deep Q-Network
[ ] 6.1 Derivative-Free Methods for Optimal Control
[ ] Open AI Safety Gym
n-step methods span a spectrum with MC methods at one end and one-step TD methods at the other. Conclusion: The best methods are often intermediate between the two extremes.
In many applications one wants to be able to update the action very fast to take into account anything that has changed, but bootstrapping works best if it is over a length of time in which a significant and recognizable state change has occurred. In TD(0) we always update the target after a single-step observation, using the estimated value of the next state; this is called one-step bootstrapping. In practice, bootstrapping works best over a stretch of time during which the state has changed noticeably.
Unified View of Reinforcement Learning
n-step TD prediction. The first question: what is the space of methods, or family of algorithms, lying between Monte Carlo and TD methods?
Let the TD target look n steps into the future. The backup diagrams of the n-step methods form a spectrum ranging from one-step TD methods to Monte Carlo methods.
Target of the update: n-step updates are still TD methods because they still change an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but n steps later.
(1) For MC, the estimate of $V(S_t)$ is updated in the direction of the complete return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T.$$
(2) For TD(0) / one-step TD, the target is the one-step return:
$$G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1}).$$
(3) For two-step TD, the update is based on the two-step return:
$$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2}).$$
(4) For n-step TD, the update is based on the n-step return:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}).$$
Note: if $t + n \ge T$ (the n-step return extends to or beyond termination), then all the missing terms are taken as zero, and the n-step return is defined to be equal to the ordinary full return, $G_{t:t+n} = G_t$.
Note that n-step returns for n > 1 involve future rewards and states that are not available at the time of transition from t to t + 1. No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$.
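As a concrete illustration (not from the original notes), here is a minimal Python sketch of the n-step return with the truncation rule above; the names `rewards`, `bootstrap_value`, and the indexing convention are assumptions:

```python
def n_step_return(rewards, bootstrap_value, t, n, gamma, T):
    """G_{t:t+n}: n discounted rewards plus a bootstrapped value estimate.

    rewards[k] is assumed to hold R_{k+1} (the reward for the step from k to k+1);
    bootstrap_value is the current estimate V(S_{t+n}), used only if t + n < T.
    """
    h = min(t + n, T)  # if the return extends past termination, the missing terms are zero
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if t + n < T:      # at or beyond termination the n-step return equals the full return
        G += gamma ** n * bootstrap_value
    return G
```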
For example: n-step TD methods on the Random Walk. We use n-step TD methods to estimate the state values of a random-walk task. A grid search over the step size alpha and the step number n gives the error for each combination. The error is smallest for intermediate values of n, which once again shows that the methods at the two extremes, MC and TD(0), do not perform as well.
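A rough, self-contained sketch of this kind of experiment (assuming a 19-state random walk with rewards of -1 and +1 at the two ends, as in Sutton & Barto's example; the sizes and function names are illustrative, not taken from the original):

```python
import numpy as np

def random_walk_episode(n_states, rng):
    """States 1..n_states; start in the middle; terminate left (-1) or right (+1)."""
    s = (n_states + 1) // 2
    states, rewards = [s], []
    while 0 < s < n_states + 1:
        s += rng.choice([-1, 1])
        rewards.append(-1.0 if s == 0 else 1.0 if s == n_states + 1 else 0.0)
        states.append(s)
    return states, rewards

def n_step_td_prediction(n, alpha, episodes=10, gamma=1.0, n_states=19, seed=0):
    """Estimate the random-walk state values with n-step TD prediction."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)                      # indices 0 and n_states+1 are terminal
    for _ in range(episodes):
        states, rewards = random_walk_episode(n_states, rng)
        T = len(rewards)
        for t in range(T):                          # update each visited state in order
            h = min(t + n, T)
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
            if t + n < T:
                G += gamma ** n * V[states[t + n]]  # bootstrap on the current estimate
            V[states[t]] += alpha * (G - V[states[t]])
    return V[1:n_states + 1]
```

Sweeping `alpha` and `n` over a grid and averaging the RMS error against the true values over many runs reproduces the U-shaped curves in which intermediate `n` gives the lowest error.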
The update rule of the n-step Sarsa algorithm is
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\,[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], \qquad 0 \le t < T,$$
with the n-step return
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}) \quad (t + n < T).$$
The backup diagram of n-step Sarsa is shown below.
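A hedged sketch of a single n-step Sarsa update in a tabular, episodic setting; `Q` is assumed to be a dict keyed by (state, action) pairs, and the trajectory lists are illustrative:

```python
def n_step_sarsa_update(Q, states, actions, rewards, t, n, alpha, gamma):
    """Apply the n-step Sarsa update to Q[states[t], actions[t]].

    rewards[k] holds R_{k+1}; when t + n < T (T = len(rewards)), the pair
    (states[t+n], actions[t+n]) must already have been selected, e.g. epsilon-greedily.
    """
    T = len(rewards)
    h = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if t + n < T:                                   # bootstrap on the sampled next action
        G += gamma ** n * Q[states[t + n], actions[t + n]]
    Q[states[t], actions[t]] += alpha * (G - Q[states[t], actions[t]])
```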
n-step Expected Sarsa. This algorithm can be described by the same equation as n-step Sarsa (above), except with the n-step return redefined as
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \bar{V}_{t+n-1}(S_{t+n}),$$
where $\bar{V}_t(s) = \sum_a \pi(a|s)\,Q_t(s,a)$ is the expected approximate value of state $s$ under the target policy. Just as with the difference between Sarsa and Expected Sarsa, we only replace the last term of the update target with an expected value. If $s$ is terminal, its expected value is defined to be 0.
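A small sketch of the expected-value bootstrap term, assuming `Q` is a dict over (state, action) pairs and `policy_probs(state)` returns a dict of action probabilities under the target policy (both names are illustrative):

```python
def expected_state_value(Q, policy_probs, state, terminal=False):
    """V_bar(s) = sum_a pi(a|s) * Q(s, a); defined as 0 for a terminal state."""
    if terminal:
        return 0.0
    return sum(p * Q[state, a] for a, p in policy_probs(state).items())
```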
n-step Off-policy Learning. Recall that off-policy learning is learning the value function for one policy, pi, while following another policy, b. Often, pi is the greedy policy for the current action-value function estimate, and b is a more exploratory policy, perhaps epsilon-greedy. In order to use the data from b we must take into account the difference between the two policies, using their relative probability of taking the actions that were taken, the importance sampling ratio
$$\rho_{t:h} = \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
The n-step off-policy update of the value function is
$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\,\rho_{t:t+n-1}\,[G_{t:t+n} - V_{t+n-1}(S_t)], \qquad 0 \le t < T.$$
Similarly, our previous n-step Sarsa update can be completely replaced by a simple off-policy form:
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\,\rho_{t+1:t+n}\,[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)].$$
Note the subscripts of the importance sampling ratio in the two cases. This is because we are updating a state-action pair: we do not care how probable it was to select that action, since it has already been selected; importance sampling only applies to the selection of the subsequent actions. This explanation also made me understand why Q-learning and Sarsa do not use importance sampling.
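A minimal sketch of the truncated importance sampling ratio, assuming `pi(a, s)` and `b(a, s)` return the action probabilities of the target and behavior policies (illustrative signatures):

```python
def importance_sampling_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho
```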
Recall what a backup is: the operation of updating the value of the current state using the values of subsequent states. All update operations can be called backups; the differences lie in which states are used and how. Hanging off to the sides of each state are the actions that were not selected. (For the last state, all the actions are considered to have not (yet) been selected.) Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update. In the tree-backup update, the target includes all these things plus the estimated values of the dangling action nodes hanging off the sides, at all levels. This is why it is called a tree-backup update; it is an update from the entire tree of estimated action values.
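A hedged sketch of the recursive n-step tree-backup return (the form given in Sutton & Barto, Section 7.5); `Q` is assumed to be a dict over (state, action) pairs, `pi(a, s)` the target-policy probability, and `action_set` the list of all actions:

```python
def tree_backup_return(Q, pi, action_set, states, actions, rewards, t, n, gamma):
    """n-step tree-backup return G_{t:t+n}, written recursively.

    rewards[k] holds R_{k+1}; T = len(rewards) is the episode length.
    """
    T = len(rewards)
    if t >= T - 1:                              # G_{T-1:t+n} = R_T
        return rewards[T - 1]
    s1, a1 = states[t + 1], actions[t + 1]
    # estimated values of the dangling (unselected) action nodes at this level
    unselected = sum(pi(a, s1) * Q[s1, a] for a in action_set if a != a1)
    if n == 1:                                  # one-step target: full expectation over actions
        return rewards[t] + gamma * (unselected + pi(a1, s1) * Q[s1, a1])
    # follow the selected action one level deeper into the tree
    deeper = tree_backup_return(Q, pi, action_set, states, actions, rewards,
                                t + 1, n - 1, gamma)
    return rewards[t] + gamma * (unselected + pi(a1, s1) * deeper)
```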
Eligibility traces - Chapter 12, which enable bootstrapping over multiple time intervals simultaneously.
Step:
What to Learn in Model-Free RL
some basic concepts
(In other words) it directly learns a policy, which tells the agent which action to take in a given state.
There are a few approaches for solving these kinds of problems
Monte-Carlo Policy Evaluation
There are two different types of MC Policy Evaluation: first-visit MC and every-visit MC
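A minimal sketch of first-visit MC policy evaluation (the every-visit variant simply drops the first-visit check); `generate_episode(policy)` is an assumed helper that returns the lists of visited states and received rewards:

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate V_pi by averaging returns from the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        states, rewards = generate_episode(policy)   # rewards[t] holds R_{t+1}
        first_visit = {s: t for t, s in reversed(list(enumerate(states)))}
        G = 0.0
        for t in reversed(range(len(rewards))):      # accumulate the return backwards
            G = rewards[t] + gamma * G
            s = states[t]
            if first_visit[s] == t:                  # only the first visit to s counts
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```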
Monte-Carlo Control
Exploration/Exploitation trade off
How can they learn about the optimal policy while behaving according to an exploratory policy? 1) The on-policy approach in the preceding section is actually a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores. 2) Off-policy learning: use two policies, one that is learned about and becomes the optimal policy (the target policy), and one that is more exploratory and is used to generate behavior (the behavior policy).
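A small sketch of the two policies involved: a greedy target policy derived from Q, and an epsilon-greedy behavior policy used to generate experience (names and signatures are illustrative):

```python
import numpy as np

def greedy_action(Q, state, action_set):
    """Target policy: pick the action with the highest estimated value."""
    return max(action_set, key=lambda a: Q[state, a])

def epsilon_greedy_action(Q, state, action_set, epsilon, rng):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return action_set[rng.integers(len(action_set))]
    return greedy_action(Q, state, action_set)
```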
Sarsa (state-action-reward-state-action): On-policy TD Control. Sarsa is an on-policy TD control method. In the previous section we considered transitions from state to state and learned the values of states. Now we consider transitions from state-action pair to state-action pair, and learn the values of state-action pairs. Here the TD error is $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$, and the update is $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\delta_t$. The backup diagram for Sarsa is shown below.
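The one-step Sarsa update as a small Python sketch (Q is assumed to be a dict over (state, action) pairs):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done):
    """On-policy: bootstrap on the action actually selected in s_next."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```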
Q-Learning: Off-policy TD Control. An off-policy TD control algorithm in which the learned action-value function Q directly approximates the optimal action-value function q*, independent of the behaviour policy: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$. The backup diagram for Q-learning is shown below.
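The corresponding one-step Q-learning update as a sketch (same assumptions about Q; `action_set` lists the available actions):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, action_set, done):
    """Off-policy: bootstrap on the greedy (max) action in s_next, whatever was actually taken."""
    target = r if done else r + gamma * max(Q[s_next, a2] for a2 in action_set)
    Q[s, a] += alpha * (target - Q[s, a])
```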
Expected Sarsa: similar to Q-learning, but instead of using the maximum over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})\,Q(S_{t+1}, a) - Q(S_t, A_t)]$. The backup diagram for Expected Sarsa is shown below.
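The Expected Sarsa update as a sketch, assuming `policy_probs(state)` returns a dict of action probabilities under the current policy (an illustrative helper, not from the original notes):

```python
def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, policy_probs, done):
    """Bootstrap on the expectation of Q over the policy's action probabilities in s_next."""
    if done:
        target = r
    else:
        target = r + gamma * sum(p * Q[s_next, a2]
                                 for a2, p in policy_probs(s_next).items())
    Q[s, a] += alpha * (target - Q[s, a])
```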
Question: How can we understand exploring starts in MC control? How can we understand bootstrapping in RL? Is a backup the same as a bootstrap?