Wunder2dream / RL

basic knowledge about Reinforcement Learning

Model free RL #1

Open Wunder2dream opened 4 years ago

Wunder2dream commented 4 years ago

1) A model-based algorithm is an algorithm that uses the transition function (and the reward function) in order to estimate the optimal policy.

The agent might have access only to an approximation of the transition and reward functions, which can be learned by the agent while it interacts with the environment, or which can be given to the agent (e.g. by another agent).

In general, in a model-based algorithm, the agent can potentially predict the dynamics of the environment (during or after the learning phase), because it has an estimate of the transition function (and reward function). However, note that the transition and reward functions that the agent uses in order to improve its estimate of the optimal policy might just be approximations of the "true" functions. Hence, the optimal policy might never be found (because of these approximations).

2) A model-free algorithm is an algorithm that estimates the optimal policy without using or estimating the dynamics (transition and reward functions) of the environment.

In practice, a model-free algorithm either estimates a "value function" or the "policy" directly from experience (that is, from the interaction between the agent and the environment), without using either the transition function or the reward function. A value function can be thought of as a function which evaluates a state (or an action taken in a state), for all states. From this value function, a policy can then be derived.

To tell the two classes apart, look at the algorithm and check whether it uses (or estimates) the transition or reward function.
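To make the distinction concrete, here is a minimal sketch (hypothetical tabular setup, not code from any particular library): a model-based backup averages over an explicit transition model, while a model-free update uses only a single sampled transition.

```python
# Hypothetical tabular setting for illustration: V is a dict of state values,
# P[s][a] is a list of (probability, next_state, reward) triples (the model),
# and gamma is the discount factor.

def model_based_backup(V, P, s, a, gamma=0.9):
    """Model-based: the (estimated) transition/reward model is used explicitly."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])

def model_free_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Model-free: only a single sampled transition (s, r, s') is used, no model."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```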

Below is a non-exhaustive taxonomy of RL algorithms (RL taxonomy figure).

  1. Prediction and Control

Prediction: this type of task predicts the expected total reward from any given state, assuming the policy π(a|s) is given. That is, given a policy π, it evaluates the value function Vπ, with or without a model.

For example, model-free prediction estimates the value function of an unknown MDP; policy evaluation in Dynamic Programming is its model-based counterpart.

Control: this type of task finds the policy π(a|s) that maximizes the expected total reward from any given state. That is, starting from some policy π, it finds the optimal policy π*.

For example, model-free control optimises the value function of an unknown MDP; policy improvement in Dynamic Programming is its model-based counterpart.

Policy iteration is the combination of both, used to find the optimal policy. Just as supervised learning has regression and classification tasks, reinforcement learning has prediction and control tasks.
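As an illustration of how prediction and control interleave, a minimal policy-iteration sketch, assuming a small tabular MDP whose model is given as `P[s][a]` = list of (probability, next_state, reward) triples (an illustrative layout, not a specific library's API):

```python
def policy_iteration(P, states, actions, gamma=0.9, theta=1e-6):
    """Alternate prediction (policy evaluation) and control (policy improvement)."""
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}
    while True:
        # Prediction: evaluate the current policy until the values stop changing
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Control: improve the policy greedily with respect to the new values
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```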

  2. On-policy and Off-Policy

On-policy learning: it learns "on the job", which means it evaluates or improves the policy that is used to make the decisions.

In other words, it directly learns the policy that tells you which action to take in a given state.

Off-policy learning: it evaluates one policy (the target policy) while following another policy (the behavior policy), just as we learn to do something by observing others doing the same thing. The target policy may be deterministic (e.g. greedy) while the behavior policy is stochastic.
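The difference shows up most clearly in the update target. A minimal tabular sketch (illustrative names only) contrasting Sarsa, which is on-policy, with Q-learning, which is off-policy:

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    """On-policy: bootstrap from the action actually chosen by the behavior policy."""
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.99):
    """Off-policy: bootstrap from the greedy (target-policy) action,
    regardless of which action the behavior policy will actually take next."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)
```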

  3. Episodic and Continuous tasks

Episodic task: a task which lasts a finite amount of time (an episode). Example: playing a game of chess. Continuous task: a task which never ends. Example: trading in the cryptocurrency markets, or learning Machine Learning on the internet.

There are a few approaches for solving these kinds of problems:

  1. Monte-Carlo Reinforcement Learning
  2. Temporal-Difference Learning

Monte Carlo control (in its idealized form) rests on two assumptions:

1. The episodes have exploring starts. Exploring starts: every state-action pair has a non-zero probability of being the starting pair.
2. Policy evaluation could be done with an infinite number of episodes. This assumption is relatively easy to remove:
   1) One approach is to hold firm to the idea of approximating the action-value function q_{πk} in each policy evaluation. Measurements and assumptions are made to obtain bounds on the magnitude and probability of error in the estimates, and then sufficient steps are taken during each policy evaluation to assure that these bounds are sufficiently small.
   2) A second approach is to avoid the infinite number of episodes nominally required for policy evaluation, by giving up trying to complete policy evaluation before returning to policy improvement.
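Exploring starts can be made concrete with a small sketch (hypothetical environment API `env.step_from`, illustrative only): the episode is forced to begin from a uniformly random state-action pair, and only afterwards does the current policy take over.

```python
import random

def generate_episode_with_exploring_start(env, policy, states, actions):
    """Exploring starts: begin from a uniformly random (state, action) pair so that
    every pair has a non-zero probability of being the starting pair; after the
    first step, actions follow the current policy."""
    s, a = random.choice(states), random.choice(actions)
    episode = []
    while True:
        s_next, r, done = env.step_from(s, a)  # hypothetical API: transition from (s, a)
        episode.append((s, a, r))
        if done:
            return episode
        s, a = s_next, policy[s_next]
```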

Exploration/Exploitation trade-off

How can an agent learn about the optimal policy while behaving according to an exploratory policy?
1) The on-policy approach in the preceding section is actually a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
2) The off-policy approach: use two policies, one that is learned about and becomes the optimal policy (the target policy), and one that is more exploratory and is used to generate behavior (the behavior policy).

  • Assumption of coverage: in order to use episodes from b to estimate values for π, we require that every action taken under π is also taken, at least occasionally, under b: π(a|s) > 0 implies b(a|s) > 0.
  • Importance sampling: importance sampling is a general technique for estimating expected values under one distribution given samples from another. We apply it to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio: ρ_{t:T-1} = Π_{k=t..T-1} π(A_k|S_k) / b(A_k|S_k).
  • Incremental implementation: off-policy Monte Carlo prediction and off-policy Monte Carlo control (see the pseudocode figures).
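A minimal sketch of incremental off-policy Monte Carlo prediction with weighted importance sampling, adapted here to state values (the episode and policy representations below are assumptions for illustration, not the book's exact pseudocode):

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, target_pi, behavior_b, gamma=1.0):
    """Off-policy MC prediction with weighted importance sampling, updated
    incrementally, processing each episode back to front.
    `episodes` is a list of [(state, action, reward), ...] generated under b,
    where reward is the reward received after taking the action;
    target_pi(a, s) and behavior_b(a, s) return action probabilities."""
    V = defaultdict(float)
    C = defaultdict(float)   # cumulative sum of importance-sampling weights per state
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            # the ratio for V(S_t) includes the action taken at time t
            W *= target_pi(a, s) / behavior_b(a, s)
            if W == 0.0:
                break        # all earlier updates in this episode would have zero weight
            C[s] += W
            V[s] += (W / C[s]) * (G - V[s])
    return V
```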

Advantages of TD Prediction Methods

  • Advantage over DP methods: they do not require a model of the environment, i.e. of its reward and next-state probability distributions.
  • Advantage over Monte Carlo methods: they are naturally implemented in an online, fully incremental fashion.
  • In practice, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.
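For concreteness, a minimal tabular TD(0) prediction sketch showing the online, incremental update the bullets describe (the `env.reset`/`env.step` interface is assumed, not tied to a specific library):

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma*V(S_{t+1}) - V(S_t)].
    The update is applied online, after every single step, with no model required."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                    # behave according to the evaluated policy
            s_next, r, done = env.step(a)    # assumed (next_state, reward, done) interface
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # fully incremental update
            s = s_next
    return V
```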

Question: How can we understand exploring starts in MC control? How can we understand bootstrapping in RL? Is a backup the same as bootstrapping?

Wunder2dream commented 4 years ago

Supervised learning is learning from a training set of labeled examples provided by a knowledgeable external supervisor. Each example is a description of a situation together with a specification (the label) of the correct action the system should take in that situation, which is often to identify a category to which the situation belongs.

Reinforcement learning: In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In RL, an agent must be able to learn from its own experience.

Unsupervised learning: is typically about finding structure hidden in collections of unlabeled data. The terms supervised learning and unsupervised learning would seem to exhaustively classify machine learning paradigms.

RL: Uncovering structure in an agent's experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning problem of maximizing a reward signal.

Wunder2dream commented 4 years ago

What is constrained RL? In this part, I am not sure where I should focus, because the constraints in RL are so different in each case. For example, the paper https://arxiv.org/pdf/1812.02900.pdf focuses on Batch-Constrained deep Q-learning (BCQ).

git-thor commented 4 years ago

Good summary and usable as a sound starting point for your further work :-) I'll come back to this tomorrow.

Wunder2dream commented 4 years ago

n-step methods span a spectrum with MC methods at one end and one-step TD methods at the other. Conclusion: The best methods are often intermediate between the two extremes.

In many applications one wants to be able to update the action very fast to take into account anything that has changed, but bootstrapping works best if it is over a length of time in which a significant and recognizable state change has occurred. In TD(0) we always update the target after a single observation step, using the estimated value of the next state; this is called one-step bootstrapping. In practice, bootstrapping works best over a stretch of time in which the state changes noticeably.

Unified View of Reinforcement Learning (figure)

Target of update: n-step updates are still TD methods, because they still change an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but n steps later.

(1) For MC, the estimate of v_π(S_t) is updated in the direction of the complete return:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{T-t-1} R_T

where T is the last time step of the episode.

(2) For TD(0) / one-step TD, the update is toward the one-step return:

G_{t:t+1} = R_{t+1} + γ V_t(S_{t+1})

(3) For two-step TD, the update is based on the two-step return:

G_{t:t+2} = R_{t+1} + γ R_{t+2} + γ² V_{t+1}(S_{t+2})

(4) For n-step TD, the update is based on the n-step return:

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n V_{t+n-1}(S_{t+n}),  for n ≥ 1 and 0 ≤ t < T - n.

Note: if t + n ≥ T (if the n-step return extends to or beyond termination), then all the missing terms are taken as zero, and the n-step return is defined to be equal to the ordinary full return, G_{t:t+n} = G_t.

Note that n-step returns for n > 1 involve future rewards and states that are not available at the time of transition from t to t+1. No real algorithm can use the n-step return until after it has seen R_{t+n} and computed V_{t+n-1}. The natural state-value update is then

V_{t+n}(S_t) = V_{t+n-1}(S_t) + α [ G_{t:t+n} - V_{t+n-1}(S_t) ],  0 ≤ t < T,

while the values of all other states remain unchanged (this is the n-step TD algorithm).
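A small sketch of the n-step return computation (illustrative names, tabular values) makes it explicit that the update for time t can only be formed at time t + n:

```python
def n_step_return(rewards, V, states, t, n, T, gamma=0.99):
    """Compute G_{t:t+n} from the stored rewards (rewards[k] = R_k) and the
    current value estimates V. If t + n reaches or passes termination T, the
    bootstrap term is dropped and this reduces to the ordinary full return G_t."""
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, horizon + 1))
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    return G
```

The update for V(S_t) can then only be made at time t + n, e.g. `V[states[t]] += alpha * (G - V[states[t]])`.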

For example: n-step TD methods on the Random Walk (figure). We use n-step TD methods to estimate the values of a random-walk problem. A grid search over different step sizes α and step numbers n gives the corresponding errors. The error is smallest for intermediate values of n, which again shows that neither of the two extreme methods, MC or TD(0), performs particularly well.

The update rule of the n-step Sarsa algorithm is

Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + α [ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) ],  0 ≤ t < T,

where the n-step return is now defined in terms of action values:

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n Q_{t+n-1}(S_{t+n}, A_{t+n}).

Similarly, our previous n-step Sarsa update can be completely replaced by a simple off-policy form:

Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + α ρ_{t+1:t+n} [ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) ],

where ρ_{t+1:t+n} is the importance-sampling ratio over the actions A_{t+1}, ..., A_{t+n}.

Note the subscripts of the importance-sampling ratio in the two cases. This is because we are updating a state-action pair: we do not care how likely the agent was to select that action, since it has already been selected; importance sampling is only applied to the subsequent actions. This explanation also helped me understand why Q-learning and (one-step) Sarsa do not use importance sampling.
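To make the subscript difference concrete, a small helper sketch (illustrative; `pi` and `b` are assumed to return action probabilities) that computes the ratio over a stored trajectory:

```python
def is_ratio(actions, states, pi, b, start, end):
    """Product of pi(A_k|S_k) / b(A_k|S_k) for k = start..end (inclusive)."""
    rho = 1.0
    for k in range(start, end + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho

# For an n-step *state-value* update of V(S_t), the correction starts at time t:
#   rho = is_ratio(actions, states, pi, b, t, t + n - 1)
# For an n-step *Sarsa* update of Q(S_t, A_t), A_t is already taken, so it starts at t+1:
#   rho = is_ratio(actions, states, pi, b, t + 1, t + n)
```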


Wunder2dream commented 4 years ago

Eligibility traces (Chapter 12), which enable bootstrapping over multiple time intervals simultaneously.

Wunder2dream commented 3 years ago

The argparse module in Python

Steps:

  1. import argparse: import the module
  2. parser = argparse.ArgumentParser(): create a parser object
  3. parser.add_argument(): add the arguments or options of interest to the parser
  4. parser.parse_args(): finally, call parse_args() to parse the command line. http://zetcode.com/python/argparse/
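A minimal runnable example along these lines (the option names are just placeholders):

```python
import argparse

# 1. create a parser object
parser = argparse.ArgumentParser(description="Minimal argparse example")

# 2. add the arguments/options of interest
parser.add_argument("--episodes", type=int, default=100, help="number of episodes")
parser.add_argument("--alpha", type=float, default=0.1, help="learning rate")
parser.add_argument("--render", action="store_true", help="render the environment")

# 3. parse the command line
args = parser.parse_args()
print(args.episodes, args.alpha, args.render)
```

Usage (with a hypothetical script name): `python train.py --episodes 500 --alpha 0.05 --render`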