ganler / ResearchReading

Reading notes on general systems research material (not limited to papers).

[Course: UCB CS285 Fall@19 + 李宏毅] Deep Reinforcement Learning #20


ganler commented 4 years ago

UCB CS285

Course webpage: http://rail.eecs.berkeley.edu/deeprlcourse-fa19/index.html#lecture-videos
Also see: https://github.com/ganler/ResearchReading/issues/17

李宏毅 (Hung-yi Lee) RL Course (in Chinese)

https://www.bilibili.com/video/BV1uE411K7XK?p=2

ganler commented 4 years ago

Policy Gradients (UCB Lecture 5 + Hung-yi Lee Lecture 1~2)

(figure from the lecture slides)
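For reference, the gradient estimator this lecture builds up to is roughly the following (my reconstruction of the standard REINFORCE form, not a copy of the slide):

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
\left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right)
\left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)
```

That is: sample N trajectories with the current policy, weight each trajectory's log-probability gradient by its total reward, and average.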

What is wrong with the policy gradient?

Sampling instead of the true expectation

The gradient is a Monte Carlo estimate from sampled trajectories; with too few samples the estimate has high variance and may miss important actions entirely.
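A tiny sketch of the issue (a generic Monte Carlo example of my own, not from the course): the same sampled-mean estimator swings wildly when the sample count is small.

```python
import numpy as np

rng = np.random.default_rng(0)

# True value of E[x^2] for x ~ N(0, 1) is exactly 1.0.
# The policy gradient is the same kind of sampled mean, so it
# inherits the same small-sample variance.
for n in (5, 50, 5000):
    estimates = [float((rng.normal(size=n) ** 2).mean()) for _ in range(3)]
    print(f"n={n}: " + ", ".join(f"{e:.3f}" for e in estimates))
```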

Positive Reward

Suppose the reward is always positive:

After normalization, actions whose probability grows more slowly end up with very low probability.

So a sampled bad action (positive reward, so its probability is still pushed up) can become more likely than a good action that was simply never sampled...

So the effective reward should be allowed to go negative (a toy demonstration is sketched below)...
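A toy demonstration of this failure mode (my own construction, assuming a softmax policy over three discrete actions; not from the lecture): all rewards are positive, action 2 is the best but never gets sampled, and its probability still collapses.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.zeros(3)
rewards = np.array([1.0, 2.0, 3.0])  # all positive; action 2 is the best
lr = 0.5

for _ in range(20):
    for a in (0, 1):                 # action 2 is never sampled
        probs = softmax(logits)
        grad = -probs
        grad[a] += 1.0               # gradient of log pi(a) w.r.t. logits
        logits += lr * rewards[a] * grad

# Action 2's probability has shrunk toward zero even though it is best:
print(softmax(logits))
```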

Cumulative reward

Say we are playing chess: if we lose, does that mean every move in the game was terrible?

Solutions

Solution => Baseline.

R => (R - b): b is a "baseline" (e.g., the average return) subtracted so that below-average returns become negative.
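Subtracting a constant b keeps the estimator unbiased, since (a standard step, consistent with the course derivation):

```latex
\mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, b\big]
= b \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, d\tau
= b \int \nabla_\theta p_\theta(\tau)\, d\tau
= b\, \nabla_\theta \int p_\theta(\tau)\, d\tau
= b\, \nabla_\theta 1
= 0
```

A common simple choice is to set b to the average of the sampled returns.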

Solution => Assign Suitable Credit | Causality (Reward-to-go)

We can assign different weights/credits to different actions.

In particular, when each step yields its own feedback, the weight for the action at step t should be the reward-to-go, SUM{r_current_step ... r_final_step}: only the rewards from the current action onward (see the helper sketched below).

(figure from the lecture slides)
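A minimal helper for this (my own sketch, not the course's starter code); gamma=1.0 gives the plain reward-to-go, and the discount factor below just sets gamma < 1:

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Reward-to-go for each step: sum of (discounted) future rewards.

    Computed by a backward scan: q[t] = r[t] + gamma * q[t+1].
    """
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

print(rewards_to_go([1.0, 0.0, 2.0]))             # [3. 2. 2.]
print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.9))  # [2.62 1.8 2. ]
```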

Discount factor: rewards further in the future should contribute less to the current action's credit.

(figure from the lecture slides)
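Equivalently (gamma < 1 in the helper above), the weighted credit becomes:

```latex
\hat{Q}_{i,t} = \sum_{t'=t}^{T} \gamma^{\,t'-t}\; r(s_{i,t'}, a_{i,t'}),
\qquad \gamma \in (0, 1]
```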

Policy gradient => off-policy

The vanilla policy gradient is on-policy and therefore sample-inefficient: once the parameters are updated, trajectories collected under the old policy can no longer be used directly.
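The standard fix is importance sampling, which re-weights samples drawn from an old policy q so they remain usable after an update (the step toward PPO in Hung-yi Lee's lectures):

```latex
\mathbb{E}_{x \sim p}\big[f(x)\big]
= \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\, f(x)\right]
```

The ratio p(x)/q(x) has high variance when the two distributions drift apart, which is why PPO constrains how far the new policy can move from the old one.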

TODO...