ganler / ResearchReading

Reading notes on general systems research material (not limited to papers).

[Course: UCB CS285 Fall@19 + 李宏毅] Deep Reinforcement Learning #20


ganler commented 4 years ago

UCB CS285

Course webpage: http://rail.eecs.berkeley.edu/deeprlcourse-fa19/index.html#lecture-videos
Also see: https://github.com/ganler/ResearchReading/issues/17

李宏毅 (Hung-yi Lee) RL Course (in Chinese)

https://www.bilibili.com/video/BV1uE411K7XK?p=2

ganler commented 4 years ago

Policy Gradients (UCB Lecture 5 + Hung-yi Lee Lecture 1~2)

(figure from the lecture slides)
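For reference, the gradient estimator this lecture builds up to is roughly the following (my reconstruction of the standard REINFORCE form, not a copy of the slide):

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
\left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right)
\left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)
```

That is: sample N trajectories with the current policy, weight each trajectory's log-probability gradient by its total reward, and average.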

What is wrong with the policy gradient?

Sampling instead of the true expectation

The gradient is a Monte Carlo estimate from sampled trajectories; with too few samples the estimate has high variance and may miss important actions entirely.
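A tiny sketch of the issue (a generic Monte Carlo example of my own, not from the course): the same sampled-mean estimator swings wildly when the sample count is small.

```python
import numpy as np

rng = np.random.default_rng(0)

# True value of E[x^2] for x ~ N(0, 1) is exactly 1.0.
# The policy gradient is the same kind of sampled mean, so it
# inherits the same small-sample variance.
for n in (5, 50, 5000):
    estimates = [float((rng.normal(size=n) ** 2).mean()) for _ in range(3)]
    print(f"n={n}: " + ", ".join(f"{e:.3f}" for e in estimates))
```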

Positive Reward

Suppose the reward is always positive:

After normalization, actions whose probability grows more slowly end up with very low probability.

So a sampled bad action (positive reward, so its probability is still pushed up) can become more likely than a good action that was simply never sampled...

So the effective reward should be allowed to go negative (a toy demonstration is sketched below)...
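A toy demonstration of this failure mode (my own construction, assuming a softmax policy over three discrete actions; not from the lecture): all rewards are positive, action 2 is the best but never gets sampled, and its probability still collapses.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.zeros(3)
rewards = np.array([1.0, 2.0, 3.0])  # all positive; action 2 is the best
lr = 0.5

for _ in range(20):
    for a in (0, 1):                 # action 2 is never sampled
        probs = softmax(logits)
        grad = -probs
        grad[a] += 1.0               # gradient of log pi(a) w.r.t. logits
        logits += lr * rewards[a] * grad

# Action 2's probability has shrunk toward zero even though it is best:
print(softmax(logits))
```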

Cumulative reward

Say we are playing chess: if we lose, does that mean every move in the game was terrible?

Solutions

Solution => Baseline.

R => (R - b): b is a "baseline" (e.g., the average return) subtracted so that below-average returns become negative.
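Subtracting a constant b keeps the estimator unbiased, since (a standard step, consistent with the course derivation):

```latex
\mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, b\big]
= b \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, d\tau
= b \int \nabla_\theta p_\theta(\tau)\, d\tau
= b\, \nabla_\theta \int p_\theta(\tau)\, d\tau
= b\, \nabla_\theta 1
= 0
```

A common simple choice is to set b to the average of the sampled returns.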

Solution => Assign Suitable Credit | Causality (Reward-to-go)

We can assign different weights/credits to different actions.

In particular, when each step yields its own feedback, the weight for the action at step t should be the reward-to-go, SUM{r_current_step ... r_final_step}: only the rewards from the current action onward (see the helper sketched below).

(figure from the lecture slides)
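A minimal helper for this (my own sketch, not the course's starter code); gamma=1.0 gives the plain reward-to-go, and the discount factor below just sets gamma < 1:

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Reward-to-go for each step: sum of (discounted) future rewards.

    Computed by a backward scan: q[t] = r[t] + gamma * q[t+1].
    """
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

print(rewards_to_go([1.0, 0.0, 2.0]))             # [3. 2. 2.]
print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.9))  # [2.62 1.8 2. ]
```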

Discount factor: rewards further in the future should contribute less to the current action's credit.

(figure from the lecture slides)
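Equivalently (gamma < 1 in the helper above), the weighted credit becomes:

```latex
\hat{Q}_{i,t} = \sum_{t'=t}^{T} \gamma^{\,t'-t}\; r(s_{i,t'}, a_{i,t'}),
\qquad \gamma \in (0, 1]
```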

Policy gradient => off-policy

The vanilla policy gradient is on-policy and therefore sample-inefficient: once the parameters are updated, trajectories collected under the old policy can no longer be used directly.
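The standard fix is importance sampling, which re-weights samples drawn from an old policy q so they remain usable after an update (the step toward PPO in Hung-yi Lee's lectures):

```latex
\mathbb{E}_{x \sim p}\big[f(x)\big]
= \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\, f(x)\right]
```

The ratio p(x)/q(x) has high variance when the two distributions drift apart, which is why PPO constrains how far the new policy can move from the old one.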

TODO...