TiantongWang / MyBlogs

Personal Notebook

Policy Iteration #10

Open TiantongWang opened 5 years ago

TiantongWang commented 5 years ago

Since there are only finitely many policies in a finite-state, finite-action MDP, it is reasonable to expect that a search over policies should terminate in a finite number of steps.

Idea: for each state s, choose the action that maximizes the expected total reward collected if the baseline policy is followed from the next step onwards. This operation is called roll-out.
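A minimal sketch of one roll-out improvement step, on a hypothetical 2-state, 2-action deterministic MDP (the transition table, rewards, and discount factor are illustrative assumptions, not from the post):

```python
GAMMA = 0.9
# STEP[s][a] = (next_state, reward) for a toy deterministic MDP (made up here)
STEP = [
    [(0, 0.0), (1, 1.0)],   # state 0: action 0 stays, action 1 moves and pays 1
    [(0, 2.0), (1, 0.0)],   # state 1: action 0 returns and pays 2, action 1 stays
]

def evaluate(policy, iters=500):
    """Iteratively evaluate the baseline policy's value function V."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [STEP[s][policy[s]][1] + GAMMA * v[STEP[s][policy[s]][0]]
             for s in range(2)]
    return v

def rollout(policy):
    """For each state, act greedily for one step, then follow the baseline."""
    v = evaluate(policy)
    return [max(range(2), key=lambda a: STEP[s][a][1] + GAMMA * v[STEP[s][a][0]])
            for s in range(2)]

baseline = [0, 1]            # a deliberately poor baseline: never collects reward
improved = rollout(baseline)
print(improved)              # -> [1, 0]: the roll-out policy picks the rewarding actions
```

Note that the roll-out policy only needs the baseline policy's value function, so a single evaluation of the baseline suffices to improve it in every state at once.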

Given a baseline policy, an improved policy can be computed using roll-out, and the improved policy can be improved further by applying roll-out again. Since there are only finitely many policies, repeating this must eventually reach a policy that roll-out no longer changes.
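Repeating roll-out until the policy stops changing is the policy iteration loop. Below is a sketch on a hypothetical 3-state chain MDP (moving right toward the last state pays 1; all details are illustrative assumptions):

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2

def step(s, a):
    """Toy chain MDP (made up here): action 0 moves left, action 1 moves right;
    arriving at the rightmost state pays reward 1."""
    ns = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return ns, (1.0 if ns == N_STATES - 1 else 0.0)

def evaluate(policy, iters=500):
    """Iterative policy evaluation: repeatedly apply the Bellman backup."""
    v = [0.0] * N_STATES
    for _ in range(iters):
        v = [step(s, policy[s])[1] + GAMMA * v[step(s, policy[s])[0]]
             for s in range(N_STATES)]
    return v

def rollout(policy):
    """One roll-out improvement step against the baseline's value function."""
    v = evaluate(policy)
    return [max(range(N_ACTIONS),
                key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
            for s in range(N_STATES)]

def policy_iteration(policy):
    # Finitely many policies, and each roll-out step never makes the policy
    # worse, so this loop terminates at a fixed point of roll-out.
    while True:
        improved = rollout(policy)
        if improved == policy:
            return policy
        policy = improved

print(policy_iteration([0, 0, 0]))   # -> [1, 1, 1]: always move right
```

Starting from "always move left", each pass of roll-out propagates the reward one more state down the chain, and the loop stops once the greedy policy reproduces itself.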


TiantongWang commented 4 years ago

Sometimes it's hard to believe time can pass by this fast! Almost a year! Keep progressing!