Since there are only finitely many policies in a finite-state, finite-action MDP, it is reasonable to expect that a search over policies should terminate in a finite number of steps.
Idea: for each state s, choose the action that maximizes the expected total reward that will be collected if the baseline policy is used from the next step onwards. This is called roll-out.
Given a baseline policy, an improved policy can be computed using roll-out; applying roll-out to the improved policy improves it further, and so on (see the sketch below).
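A minimal sketch of this repeated roll-out scheme, assuming a small tabular MDP given as NumPy arrays `P[a, s, s']` (transition probabilities), `R[s, a]` (expected rewards), and a discount factor `gamma < 1`. All names and the array layout here are illustrative assumptions, not taken from the text.

```python
import numpy as np

def policy_value(P, R, policy, gamma):
    """Evaluate the baseline policy exactly by solving (I - gamma * P_pi) V = r_pi."""
    n_states = P.shape[1]
    P_pi = P[policy, np.arange(n_states)]      # P_pi[s, s'] = P[policy[s], s, s']
    r_pi = R[np.arange(n_states), policy]      # expected one-step reward under the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def rollout_improve(P, R, policy, gamma):
    """One roll-out step: in each state, pick the action that maximizes the
    expected total reward when the baseline policy is followed afterwards."""
    V = policy_value(P, R, policy, gamma)
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.argmax(axis=1)

def repeated_rollout(P, R, gamma):
    """Apply roll-out repeatedly until the policy stops changing; since a
    finite MDP has only finitely many policies, this terminates."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)     # arbitrary initial baseline policy
    while True:
        improved = rollout_improve(P, R, policy, gamma)
        if np.array_equal(improved, policy):
            return policy
        policy = improved
```

Under these assumptions, `rollout_improve` is exactly the greedy one-step lookahead against the baseline policy's value function, and the outer loop is the standard policy-iteration pattern: it halts as soon as roll-out no longer changes the policy.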