btaba / intro-to-rl

Coding examples for Intro to RL
MIT License

monte-carlo off-policy #3

Closed xubo92 closed 7 years ago

xubo92 commented 7 years ago

hi @btaba: Have you ever tried the off-policy method on the racetrack problem? I tried it, but the performance was quite bad. I found something that seems important in Sutton's book:

The off-policy method is only valid if the environment is such that all policies are proper, meaning that they produce episodes that always eventually terminate (this assumption was made on the first page of this chapter). This restriction can be lifted if the algorithm is modified to use ε-soft policies, which are proper for all environments. What modifications are needed to the algorithm to restrict it to ε-soft policies?

Do you know what that means? I've been confused about it for a few days.
btaba commented 7 years ago

Off-policy learning on the racetrack environment I made won't work well, precisely because of that comment in Sutton's book. Off-policy means that the policy executing in your environment (the behavior policy) is different from the policy you are optimizing (the target policy). For example, the behavior policy could be uniformly random. You wouldn't expect a random policy to ever finish an episode successfully in the racetrack environment, because the task is too hard, so the agent never observes the successful returns it needs to learn how to complete the racetrack.
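For concreteness, here is a minimal sketch (not the code in this repo) of off-policy Monte Carlo control with weighted importance sampling, along the lines of Sutton & Barto Chapter 5. The toy chain environment and every name in it are made up for illustration; it stands in for the racetrack only to show the mechanics. On the chain, the random behavior policy terminates most episodes, so learning works; on the racetrack it would almost never terminate, which is exactly the failure mode above.

```python
import numpy as np

# Toy chain MDP (illustrative stand-in for the racetrack): states 0..5,
# action 1 moves right, action 0 moves left, reward -1 per step,
# episode ends on reaching GOAL.
N_STATES, N_ACTIONS, GOAL = 6, 2, 5

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, -1.0, s_next == GOAL

def run_episode(behavior, max_steps=100):
    """Follow the behavior policy; return None if the episode never terminates."""
    s, episode = 0, []
    for _ in range(max_steps):
        a = np.random.choice(N_ACTIONS, p=behavior[s])
        s_next, r, done = step(s, a)
        episode.append((s, a, r))
        if done:
            return episode
        s = s_next
    return None  # improper: the behavior policy failed to finish

# Pessimistic init so unvisited actions don't look better than visited ones.
Q = np.full((N_STATES, N_ACTIONS), -100.0)
C = np.zeros((N_STATES, N_ACTIONS))                          # cumulative IS weights
behavior = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)   # uniform random

for _ in range(5000):
    episode = run_episode(behavior)
    if episode is None:
        continue  # a truncated episode teaches us nothing, as on the racetrack
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):          # weighted importance sampling
        G = r + G                              # undiscounted return
        C[s, a] += W
        Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
        if a != np.argmax(Q[s]):               # target policy: greedy in Q
            break                              # importance ratio hits zero
        W *= 1.0 / behavior[s, a]              # pi(a|s) = 1 for the greedy action

print(np.argmax(Q[:GOAL], axis=1))             # learned greedy policy
```

The modification the quoted exercise asks about is to make the policies ε-soft (e.g. ε-greedy around the current greedy policy) instead of fully greedy, so that, per the quoted passage, episodes stay proper while exploration is preserved.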

xubo92 commented 7 years ago

@btaba sorry for the delayed reply :) That explanation makes a lot of sense. Thanks a lot.