ShangtongZhang / reinforcement-learning-an-introduction

Python Implementation of Reinforcement Learning: An Introduction
MIT License

Chapter 11 #126

Closed mattgithub1919 closed 4 years ago

mattgithub1919 commented 4 years ago

Hello,

Thank you for your work. I have a question about the semi_gradient_off_policy_TD function. It looks like it is doing an on-policy update in line 79, since next_state is a uniform selection over the 7 states, while under the off-policy target it should only select the LOWER state. In my understanding, Figure 11.2 is off-policy, not on-policy. Correct me if I am wrong. Thank you.
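For context, my reading of Baird's counterexample is roughly the following sketch (my own paraphrase of the book's setup, not the repository's code; names and constants are mine):

```python
import numpy as np

# Baird's counterexample (Sutton & Barto, Section 11.2), as I read it.
# States 0-5 are the six "upper" states; state 6 is the LOWER state.
LOWER_STATE = 6
DASHED, SOLID = 0, 1

def behavior_policy():
    """b(dashed|s) = 6/7, b(solid|s) = 1/7, so the next state ends up
    uniformly distributed over all 7 states."""
    return SOLID if np.random.rand() < 1.0 / 7.0 else DASHED

def target_policy():
    """pi(solid|s) = 1: the target policy always takes the transition
    into the LOWER state."""
    return SOLID

def step(action):
    """The dashed action moves to one of the upper states uniformly;
    the solid action moves to the LOWER state. Reward is always 0."""
    next_state = np.random.randint(6) if action == DASHED else LOWER_STATE
    return next_state, 0.0
```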

Warm regards, Matt

ShangtongZhang commented 4 years ago

https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter11/counterexample.py#L73

It depends on the action, doesn't it?

mattgithub1919 commented 4 years ago

Thank you for your response. I think in Figure 11.2 the target policy doesn't depend on the action: it selects the LOWER state 100% of the time. You can check the highlighted sentences in the following picture.

[screenshot of the highlighted textbook passage describing the target policy]
ShangtongZhang commented 4 years ago

while under the off-policy target it should only select the LOWER state.

If the agent follows the behavior policy (b), why would it only select the LOWER state?

mattgithub1919 commented 4 years ago

Under the behavior policy, the next state is selected uniformly among all 7 states, and that's how we get the reward. However, in my understanding, you should use the target policy (which selects the LOWER state only) when calculating the TD error.

ShangtongZhang commented 4 years ago

That's wrong. If you can sample next_state using the target policy, then it is not off-policy at all.

mattgithub1919 commented 4 years ago

Yes, I agree with you. The problem is that when you compute r + v(s', w), you use next_state as s'. next_state comes from the behavior policy, not the target policy. s' should be the state under the target policy, which is 100% the LOWER state.

ShangtongZhang commented 4 years ago

s' should be the state under the target policy

This is wrong.

mattgithub1919 commented 4 years ago

Not sure why you think that is wrong. I think we shouldn't use next_state as s', because doing so makes it on-policy learning. The reason it still diverges is that the importance-sampling ratio rho is computed off-policy.

ShangtongZhang commented 4 years ago

When computing r + v(s', w), s' should be sampled from the behavior policy, and next_state in the code is indeed sampled from the behavior policy.
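Here is a minimal sketch of what I mean (simplified, not the repository's exact code; the features, initial weights, and step size follow the book's description of Baird's counterexample):

```python
import numpy as np

# Semi-gradient off-policy TD(0) on Baird's counterexample, sketched to
# show where next_state comes from and what rho does.
GAMMA = 0.99
ALPHA = 0.01
LOWER_STATE = 6

# Linear features from the book: v(s, w) = 2*w[s] + w[7] for the six
# upper states and v(6, w) = w[6] + 2*w[7] for the LOWER state.
FEATURES = np.zeros((7, 8))
for s in range(6):
    FEATURES[s, s] = 2.0
    FEATURES[s, 7] = 1.0
FEATURES[LOWER_STATE, 6] = 1.0
FEATURES[LOWER_STATE, 7] = 2.0

w = np.array([1.0, 1, 1, 1, 1, 1, 10, 1])  # initial weights from the book
state = np.random.randint(7)

for _ in range(1000):
    # The agent follows the behavior policy: solid action with
    # probability 1/7, dashed with probability 6/7.
    solid = np.random.rand() < 1.0 / 7.0
    # next_state is sampled from the BEHAVIOR policy ...
    next_state = LOWER_STATE if solid else np.random.randint(6)
    reward = 0.0
    # ... and the importance-sampling ratio rho = pi(a|s) / b(a|s)
    # does the off-policy correction (the target policy is always solid).
    rho = 7.0 if solid else 0.0
    delta = reward + GAMMA * FEATURES[next_state] @ w - FEATURES[state] @ w
    w = w + ALPHA * rho * delta * FEATURES[state]
    state = next_state

print(w)  # the weights grow without bound, as in Figure 11.2
```

Even though next_state is sampled from the behavior policy, rho reweights the update toward the target policy's expectation; that is what makes it off-policy.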

mattgithub1919 commented 4 years ago

Figure 11.2 is off-policy Q-learning, right? It would be on-policy if s' were the behavior policy's next_state.

ShangtongZhang commented 4 years ago

Figure 11.2 is off-policy Q-learning, right?

It's off-policy TD.

It would be on-policy if s' were the behavior policy's next_state.

This is wrong. You have a fundamental misunderstanding of on- and off-policy.

mattgithub1919 commented 4 years ago

OK. Thank you for your time. Have a great day!