mattgithub1919 closed this issue 4 years ago
https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter11/counterexample.py#L73 It depends on action, doesn't it?
Thank you for your response. I think in Figure 11.2, the target policy doesn't depend on action, it is 100% selecting LOWER STATE. You can check the highlighted sentences in the following pic.
while under off-policy it should only select LOWER STATE.
If the agent follows the behavior policy (b), why would it select only the LOWER STATE?
Under the behavior policy, the next state is selected uniformly among all 7 states, and that's how we get REWARD. However, in my understanding, you should use the target policy (which selects the LOWER STATE only) when calculating the TD error.
That's wrong. If you can sample next_state using the target policy, then it is not off-policy at all.
Yes, I agree with you. The problem is that when you compute r + v(s', w), you use next_state as s'. next_state is sampled from the behavior policy, not the target policy. s' should be the state under target policy which is 100% LOWER STATE.
s' should be the state under target policy
This is wrong.
Not sure why you think that is wrong. I think we shouldn't use next_state as s', because using next_state as s' would make it on-policy learning. The reason it still diverges is that rho is computed according to the off-policy setting.
When computing r + v(s', w), s' should be sampled from the behavior policy, and next_state in the code is indeed sampled from the behavior policy.
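To make the point above concrete, here is a minimal sketch of one semi-gradient off-policy TD(0) update on a Baird-style counterexample. It is not the repository's code: the constants and the names `importance_ratio`, `sample_transition`, and `off_policy_td_update` are assumptions chosen for illustration, with the feature layout and step size modeled on the book's description. The transition (action and next_state) comes from the behavior policy, and the target policy appears only in the ratio rho.

```python
import numpy as np

# Assumed setup modeled on Baird's counterexample (illustrative, not the repo's code).
N_STATES = 7
LOWER_STATE = 6
DASHED, SOLID = 0, 1
GAMMA = 0.99
REWARD = 0.0
BEHAVIOR_SOLID_PROB = 1.0 / 7  # b(solid | s); b(dashed | s) = 6/7

# Linear features: upper state i has value 2*w[i] + w[7]; lower state has w[6] + 2*w[7].
FEATURES = np.zeros((N_STATES, 8))
for i in range(LOWER_STATE):
    FEATURES[i, i] = 2.0
    FEATURES[i, 7] = 1.0
FEATURES[LOWER_STATE, 6] = 1.0
FEATURES[LOWER_STATE, 7] = 2.0


def importance_ratio(action):
    """rho = pi(a|s) / b(a|s); the target policy always takes the solid action."""
    return 1.0 / BEHAVIOR_SOLID_PROB if action == SOLID else 0.0


def sample_transition(rng):
    """Sample an action and the next state by following the BEHAVIOR policy."""
    action = SOLID if rng.random() < BEHAVIOR_SOLID_PROB else DASHED
    if action == SOLID:
        next_state = LOWER_STATE
    else:
        next_state = int(rng.integers(LOWER_STATE))  # uniform over upper states
    return action, next_state


def off_policy_td_update(w, state, action, next_state, alpha=0.01):
    """One semi-gradient off-policy TD(0) step on a behavior-policy transition."""
    rho = importance_ratio(action)
    delta = REWARD + GAMMA * FEATURES[next_state] @ w - FEATURES[state] @ w
    return w + alpha * rho * delta * FEATURES[state]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.ones(8)
    w[6] = 10.0
    state = int(rng.integers(N_STATES))
    for _ in range(1000):
        action, next_state = sample_transition(rng)
        w = off_policy_td_update(w, state, action, next_state)
        state = next_state
    print(w)
```

Note that transitions taken with the dashed action get rho = 0 and leave the weights unchanged, so only transitions consistent with the target policy contribute, even though every next_state was generated by the behavior policy.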
Figure 11.2 is off-policy Q learning, right? It would be on-policy if s' were using behavior-policy's next_state.
Figure 11.2 is off-policy Q learning, right?
It's off-policy TD.
It would be on-policy if s' were using behavior-policy's next_state.
This is wrong. You have a fundamental misunderstanding of on-policy and off-policy learning.
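For reference, the distinction can be sketched in symbols, using the book's standard notation (π for the target policy, b for the behavior policy, and a linear or differentiable estimate v̂). In off-policy TD(0), every part of the transition is generated by b; the target policy enters only through the importance sampling ratio. The last line assumes coverage, that is, b(a|s) > 0 wherever π(a|s) > 0.

```latex
\begin{aligned}
\rho_t &= \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}, \qquad A_t \sim b(\cdot \mid S_t), \\
\delta_t &= R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}_t) - \hat v(S_t, \mathbf{w}_t), \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \, \rho_t \, \delta_t \, \nabla \hat v(S_t, \mathbf{w}_t), \\
\mathbb{E}_b\!\left[\rho_t \delta_t \mid S_t = s\right]
  &= \sum_a b(a \mid s) \, \frac{\pi(a \mid s)}{b(a \mid s)} \, \mathbb{E}\!\left[\delta_t \mid S_t = s, A_t = a\right]
   = \mathbb{E}_\pi\!\left[\delta_t \mid S_t = s\right].
\end{aligned}
```

So the expected update matches what the target policy would produce, although S_{t+1} itself is sampled while following b. Using a next state drawn from π directly would require acting with π, which is the on-policy case.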
OK. Thank you for your time. Have a great day!
Hello,
Thank you for your work. I have a question about the semi_gradient_off_policy_TD function. It looks like it is using an on-policy update in line 79, as next_state is a uniform selection among the 7 states, while under off-policy it should only select LOWER STATE. In my understanding, Figure 11.2 does off-policy learning, not on-policy. Correct me if I am wrong. Thank you.
Warm regards, Matt