LyWangPX / Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions

Solutions of Reinforcement Learning, An Introduction

error in exercise 11.3 #53

Closed · gakshaygupta closed this issue 4 years ago

gakshaygupta commented 4 years ago

Q-learning is inherently an off-policy control method, yet you have multiplied the update by the importance-sampling ratio rho. I think there is no need to multiply by the rho factor.
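For reference, here is a minimal sketch (my own illustration, not code from the repo) of the one-step semi-gradient Q-learning update with linear function approximation; no importance-sampling ratio rho appears in it:

```python
import numpy as np

def q_learning_update(w, x, x_next_per_action, reward, alpha=0.01, gamma=0.99):
    """One step of semi-gradient Q-learning with linear q_hat(s, a, w) = x(s, a) @ w.

    x: feature vector of the (state, action) pair just taken.
    x_next_per_action: one feature vector per action available in the next state.
    The target bootstraps off the greedy next action, so no rho factor is needed.
    """
    target = reward + gamma * max(x_a @ w for x_a in x_next_per_action)
    td_error = target - x @ w
    return w + alpha * td_error * x  # gradient of the linear q_hat w.r.t. w is x

# Tiny usage example with made-up 3-dimensional features:
w = np.zeros(3)
w = q_learning_update(w,
                      x=np.array([1.0, 0.0, 1.0]),
                      x_next_per_action=[np.array([0.0, 1.0, 1.0]),
                                         np.array([1.0, 1.0, 0.0])],
                      reward=0.0)
```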

LyWangPX commented 4 years ago

You are right, that's a mistake. I will fix it ASAP. Thanks for pointing that out.

gakshaygupta commented 4 years ago

> You are right, that's a mistake. I will fix it ASAP. Thanks for pointing that out.

Also, I have implemented my own version; can you check it out? In my case the parameters are not diverging. Here is the code: semi.txt

LyWangPX commented 4 years ago

Sure, I will take a look.

LyWangPX commented 4 years ago

Well, just a piece of advice on coding style: you may want to keep the number of distinct representations as small as possible. For instance, if you decide to use 'dash' and 'solid' as your action definitions, you should not also use 0 and 1 to represent them; just check them as strings everywhere.
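A tiny, hypothetical sketch of what I mean (names here are just placeholders):

```python
# Define the actions once and reuse the same constants everywhere,
# instead of mixing the string form with a second 0/1 encoding.
DASHED = "dashed"
SOLID = "solid"
ACTIONS = (DASHED, SOLID)

def is_solid(action):
    # Compare against the single canonical constant.
    return action == SOLID
```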

And for a problem of this scale, I think an overcomplicated OOP design makes debugging a bit confusing for me.

Just suggestions.

gakshaygupta commented 4 years ago

I understand that, but I am writing the code within a general framework that I have set up for myself. That's why there is a bit of extra complexity in the code that is not strictly required to solve this problem.

LyWangPX commented 4 years ago

It took me a while to produce a diverging case in your code (which I simplified a bit). There are two reasons:

  1. The feature function is not the same as the one the book indicates:

     Book definition of state 1: (2, 0, 0, 0, 0, 0, 0, 0, 1) (action not included)
     Your definition of state 1: (1, 0, 0, 0, 0, 0, 0, 0, 2)

  2. Do not use two separate weight vectors to evaluate the action values. Instead, use a single weight vector and encode the action in the features (see the sketch below). For instance:

     State 1, solid: (2, 0, 0, 0, 0, 0, 0, 0, 1, 1)
     State 1, dashed: (2, 0, 0, 0, 0, 0, 0, 0, 1, 0)

After fixing the two points above, W overflows (diverges) as the book indicates.
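To make point 2 concrete, here is a rough sketch of what I mean by a single weight vector with an action component appended (my own illustration; `feature` and `q_hat` are hypothetical names, not functions from your code):

```python
import numpy as np

def feature(state_features, action):
    """Append a single action-indicator component to the state features,
    so one weight vector w covers both actions (point 2 above)."""
    action_bit = 1.0 if action == "solid" else 0.0
    return np.append(np.asarray(state_features, dtype=float), action_bit)

def q_hat(state_features, action, w):
    # Linear action-value estimate with one shared weight vector w.
    return feature(state_features, action) @ w

# The vectors quoted above for state 1 (book-style state features, no action):
state_1 = [2, 0, 0, 0, 0, 0, 0, 0, 1]
w = np.ones(10)
print(feature(state_1, "solid"))   # matches the "state 1, solid" vector above
print(feature(state_1, "dashed"))  # matches the "state 1, dashed" vector above
print(q_hat(state_1, "solid", w), q_hat(state_1, "dashed", w))
```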

Also, I played with your action() function. If you pick the action with an epsilon-greedy policy, then even with the two fixes above the weights converge.
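By epsilon-greedy I mean roughly the following generic sketch (not your actual action() function):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the greedy action most of the time, a random action otherwise.

    q_values: sequence of estimated action values, one entry per action index.
    """
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action

# Usage example with two dummy action values:
print(epsilon_greedy([3.0, 4.0]))
```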

Interesting, right? Just as the book says.

My thoughts on why using two w vectors to evaluate the action values fails to diverge: first, judging by the result, it is a good thing. But it somehow breaks the key ingredient of divergence in this counterexample: the updates have to keep touching all states at once. Splitting the weights into two action-value vectors breaks that condition and leads to a better result.

LyWangPX commented 4 years ago

My modification: https://colab.research.google.com/drive/14uL-DZNca9YKBNG1IQnkn778NQxKQRJ8?usp=sharing

I also applied the changes above to your original code, but I guess you can try it out yourself.

And actually my original code is completely misleading, so I will rewrite it. Thanks for raising the issue.

gakshaygupta commented 4 years ago

Initially, I tried with only one 'w', but when I saw that the weights were not diverging I thought my formulation was wrong and switched to two 'w's. Also, I noticed an interesting point: when you use dashed=1 and solid=0, it converges. I think the weight on the action component is acting as a regularizer in that case, which is why the weights converge. And thanks for correcting the code.