You are right, that's a mistake. I will fix it ASAP. Thanks for pointing that out.
Also, I have implemented my own version; can you check it out? In my case the parameters are not diverging. Here is the code: semi.txt
Sure, I will take a look.
Well, just a piece of advice on coding style: you may want to keep your variables as few as possible. For instance, if you decide to use 'dash' and 'solid' as the action definitions, you should not also use 0 and 1 to represent them; just check them as strings everywhere.
And for the scale of this problem, I think an overcomplicated OOP design will make debugging a little confusing for me.
Just suggestions.
I understand that, but I am writing the code in a general form that I have set for myself. That's why there is a bit of complexity in the code that is not required to solve this particular problem.
It took me a while to produce a divergent case in your code (which I simplified a bit). There are two reasons:
1. Book's definition of state 1: (2, 0, 0, 0, 0, 0, 0, 0, 1) (action not included)
   Your definition of state 1: (1, 0, 0, 0, 0, 0, 0, 0, 2)
2. State 1, solid: (2, 0, 0, 0, 0, 0, 0, 0, 1, 1)
   State 1, dashed: (2, 0, 0, 0, 0, 0, 0, 0, 1, 0)
By fixing the two points above, W overflows (diverges) as the book indicates; see the sketch below.
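For concreteness, here is a minimal sketch of the feature construction described in the two points above (illustrative NumPy, not the notebook's exact code; the vector sizes follow the book-style definition quoted above, with the action flag solid=1 / dashed=0 appended at the end):

```python
import numpy as np

# Book-style feature vector for state 1 (action not included).
STATE_1_FEATURES = np.array([2, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)

def state_action_features(state_features, action):
    """Append the action indicator (solid=1, dashed=0) to the state features."""
    action_bit = 1.0 if action == 'solid' else 0.0
    return np.append(np.asarray(state_features, dtype=float), action_bit)

# state 1, solid  -> (2, 0, 0, 0, 0, 0, 0, 0, 1, 1)
# state 1, dashed -> (2, 0, 0, 0, 0, 0, 0, 0, 1, 0)
print(state_action_features(STATE_1_FEATURES, 'solid'))
print(state_action_features(STATE_1_FEATURES, 'dashed'))
```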
Also, I played with your action() function. If you pick the action with an epsilon-greedy policy, then even with the above fixes in place the weights converge.
Interesting, right? Just as the book says.
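A rough sketch of what such an epsilon-greedy action picker could look like (this is an assumption about the interface, not the notebook's actual action() function):

```python
import numpy as np

def epsilon_greedy_action(w, state_features, epsilon=0.1, rng=np.random):
    """Pick 'solid' or 'dashed' epsilon-greedily with respect to the
    current linear action-value estimates (hypothetical helper)."""
    def q(action):
        # append the action flag (solid=1, dashed=0) and take the dot product with w
        action_bit = 1.0 if action == 'solid' else 0.0
        return w @ np.append(np.asarray(state_features, dtype=float), action_bit)

    if rng.random() < epsilon:
        return rng.choice(['solid', 'dashed'])   # explore uniformly
    return max(('solid', 'dashed'), key=q)       # exploit the greedy action
```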
My thoughts on why using two w's to evaluate the action values fails to diverge: first, judging by the result, it is a good thing. But it breaks the key to divergence in this counterexample: every update has to change the values of all states at once. Splitting into two action-value weight vectors breaks that condition and leads to a better result.
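To make the distinction concrete, a small sketch of the two parameterizations being compared (illustrative only; the notebook's actual variable names may differ):

```python
import numpy as np

def q_shared(w, state_features, action):
    """Single shared weight vector over state-action features
    (the book-style setup discussed above)."""
    action_bit = 1.0 if action == 'solid' else 0.0
    return w @ np.append(np.asarray(state_features, dtype=float), action_bit)

def q_per_action(w_solid, w_dashed, state_features, action):
    """Two separate weight vectors, one per action, as in the earlier version
    of the code; the two action values no longer share all of their weights."""
    w = w_solid if action == 'solid' else w_dashed
    return w @ np.asarray(state_features, dtype=float)
```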
My modification: https://colab.research.google.com/drive/14uL-DZNca9YKBNG1IQnkn778NQxKQRJ8?usp=sharing
I also applied the above changes to your original code, but I guess you can try it out yourself.
And actually, my original code is completely misleading, so I will rewrite it. Thanks for pointing out the issue.
Initially, I tried with only one 'w'; I saw that the weights were not diverging, thought my formulation was wrong, and so added two 'w'. I have also noticed an interesting point: when you use dashed=1 and solid=0, it converges. I think the weight for the action is acting as a regularizer in that case, which is why the weights converge. And thanks for correcting the code.
Q-learning is inherently off-policy control, yet you have still multiplied the update by rho. I think there is no need to multiply by the rho (importance sampling) factor.
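For comparison, a minimal sketch of a one-step semi-gradient Q-learning update with linear features and no importance-sampling ratio (illustrative names only; alpha, gamma, and the feature inputs are assumptions, not taken from the notebook):

```python
import numpy as np

def q_learning_update(w, x_sa, reward, next_state_action_features,
                      alpha=0.01, gamma=0.99):
    """One-step semi-gradient Q-learning with linear features.

    x_sa:                       feature vector of the (state, action) just taken
    next_state_action_features: list of feature vectors, one per action
                                available in the next state

    Because the target uses the max over next-action values, no
    importance-sampling ratio (rho) appears in the update.
    """
    q_sa = w @ x_sa
    q_next = (max(w @ x for x in next_state_action_features)
              if next_state_action_features else 0.0)
    td_error = reward + gamma * q_next - q_sa
    # semi-gradient step: the gradient of q_sa with respect to w is just x_sa
    return w + alpha * td_error * x_sa
```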