Open Jacobi93 opened 5 years ago
Same here: the algorithm did not converge the way the example does. However, off-policy Q-learning with linear function approximation is not guaranteed to converge, according to David Silver's lecture 6 notes, page 32. It is interesting that the original example converges at all.
My result:
"Not guaranteed" means that it may converge, but convergence is not assured. Different initializers and random policies may lead to different results. It might still be better for the author to mention this. Thanks.
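To make the "not guaranteed" point concrete, here is a minimal sketch based on the well-known two-state textbook example (as in Sutton and Barto), not on the code in this repo. It uses semi-gradient TD(0) for value prediction rather than Q-learning, on the assumption that the same combination of bootstrapping, off-policy updates, and linear function approximation is what causes the trouble. The function name and parameters are made up for illustration.

```python
# Two states share one weight w: v(s1) = 1 * w, v(s2) = 2 * w.
# Only the transition s1 -> s2 (reward 0) is ever updated, which is the kind
# of skewed update distribution that off-policy training can produce.

def semi_gradient_td_updates(w0, alpha=0.1, gamma=0.99, steps=50):
    """Return the weight trajectory under repeated s1 -> s2 updates."""
    w = w0
    history = [w]
    for _ in range(steps):
        v_s1 = 1.0 * w
        v_s2 = 2.0 * w
        td_error = 0.0 + gamma * v_s2 - v_s1   # reward is 0
        w = w + alpha * td_error * 1.0         # d v(s1) / d w = 1
        history.append(w)
    return history

# Each step multiplies w by 1 + alpha * (2 * gamma - 1).
diverging = semi_gradient_td_updates(w0=1.0, gamma=0.99)  # factor > 1
shrinking = semi_gradient_td_updates(w0=1.0, gamma=0.4)   # factor < 1

print(abs(diverging[-1]) > abs(diverging[0]))
print(abs(shrinking[-1]) < abs(shrinking[0]))
```

Whether the weight blows up here depends only on the discount factor, so the same update rule can converge in one setting and diverge in another. That is consistent with the original example converging while other runs do not.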
Hi, thank you for your wonderful code. It has helped me a lot. With REINFORCE with baseline on cliff_walking, I could not obtain stable results. The best reward should be -15, as you plotted. But sometimes when I run the code without any changes, it converges to -100, which is very weird. Could anyone run the code several times and find out why? Thank you so much.
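One way to check how often this happens is to repeat the run over several seeds and count the outcomes. The sketch below is only a harness: `train_once` is a hypothetical stand-in that returns made-up final rewards so the example runs, and it would need to be replaced by a call into the actual REINFORCE with baseline code. The `sweep` helper and the `good_threshold` value are also assumptions for illustration.

```python
import random
import statistics

def train_once(seed):
    """Hypothetical stand-in for one training run; returns a final reward.

    The values are invented so the sketch runs. Replace this body with the
    real training call and return its final average episode reward.
    """
    rng = random.Random(seed)
    # Toy behaviour: most runs end near -15, some get stuck near -100.
    return -15.0 if rng.random() < 0.7 else -100.0

def sweep(seeds, train_fn=train_once, good_threshold=-20.0):
    """Run train_fn once per seed and summarise the final rewards."""
    finals = {seed: train_fn(seed) for seed in seeds}
    good = [s for s, r in finals.items() if r >= good_threshold]
    bad = [s for s, r in finals.items() if r < good_threshold]
    return {
        "finals": finals,
        "good_seeds": good,
        "bad_seeds": bad,
        "median": statistics.median(finals.values()),
    }

summary = sweep(range(10))
print("good seeds:", summary["good_seeds"])
print("bad seeds:", summary["bad_seeds"])
print("median final reward:", summary["median"])
```

For the seeds to make real runs reproducible, the seed would also have to be passed to every random number generator the training code uses (for example NumPy and the deep learning framework). A bad seed found this way can then be rerun to inspect where the policy gets stuck.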