ShangtongZhang / reinforcement-learning-an-introduction

Python Implementation of Reinforcement Learning: An Introduction
MIT License

add script that reproduces example 12.14 #144

Closed Johann-Huber closed 3 years ago

Johann-Huber commented 3 years ago

Description

This is a self-contained script that reproduces the missing figure 12.14.

The script could be condensed in many ways (e.g., by sharing a single Sarsa(λ) function and a single policy function across tasks). Avoiding the OOP structure would also have made the code much shorter, but that is how I worked through the book to keep things clear.

Unfortunately, I won't have time to refactor the code in the coming days. I hope it will do the job for the purposes of the repo!

Remark

Just a side remark about the Mountain Car task, for those who might be interested:

I tried to reproduce the exact conditions of the original paper [1] (even the slightly different tiling setup used in [2]), but I was not able to get the early divergence of the accumulating trace. I also tried many other variations (different tilings, initial positions, training times) without reproducing the exact same pattern.

I noticed that [4] gets results quite close to mine: the accumulating-trace method does not diverge as early or as sharply as shown in the figure from Sutton's paper. Any suggestions or remarks on that matter are welcome.

Still, the figure shows both the effect of λ on training performance and how much better the replacing method is than the accumulating one on this task, so it shouldn't be a big deal.
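For clarity, the two methods differ only in the eligibility trace update. Below is a minimal sketch of one Sarsa(λ) update with binary tile-coded features; the function and variable names are illustrative, not taken from the submitted script.

```python
import numpy as np

def sarsa_lambda_update(w, z, active_tiles, delta, alpha, gamma, lam, mode="replacing"):
    """One Sarsa(lambda) trace/weight update for binary tile-coded features.

    w            -- weight vector (one weight per tile index)
    z            -- eligibility trace vector, same shape as w
    active_tiles -- indices of the tiles active for the current state-action pair
    delta        -- TD error: r + gamma * q(s', a') - q(s, a)
    mode         -- "accumulating" or "replacing" traces
    """
    z *= gamma * lam                 # decay all traces
    if mode == "accumulating":
        z[active_tiles] += 1.0       # accumulating: traces can grow above 1
    else:
        z[active_tiles] = 1.0        # replacing: traces are capped at 1
    w += alpha * delta * z           # semi-gradient weight update
    return w, z
```

With a large λ and repeated visits to the same tiles, accumulating traces can grow well above 1, which is the mechanism usually blamed for the early divergence reported in [1].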

Usage

python3 lambda_effect.py

The whole script takes a little less than 2 hours to run on Colab.

Output

fig_12_14

The puddle world map used for the experiments is the following:
puddleworld_map
The green top-right corner is the termination area, and the puddle penalty follows [2]. The initial position is randomly picked over all non-terminal states at each episode.
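For reference, here is a rough sketch of that reward structure, assuming -1 per step plus a penalty proportional to the distance inside a puddle as in [2]; the puddle geometry and the penalty scale below are illustrative assumptions, not the exact values from the script.

```python
import numpy as np

# Hypothetical puddle geometry: each puddle is a capsule given by two endpoints and a radius.
PUDDLES = [((0.10, 0.75), (0.45, 0.75), 0.10),
           ((0.45, 0.40), (0.45, 0.80), 0.10)]

def dist_to_segment(p, a, b):
    """Distance from point p to the segment ab."""
    p, a, b = map(np.asarray, (p, a, b))
    t = np.clip(np.dot(p - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * (b - a)))

def reward(pos):
    """-1 per step, plus a penalty growing with how deep the agent is inside a puddle."""
    r = -1.0
    for a, b, radius in PUDDLES:
        depth = radius - dist_to_segment(pos, a, b)
        if depth > 0:
            r -= 400.0 * depth   # penalty scale as I recall it from [2]; treat as an assumption
    return r
```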

References

The experiments are detailed in the following papers:

Mountain Car and Random Walk: [1] Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine learning, 22(1), 123-158.

Puddle World: [2] Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in neural information processing systems, 1038-1044.

Cart Pole: [3] Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts, Amherst, MA.

Useful papers for getting a better grasp of this example:

More details on Mountain Car: [4] Främling, K. (2007, April). Replacing eligibility trace for action-value learning with function approximation. In ESANN (pp. 313-318). (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2007-49.pdf)

I decided to rely on CMAC for Cart Pole, but I tried to make the encoding close to the original "boxes" approach described in this paper: [5] Michie, D., & Chambers, R. A. (1968). BOXES: An experiment in adaptive control. Machine intelligence, 2(2), 137-152.
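As a rough illustration of what a "boxes"-style encoding looks like, here is a sketch that maps the four continuous cart-pole state variables to a single discrete box index; the bin edges are illustrative, not the exact partition used in [5] or in the script.

```python
import numpy as np

# Illustrative bin edges per state variable (x, x_dot, theta, theta_dot);
# the original BOXES approach [5] uses a similar hand-chosen partition.
BINS = [
    np.array([-0.8, 0.8]),             # cart position (m)
    np.array([-0.5, 0.5]),             # cart velocity (m/s)
    np.radians([-6, -1, 0, 1, 6]),     # pole angle (rad)
    np.radians([-50, 50]),             # pole angular velocity (rad/s)
]

def box_index(state):
    """Map a continuous cart-pole state to a single discrete 'box' index."""
    idx, stride = 0, 1
    for value, edges in zip(state, BINS):
        idx += np.digitize(value, edges) * stride
        stride *= len(edges) + 1
    return idx
```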

Credits

The Cart Pole environment code has been taken from the OpenAI Gym source code. The tile coding software comes from Sutton's website.
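For reference, that tile coding software (tiles3.py) is typically used along the following lines; the Mountain Car scaling shown here is the common one from the book and is only an assumption about how the script calls it.

```python
from tiles3 import IHT, tiles   # Sutton's tile coding software

iht = IHT(4096)                 # index hash table holding the feature indices
num_tilings = 8

def active_tiles(position, velocity, action):
    # Scale each state variable so that one tile spans 1/8 of its range
    # (the Mountain Car bounds used here are the usual ones; treat as an assumption).
    return tiles(iht,
                 num_tilings,
                 [8 * position / (0.5 + 1.2), 8 * velocity / (0.07 + 0.07)],
                 [action])
```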

ShangtongZhang commented 3 years ago

Thanks a lot for the contribution!