Add ex8.4 code and plot

I wrote some code for ex8.4. I first reproduced Example 8.2 with Dyna-Q and Dyna-Q+, then ran the exercise's experiment based on that example. The results show that Dyna-Q+exp is better than Dyna-Q and almost as good as Dyna-Q+ in the first 1000 steps. After the environment changes, however, it can't catch up with Dyna-Q+.

I think this is probably because Dyna-Q+ adds the bonus directly into the Q values, which lets it quickly start exploring again after the environment changes. Dyna-Q+exp only considers the bonus when choosing the action with the maximal Q value, so it takes longer to start exploring than Dyna-Q+. In other words, for Dyna-Q+exp the bonus is not 'accumulated' in the Q values.
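To make the difference concrete, here is a minimal sketch of where the bonus enters in each variant. It is not the code from this PR: the function names, the dictionary-based `q` and `tau` tables, and the default values of `alpha`, `gamma`, and `kappa` are all assumptions for illustration. The bonus has the form kappa * sqrt(tau), where tau is the number of steps since the state-action pair was last tried.

```python
import math

# Hypothetical helpers for illustration only; not taken from this PR.
# q maps (state, action) -> value, tau maps (state, action) -> steps since last tried.

def planning_update_dyna_q_plus(q, s, a, r, s_next, tau, actions,
                                alpha=0.1, gamma=0.95, kappa=1e-3):
    """Dyna-Q+ style planning update: the bonus is added to the modeled
    reward, so it is written into Q and propagates through later backups."""
    bonus = kappa * math.sqrt(tau[(s, a)])
    target = r + bonus + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])


def select_action_exp(q, s, tau, actions, kappa=1e-3):
    """Dyna-Q+exp style selection: the bonus only biases the argmax.
    Q itself is never modified here, so nothing accumulates."""
    return max(actions, key=lambda a: q[(s, a)] + kappa * math.sqrt(tau[(s, a)]))
```

In the first function the bonus becomes part of the stored value, so neighbouring states see it through the `max` in their own backups. In the second, the bonus only matters at the moment the agent is actually in state `s`, which is consistent with the slower re-exploration described above.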

[Figure: ex8_4 plot]

LyWangPX / Reinforcement-Learning-2nd-Edition-by-Sutton-Exercise-Solutions

Add ex8.4 code and plot #57