PacktPublishing / Deep-Reinforcement-Learning-Hands-On

Hands-on Deep Reinforcement Learning, published by Packt
MIT License

Chapter08: exploration in the validation procedure, is it an issue ? #50

Closed domixit closed 5 years ago

domixit commented 5 years ago

Hello Max ... great work that allows us to dig into the RL world ...

In the Chapter 8 code:

In the validation procedure (validation.py) I noticed that epsilon is kept at a non-zero value (it defaults to 0.2), which means the policy is not greedy but rather epsilon-greedy. This means that 2 out of 10 actions are random!

RL theory says validation should be purely greedy (epsilon=0). Is this an error, or was it done deliberately?

Shmuma commented 5 years ago

Hi!

Good question! It wasn't stated explicitly in the book; I will add a couple of sentences about this in the 2nd edition.

The reason behind the non-zero eps is to test the robustness of our policy by injecting noise into the testing sequence. The main motivation is that we don't want the network to just memorize and replay some best sequence of actions (which could easily be the only sequence in a deterministic environment). We want the network to be robust and to know how to recover from random perturbations. So, we inject random actions and run several tests. In fact, 0.2 could be a bit too high; it's probably a leftover from some experiment. I normally use 2-5%.
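To illustrate the point above, here is a minimal sketch of epsilon-greedy action selection with a small validation epsilon. This is not the book's `validation.py`; the `select_action` helper and the 2% default are illustrative assumptions, showing how a small fraction of random actions perturbs an otherwise greedy policy:

```python
import numpy as np


def select_action(q_values, epsilon=0.02, rng=None):
    """Epsilon-greedy selection over a vector of Q-values.

    With probability `epsilon`, pick a uniformly random action
    (the perturbation used to test policy robustness); otherwise,
    take the greedy argmax-Q action.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With `epsilon=0.0` this reduces to the purely greedy policy the question asks about; with a small `epsilon` such as 0.02, roughly 1 in 50 validation steps is a random action, forcing the agent to recover rather than replay a memorized trajectory.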