PacktPublishing / Deep-Reinforcement-Learning-Hands-On

Hands-on Deep Reinforcement Learning, published by Packt
MIT License

Chapter08: exploration in the validation procedure, is it an issue ? #50

Closed domixit closed 5 years ago

domixit commented 5 years ago

Hello Max ... great work that allows us to dig into the RL world ...

In the Chapter 8 code:

In the validation procedure (validation.py) I noticed that epsilon is kept at a non-zero value (it defaults to 0.2), which means the policy is not greedy but rather epsilon-greedy. This means that 2 out of 10 actions are random!

RL theory says validation should be purely greedy (epsilon=0). Is this an error, or was it done deliberately?

Shmuma commented 5 years ago

Hi!

Good question! It wasn't stated explicitly in the book; I will add a couple of sentences about this in the 2nd edition.

The reason behind the non-zero eps is to test the robustness of our policy by injecting noise into the testing sequence. The main motivation is that we don't want the network to just memorize and replay some best sequence of actions (which could easily be the only sequence in a deterministic environment). We want the network to be robust and to know how to recover from random perturbations. So, we inject random actions and run several tests. In fact, 0.2 could be a bit too high; it's probably a leftover from some experiment. I normally use 2-5%.
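To illustrate the point above, here is a minimal sketch of epsilon-greedy action selection with a small validation epsilon. This is not the book's `validation.py`; the `select_action` helper and the 2% default are illustrative assumptions, showing how a small fraction of random actions perturbs an otherwise greedy policy:

```python
import numpy as np


def select_action(q_values, epsilon=0.02, rng=None):
    """Epsilon-greedy selection over a vector of Q-values.

    With probability `epsilon`, pick a uniformly random action
    (the perturbation used to test policy robustness); otherwise,
    take the greedy argmax-Q action.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With `epsilon=0.0` this reduces to the purely greedy policy the question asks about; with a small `epsilon` such as 0.02, roughly 1 in 50 validation steps is a random action, forcing the agent to recover rather than replay a memorized trajectory.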