IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Bug: Q-Value ties are not random #414

Closed philwinder closed 4 years ago

philwinder commented 4 years ago

Hi again. While trying to implement a standard Q-learning agent (see #407) I think I have found a bug in the e-greedy implementation here: https://github.com/NervanaSystems/coach/blob/master/rl_coach/exploration_policies/e_greedy.py#L93

Imagine the situation (in a discrete action space) where the Q-values are equal across all actions. np.argmax always returns the index of the first maximal element, so it will repeatedly pick the first item in the list as the winner. This results in the first action being chosen far too often.

Unless the action list is shuffled somewhere else, I think this affects all algorithms. Please correct me if I am wrong.
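To illustrate the point, here is a minimal standalone sketch (not Coach code; the 4-action Q-vector is just an example) showing the tie behaviour of np.argmax and one common way to break ties randomly:

```python
import numpy as np

# All Q-values equal, e.g. a freshly initialised tabular agent.
q_values = np.zeros(4)

# np.argmax returns the first index on ties,
# so a greedy step based on it never picks actions 1..3.
print(np.argmax(q_values))  # -> 0, on every call

# One possible fix: choose uniformly among all maximal actions.
best_actions = np.flatnonzero(q_values == q_values.max())
print(np.random.choice(best_actions))  # -> 0, 1, 2 or 3 with equal probability
```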

Thanks, Phil

shadiendrawis commented 4 years ago

Hi @philwinder,

I think you are correct in saying that the first item will be repeatedly chosen in that scenario; however, I would guess that in practice it would be extremely rare for the outputs of a softmax to be exactly equal, so I'm not sure how much of a problem this is.

Do you have an actual example of where this happens? In any case, I think it's a good idea to fix it just to be on the safe side.

Shadi

philwinder commented 4 years ago

Hi @shadiendrawis.

It affected me when I implemented a standard tabular Q-learning agent (no NN). When the agent starts, all the values in the table are zero, so there is no single best Q-value.

You are right that the standard random initialisation in ANNs will probably avoid this, and I could also randomise the initial values in the Q-table.

But I think it would be safer and simpler to implement this in the e-greedy code; see the sketch below. An example from another framework is here: https://github.com/david-abel/simple_rl/blob/master/simple_rl/agents/QLearningAgentClass.py#L163
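Something along these lines could work (a hedged sketch of the tie-breaking idea only, not the actual Coach EGreedy class; the function name and signature are made up for illustration):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random):
    """Pick an action epsilon-greedily, breaking ties between
    equal-valued greedy actions uniformly at random."""
    if rng.rand() < epsilon:
        # Explore: uniform random action.
        return rng.randint(len(q_values))
    # Exploit: sample uniformly from all actions tied for the maximum.
    best_actions = np.flatnonzero(q_values == np.max(q_values))
    return rng.choice(best_actions)

# With an all-zero Q-table row, every action is now equally likely,
# even when epsilon is 0.
print([epsilon_greedy_action(np.zeros(4), epsilon=0.0) for _ in range(10)])
```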

Thanks, Phil

shadiendrawis commented 4 years ago

PR #423 should fix this issue.