Closed philwinder closed 4 years ago
Hi @philwinder,
I think you are correct in saying that the first item will be repeatedly chosen in that scenario. However, I would guess that in practice it is extremely rare for the values of a softmax to be exactly equal, so I'm not sure how much of a problem this is in practice.
Do you have an actual example of where this happens? In any case, I think it's a good idea to fix it just to be on the safe side.
Shadi
Hi @shadiendrawis.
It affected me when I implemented a standard tabular Q-learning agent (no NN). When the agent starts, all the values in the table are zero, so there is no single best Q-value.
You are right that the standard random initialisations in ANNs will probably avoid this. And I could also randomise the initial values in the Q-table.
But I think it would be safer and simpler to implement this in the egreedy code. An example from another framework is here: https://github.com/david-abel/simple_rl/blob/master/simple_rl/agents/QLearningAgentClass.py#L163
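For reference, a minimal sketch of that kind of tie-breaking greedy selection (a hypothetical helper in the spirit of the simple_rl code linked above, not the actual coach or simple_rl implementation):

```python
import numpy as np

def greedy_action(q_values) -> int:
    """Pick a greedy action, breaking ties uniformly at random."""
    q_values = np.asarray(q_values)
    # Indices of all actions whose value equals the maximum.
    best = np.flatnonzero(q_values == q_values.max())
    return int(np.random.choice(best))
```

With an all-zero Q-table every action is a maximum, so each action is chosen with equal probability instead of always action 0.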
Thanks, Phil
PR #423 should fix this issue.
Hi again, Whilst trying to implement a standard Q-Learning agent (see #407) I think I have found a bug in the e-greedy implementation here: https://github.com/NervanaSystems/coach/blob/master/rl_coach/exploration_policies/e_greedy.py#L93
Imagine the situation (in a discrete action space) where the q-values are equal across all actions. `np.argmax` will repeatedly pick the first item in the list as the winner, so the first action is chosen far too often. Unless the action list is shuffled somewhere else, I think this affects all algorithms. Please correct me if I am wrong.
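The bias is easy to reproduce with a small self-contained NumPy check:

```python
import numpy as np

q_values = np.zeros(4)  # all actions look equally good
picks = [int(np.argmax(q_values)) for _ in range(100)]
# np.argmax returns the index of the *first* maximum,
# so with tied values every pick is action 0.
assert picks == [0] * 100
```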
Thanks, Phil