IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Fix for issue 418 #451

Open ghidellar opened 4 years ago

shadiendrawis commented 4 years ago

Hi,

Nice catch! Do you happen to have an example where this happens and the change fixes it? I've run into issues in other contexts where the probabilities become NaN when the agent doesn't train correctly. My only concern with the change is that adding the epsilon to the output policy vector means it is no longer a probability vector, although the TensorFlow distributions apparently don't have a problem with that (probably because the epsilon is too small to matter in most cases). Is there another place where you could add that epsilon so that it makes more sense mathematically and still fixes the issue?
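
For concreteness, here is a minimal NumPy sketch of the property under discussion (illustration only, not Coach code, with a made-up value for eps):

```python
import numpy as np

# Illustration only: adding eps to the softmax output breaks the
# "sums to 1" property, even though the deviation is usually negligible.
eps = 1e-15
logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.exp(logits).sum()   # a valid probability vector

print(probs.sum())          # ~1.0
print((probs + eps).sum())  # ~1.0 + 3 * eps, no longer exactly normalized
```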

Regards, Shadi

ghidellar commented 4 years ago

Hi @shadiendrawis, basically the Categorical distribution created from the softmax output contained invalid values (this would be caught right away by setting the allow_nan_stats parameter of tf.distributions.kl_divergence to False). From that point on, all the values computed from action_probs_wrt_policy are undefined (entropy, kl_divergence, surrogate_loss), and the action_values = self.get_prediction(curr_state) call in chooseAction returns NaN values. I am thinking we could perform a sanity check on self.action_probs_wrt_policy after it receives the log probs, but that doesn't sound very elegant to me, and the KL divergence would still not be well defined. I reproduced this on a run using just 2 discrete actions in a multi-feature continuous state space.
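
For illustration, a minimal sketch of how such a NaN propagates through the distribution statistics (not Coach code; assumes TF 1.x, which Coach uses, and hand-crafted probability tensors):

```python
import numpy as np
import tensorflow as tf  # sketch assumes TF 1.x

# Illustration only: once the softmax output contains a NaN, every quantity
# derived from the Categorical distribution is NaN as well.
bad_probs = tf.constant([[0.7, np.nan]])   # stand-in for a broken policy output
ref_probs = tf.constant([[0.5, 0.5]])

bad_dist = tf.distributions.Categorical(probs=bad_probs)
ref_dist = tf.distributions.Categorical(probs=ref_probs)

entropy = bad_dist.entropy()
# With allow_nan_stats=False, the KL op asserts at run time instead of
# silently returning NaN, which is the check mentioned above.
kl = tf.distributions.kl_divergence(bad_dist, ref_dist, allow_nan_stats=True)

with tf.Session() as sess:
    print(sess.run([entropy, kl]))  # both values come out as NaN
```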

shadiendrawis commented 3 years ago

In that case, wouldn't it be more correct to try to fix the softmax outputs? How does adding an epsilon to the softmax outputs (which makes them no longer a probability vector) fix the problem? It would be nice if you could provide an example we can run to reproduce the problem locally and better understand what is causing it. The reasons I'm hesitant about this fix:

  1. I feel it masks the problem instead of fixing it, since self.action_probs_wrt_policy will still hold values that don't make sense.
  2. We've seen this NaN problem happen constantly before, only for it to disappear later once we found the right set of hyper-parameters for the problem.
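
Regarding point 1, one way to surface the failure instead of masking it would be a runtime numerics check; a minimal sketch, assuming TF 1.x and stand-in tensor names:

```python
import tensorflow as tf  # sketch assumes TF 1.x

# Hypothetical sketch: wrap the softmax output in tf.check_numerics so the
# graph fails loudly at the first NaN/Inf instead of propagating it into the
# entropy/KL terms. `policy_values` is a stand-in name for the logits tensor.
policy_values = tf.placeholder(tf.float32, shape=[None, 2], name="policy_values")
policy_probs = tf.nn.softmax(policy_values, name="policy")
policy_probs = tf.check_numerics(policy_probs, message="NaN/Inf in policy probabilities")
```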
ghidellar commented 3 years ago

Hi @shadiendrawis, sorry for the delayed reply. I was able to get back to this problem recently and followed your advice: instead of adding the eps to the probabilities, I moved the fix to the logits that feed the softmax layer. I also found a corner-case scenario (probably due to particular data) that was still breaking the code; moving the fix to the logits seems to handle that case as well, so it looks more solid.

self.policy_mean = tf.nn.softmax(policy_values + eps, name="policy")

Please let me know if this makes sense, since we are no longer messing with the probabilities.
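
As a quick sanity check on that point, here is a small NumPy sketch (illustration only; eps is given a made-up value, and policy_values mirrors the line above) contrasting the two placements of the epsilon:

```python
import numpy as np

# Illustration only: compare adding eps to the logits (the fix above)
# with adding it to the softmax output (the original proposal).
eps = 1e-15                      # assumed value for illustration
policy_values = np.array([3.0, 1.0, -2.0])

def softmax(x):
    z = np.exp(x - x.max())      # shift by max for numerical stability
    return z / z.sum()

probs_logits_eps = softmax(policy_values + eps)   # softmax is invariant to a constant shift
probs_plus_eps = softmax(policy_values) + eps     # output no longer sums to exactly 1

print(probs_logits_eps.sum())    # ~1.0 (still a normalized probability vector)
print(probs_plus_eps.sum())      # ~1.0 + 3 * eps (slightly off)
```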