glmcdona opened this issue 3 years ago
Hello,

To add to @glmcdona, I'm getting the exact same issue, but with a `Box` action space (if that makes any difference). After the update with the first minibatch, the networks are filled with `nan`s.

I will try to replicate with a classic gym env (by the way, the `Pendulum-v0` env from the examples is deprecated, I think).
This error only occurs with the optax `adam` optimizer; the workaround is to use the `sgd` optimizer. The error does not reproduce with `TestPPOClip->test_update_discrete()` or with the Pong PPO example using the `adam` optimizer. Maybe close this issue unless a reliable repro can be created?
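For concreteness, the swap amounts to changing which optax optimizer is handed to the update. A minimal optax-only sketch of the two configurations (the params, grads, and learning rates here are dummy stand-ins, not the actual networks or values from the repro):

```python
import jax.numpy as jnp
import optax

params = {'w': jnp.ones(3)}
grads = {'w': jnp.array([0.1, -0.2, 0.3])}

# optimizer = optax.adam(1e-4)  # the configuration that ends in nan/inf weights
optimizer = optax.sgd(1e-4)     # workaround: plain SGD trains without nans

opt_state = optimizer.init(params)
updates, opt_state = optimizer.update(grads, opt_state)
params = optax.apply_updates(params, updates)
```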
Hi Geoff! Thanks for telling me about this one.

It's very surprising that replacing `optax.adam` with `optax.sgd` seems to help. Perhaps the adam accumulators are contaminated by a non-finite gradient somewhere?

Would it be possible to share a Colab notebook?
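One way to probe that hypothesis (a hedged sketch, not from the original thread): check every leaf of the grads, params, and optimizer state for non-finite values around each update. The `grads` dict below is a dummy stand-in for whatever pytree the repro script produces:

```python
import jax
import jax.numpy as jnp

def all_finite(tree):
    """True iff no leaf of the pytree contains nan or +/-inf."""
    return all(bool(jnp.all(jnp.isfinite(leaf)))
               for leaf in jax.tree_util.tree_leaves(tree))

# Dummy stand-in for the real grads / adam accumulators:
grads = {'w': jnp.array([0.1, -0.2]), 'b': jnp.array([jnp.inf])}
print(all_finite(grads))  # False -> a non-finite value snuck in
```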
**Describe the bug**

Hey Kris, love your framework! I'm working with a custom environment, and your discrete action unit test works perfectly locally. Don't spend much time investigating this yet; I'm just creating this in case something jumps out at you as the problem. I plan on continuing to debug this issue.
During the first PPOClip update with the custom gym env, the model weights get changed to `+/-inf` despite a finite grad.

**Expected behavior**

The model weights should stay finite after the update. Instead, the update results in:
Here is the full repro script, taken from the Pong PPO example and slightly modified; it won't run as-is because of the custom environment. This is a dummy example, not the actual policy and value networks that would be used:
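(The attached script itself isn't reproduced in this excerpt. As a stand-in, below is a hedged sketch of a coax PPOClip setup with a `Box` action space, modeled on the library's public Pendulum PPO example; `Pendulum-v1` replaces the custom env so the snippet is self-contained, and all networks and hyperparameters are placeholders, not the reporter's actual values.)

```python
import coax
import gym
import haiku as hk
import jax
import jax.numpy as jnp
import optax

env = gym.make('Pendulum-v1')  # stand-in for the custom Box-action env
env = coax.wrappers.TrainMonitor(env)

def func_pi(S, is_training):
    # dummy policy head: mean and log-variance of a Gaussian over actions
    out = hk.Sequential((
        hk.Linear(8), jax.nn.relu,
        hk.Linear(2 * env.action_space.shape[0]),
    ))(S)
    mu, logvar = jnp.split(out, 2, axis=-1)
    return {'mu': mu, 'logvar': logvar}

def func_v(S, is_training):
    # dummy state-value head
    return hk.Sequential((
        hk.Linear(8), jax.nn.relu,
        hk.Linear(1), jnp.ravel,
    ))(S)

pi = coax.Policy(func_pi, env)
v = coax.V(func_v, env)
pi_old = pi.copy()  # behavior policy, used for the PPO importance ratio

tracer = coax.reward_tracing.NStep(n=5, gamma=0.9)
buffer = coax.experience_replay.SimpleReplayBuffer(capacity=256)

simple_td = coax.td_learning.SimpleTD(v, optimizer=optax.adam(1e-3))
ppo_clip = coax.policy_objectives.PPOClip(pi, optimizer=optax.adam(1e-4))

for ep in range(50):
    s = env.reset()
    for t in range(env.spec.max_episode_steps):
        a, logp = pi_old(s, return_logp=True)
        s_next, r, done, info = env.step(a)
        tracer.add(s, a, r, done, logp)
        while tracer:
            buffer.add(tracer.pop())
        if len(buffer) >= buffer.capacity:
            for _ in range(4 * buffer.capacity // 32):
                transition_batch = buffer.sample(batch_size=32)
                _, td_error = simple_td.update(transition_batch, return_td_error=True)
                ppo_clip.update(transition_batch, td_error)  # reported to blow up with adam
            buffer.clear()
            pi_old.soft_update(pi, tau=0.1)
        if done:
            break
        s = s_next
```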