YYCAAA / V-MPO_Lunarlander

Simple implementation of V-MPO proposed in https://arxiv.org/abs/1909.12238
MIT License

update_timestep leads to NaNs #1

Open gianlucadest opened 3 years ago

gianlucadest commented 3 years ago

Hello,

I tried to test the algorithm on CartPole-v1 and, depending on the update_timestep parameter, it crashes due to NaNs in the KL divergence. Is this a bug or is it just a sensitive parameter?

update_timestep = 100  # leads to occasional crashes
update_timestep = 200  # fewer crashes than with 100
update_timestep = 500  # is stable

YYCAAA commented 3 years ago

Hi Gianluca, I'm currently very busy with my PhD application and have almost forgotten this project. Your issue seems to be a general one; you could try to debug it yourself or with your friends.

gianlucadest commented 3 years ago

Hello YYCAAA,

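Thank you for your answer. There seems to be an issue with your reward estimation. As written, your code only works with full episodes because you never call the critic network on the final state, so a rollout that is cut off mid-episode (as can happen with a small update_timestep) gets returns that ignore everything after the cut. This needs to be fixed for the estimate to be correct, and fixing it will probably resolve the NaNs as well.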
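
A minimal sketch of the bootstrapping described above, in plain Python so it runs without the repository. The function name `discounted_returns` and the arguments `dones` and `last_value` are hypothetical and not taken from the repository; `last_value` stands for the critic's estimate of the state reached after the last stored step, and `dones[t]` is assumed to be true when the episode terminated at step `t`.

```python
from typing import List, Sequence


def discounted_returns(
    rewards: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,
) -> List[float]:
    """Discounted returns for a rollout that may end mid-episode.

    Assumption: `last_value` is the critic's value for the state that
    follows the final stored transition. It is only used if that final
    transition did not terminate the episode.
    """
    returns = [0.0] * len(rewards)
    running = last_value  # bootstrap from the critic instead of assuming 0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Episode terminated at step t: nothing after it contributes.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With `last_value=0.0` this reduces to the usual full-episode return, which is why the problem would only show up when an update happens partway through an episode.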