Closed niskander closed 4 years ago
I can reproduce similar behavior with our examples (it's only NaN sometimes but not every time). I'll look into it.
I think the NaNs are coming from here
https://github.com/Unity-Technologies/ml-agents/blob/0796a1b5ff68a9e9b2b456d7088245edb2c50add/ml-agents/mlagents/trainers/sac/trainer.py#L241-L244
when self.cumulative_returns_since_policy_update is empty. It should be harmless, but we should probably output 0 or None in that situation.
@ervteng Any other thoughts?
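To illustrate the empty-list case described above, here is a minimal sketch. It assumes the linked lines take a NumPy mean over the list of returns, which is not confirmed in this thread, and `mean_or_none` is a hypothetical helper rather than anything in ml-agents.

```python
import warnings

import numpy as np


def mean_or_none(returns):
    """Hypothetical helper: mean of `returns`, or None if the list is empty."""
    if len(returns) == 0:
        return None
    return float(np.mean(returns))


# The mean of an empty list is NaN; NumPy also emits a RuntimeWarning,
# which is silenced here only to keep the example output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    empty_mean = np.mean([])

print(empty_mean)                 # NaN, as seen in the log message
print(mean_or_none([]))           # None instead of NaN
print(mean_or_none([1.0, 3.0]))   # ordinary mean when returns exist
```

Returning None (or 0) only changes what gets logged; it does not touch the values used for training.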
@chriselion This is literally all I'm seeing (it's never not NaN), despite the rewards always being valid in the "Step" logs. So why would cumulative_returns_since_policy_update always be empty?
If it helps, the Cumulative Reward graph in tensorboard contains valid values.
Hi @niskander, the code outputs nan there (when running in --debug mode) when there hasn't been a completed episode since the last policy update. Since SAC updates policy much more frequently than PPO, the probability of getting a nan in that message is much higher. However, the step logs happen at a much longer interval, so usually there has been a completed episode by then. The nans are entirely harmless.
@chriselion you're probably right, we should output None or Invalid in that message (and coincidentally in the CSV log).
@ervteng that's what I was thinking but does it make sense, in the screenshot, that an episode was completed and yet the next few returns were still NaN? I can try to save the output to double check, but I don't see a single non-NaN in the returns.
On the other hand, it appears to be training "correctly" as far as I can see. I just want to make sure these NaN values don't affect the training (which seems to be the case).
Yep, it makes sense, since it's episodes completed since last policy update. Every time it prints out a message, it's one policy update. So no episode was completed between the two messages, hence it's nan. By default SAC updates every environment step, so it's unlikely you'll see many non-nans.
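A rough way to see why almost every message is nan is a toy simulation like the one below. This is only a sketch: the episode length of 100 steps, the 10,000-step run, and the 1,000-step update interval used for comparison are made-up numbers, and `fraction_of_empty_updates` is a hypothetical function, not part of ml-agents.

```python
def fraction_of_empty_updates(total_steps, episode_length, steps_per_update):
    """Fraction of policy updates with no episode completed since the last one."""
    empty = 0
    updates = 0
    completed_since_update = 0
    for step in range(1, total_steps + 1):
        # An episode finishes every `episode_length` environment steps.
        if step % episode_length == 0:
            completed_since_update += 1
        # A policy update happens every `steps_per_update` environment steps.
        if step % steps_per_update == 0:
            updates += 1
            if completed_since_update == 0:
                empty += 1  # this update's message would show nan
            completed_since_update = 0
    return empty / updates


# Updating every environment step (the SAC default mentioned above).
every_step = fraction_of_empty_updates(10_000, 100, 1)
# Updating every 1,000 steps (an assumed, much less frequent schedule).
infrequent = fraction_of_empty_updates(10_000, 100, 1_000)

print(every_step, infrequent)
```

With these assumed numbers, nearly all per-step updates see no completed episode, while the infrequent schedule always sees at least one.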
Alright, thanks for clarifying @ervteng @chriselion
We've got this logged as MLA-414 in our tracker. Closing this for now.
Hi, I tried the SAC trainer and I get NaN rewards whenever it updates (image attached). My environment is returning valid rewards and the issue does not exist for PPO. Any idea what could be wrong?