Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

NaN rewards with SAC only #3041

Closed: niskander closed this issue 4 years ago

niskander commented 4 years ago

Hi, I tried the SAC trainer and I get NaN rewards whenever it updates (image attached). My environment is returning valid rewards and the issue does not exist for PPO. Any idea what could be wrong?

[Screenshot: SAC_nan_rewards_LI — training log showing NaN rewards]

chriselion commented 4 years ago

I can reproduce similar behavior with our examples (it's NaN some of the time, but not every time). I'll look into it.

chriselion commented 4 years ago

I think the NaNs are coming from https://github.com/Unity-Technologies/ml-agents/blob/0796a1b5ff68a9e9b2b456d7088245edb2c50add/ml-agents/mlagents/trainers/sac/trainer.py#L241-L244, when self.cumulative_returns_since_policy_update is empty. It should be harmless, but we should probably output 0 or None in that situation.
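
For illustration, a minimal sketch of the suspected failure mode (this is not the trainer code itself, just NumPy's behavior on an empty sequence):

```python
import numpy as np

# No episodes completed since the last policy update:
returns_since_update = []

# np.mean of an empty sequence returns nan (with a "Mean of empty slice"
# RuntimeWarning), which is what shows up as the logged mean reward.
print(np.mean(returns_since_update))  # nan
```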

@ervteng Any other thoughts?

niskander commented 4 years ago

@chriselion This is literally all I'm seeing (it's always NaN), despite the rewards always being valid in the "Step" logs. So why would cumulative_returns_since_policy_update always be empty?

niskander commented 4 years ago

If it helps, the Cumulative Reward graph in TensorBoard contains valid values.

ervteng commented 4 years ago

Hi @niskander, the code outputs NaN there (when running in --debug mode) when there hasn't been a completed episode since the last policy update. Since SAC updates the policy much more frequently than PPO, the probability of getting a NaN in that message is much higher. However, the step logs happen at a much longer interval, so usually an episode has completed by then. The NaNs are entirely harmless.

@chriselion you're probably right; we should output None or Invalid in that message (and correspondingly in the CSV log).
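
A sketch of the kind of guard being suggested (a hypothetical helper, not the actual patch):

```python
from typing import List, Optional

import numpy as np

def mean_return_or_none(returns: List[float]) -> Optional[float]:
    """Hypothetical guard: report None instead of NaN when no episode
    has completed since the last policy update."""
    return float(np.mean(returns)) if returns else None

print(mean_return_or_none([]))          # None
print(mean_return_or_none([1.0, 3.0]))  # 2.0
```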

niskander commented 4 years ago

@ervteng That's what I was thinking, but does it make sense that in the screenshot an episode was completed and yet the next few returns were still NaN? I can try saving the output to double-check, but I don't see a single non-NaN value in the returns.

On the other hand, it appears to be training "correctly" as far as I can see. I just want to make sure these NaN values don't affect the training (which seems to be the case).

ervteng commented 4 years ago

Yep, it makes sense: the value is computed over episodes completed since the last policy update. Every printed message corresponds to one policy update, so if no episode completed between two messages, the result is NaN. By default SAC updates after every environment step, so it's unlikely you'll see many non-NaN values.
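
To make the frequency argument concrete, a toy calculation with assumed numbers (illustrative only, not measurements from this issue):

```python
# Assumed numbers for illustration; not taken from the thread.
episode_length = 500                  # hypothetical episode length in steps
updates_per_episode = episode_length  # SAC default: one update per env step

# At most one update window per episode can contain the episode's end,
# so nearly every per-update log line has an empty return list.
fraction_nan = 1 - 1 / updates_per_episode
print(f"~{fraction_nan:.1%} of per-update log lines report NaN")
```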

niskander commented 4 years ago

Alright, thanks for clarifying @ervteng @chriselion

chriselion commented 4 years ago

We've got this logged as MLA-414 in our tracker. Closing this for now.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.