Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

NaN received by OnActionReceived() during SAC training and inference #4618

Closed: Dastyn closed this issue 3 years ago

Dastyn commented 3 years ago

Hi,

I'm seeing NaN values received by OnActionReceived() during both training and inference. After a certain number of steps, for instance during training, the log displays:

...
2020-10-31 17:37:50 INFO [stats.py:118] Rbehaviour. Step: 767000. Time Elapsed: 8041.055 s. Mean Reward: -6.940. Std of Reward: 0.413. Training.
2020-10-31 17:38:09 INFO [stats.py:118] Rbehaviour. Step: 768000. Time Elapsed: 8059.390 s. Mean Reward: -1.561. Std of Reward: 0.000. Training.
2020-10-31 17:38:28 INFO [stats.py:118] Rbehaviour. Step: 769000. Time Elapsed: 8078.650 s. Mean Reward: -9.748. Std of Reward: 5.141. Training.
2020-10-31 17:38:49 INFO [stats.py:118] Rbehaviour. Step: 770000. Time Elapsed: 8099.602 s. Mean Reward: -6.983. Std of Reward: 0.261. Training.
2020-10-31 17:39:10 INFO [stats.py:118] Rbehaviour. Step: 771000. Time Elapsed: 8120.284 s. Mean Reward: -6.681. Std of Reward: 0.000. Training.
2020-10-31 17:39:36 INFO [stats.py:118] Rbehaviour. Step: 772000. Time Elapsed: 8146.828 s. No episode was completed since last summary. Training.
2020-10-31 17:40:03 INFO [stats.py:118] Rbehaviour. Step: 773000. Time Elapsed: 8173.402 s. No episode was completed since last summary. Training.
2020-10-31 17:40:26 INFO [stats.py:118] Rbehaviour. Step: 774000. Time Elapsed: 8196.769 s. No episode was completed since last summary. Training.
2020-10-31 17:40:53 INFO [stats.py:118] Rbehaviour. Step: 775000. Time Elapsed: 8223.501 s. No episode was completed since last summary. Training.
2020-10-31 17:41:19 INFO [stats.py:118] Rbehaviour. Step: 776000. Time Elapsed: 8250.074 s. No episode was completed since last summary. Training.
...

To clarify a bit: in my code there is a call to AddReward inside the body of OnActionReceived, and the value it adds is partly based on the values of act[]. Since act[] is apparently full of NaN after roughly 771k steps, an error is raised when AddReward is called. This is the corresponding call stack:

ArgumentException: NaN increment passed to AddReward.
Unity.MLAgents.Utilities.DebugCheckNanAndInfinity (System.Single value, System.String valueCategory, System.String caller) (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Utilities.cs:84)
Unity.MLAgents.Agent.AddReward (System.Single increment) (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Agent.cs:670)
Ragent.MoveAgent (System.Single[] act) (at Assets/R/Scripts/Ragent.cs:82)
Ragent.OnActionReceived (System.Single[] vectorAction) (at Assets/R/Scripts/Ragent.cs:108)
Unity.MLAgents.Agent.OnActionReceived (Unity.MLAgents.Actuators.ActionBuffers actions) (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Agent.cs:1204)
Unity.MLAgents.Actuators.VectorActuator.OnActionReceived (Unity.MLAgents.Actuators.ActionBuffers actionBuffers) (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Actuators/VectorActuator.cs:65)
Unity.MLAgents.Actuators.ActuatorManager.ExecuteActions () (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Actuators/ActuatorManager.cs:240)
Unity.MLAgents.Agent.AgentStep () (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Agent.cs:1275)
Unity.MLAgents.Academy.EnvironmentStep () (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Academy.cs:584)
Unity.MLAgents.AcademyFixedUpdateStepper.FixedUpdate () (at Library/PackageCache/com.unity.ml-agents@1.5.0-preview/Runtime/Academy.cs:43)
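
As a stopgap, I can guard the action buffer at the top of OnActionReceived so the run reports the first bad step instead of throwing inside AddReward. A minimal sketch, assuming the legacy float[] callback shown in the call stack; MoveAgent comes from the trace above, and the guard itself is illustrative, not what my scene currently does:

using Unity.MLAgents;
using UnityEngine;

public class Ragent : Agent
{
    // Sketch: validate actions before using them, so the first NaN step
    // is logged instead of surfacing later as an AddReward exception.
    public override void OnActionReceived(float[] vectorAction)
    {
        for (var i = 0; i < vectorAction.Length; i++)
        {
            if (float.IsNaN(vectorAction[i]) || float.IsInfinity(vectorAction[i]))
            {
                Debug.LogError($"Non-finite action[{i}] = {vectorAction[i]} " +
                               $"at step {Academy.Instance.TotalStepCount}");
                EndEpisode();
                return;
            }
        }

        MoveAgent(vectorAction);  // existing movement/reward logic (see trace)
    }

    void MoveAgent(float[] act) { /* existing logic omitted */ }
}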

These are the observation and action spaces, as reported by the exported model's inputs and outputs:

Inputs (2):
  visual_observation_0, shape (-1, 50, 50, 12)
  vector_observation, shape (-1, 1, 1, 9)

Outputs (3):
  policy/concat/concat, shape (-1, -1, -1, -1)
  action, shape (-1, -1, -1, -1)
  action_probs, shape (-1, -1, -1, -1)

The yaml configuration file is as follows:

behaviors:
  Rbehaviour:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.005
      learning_rate_schedule: constant
      batch_size: 16 
      buffer_size: 2048
      buffer_init_steps: 1000
      tau: 0.01
      steps_per_update: 100
      save_replay_buffer: false
      init_entcoef: 0.1
      reward_signal_steps_per_update: 10.0
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
    keep_checkpoints: 5
    checkpoint_interval: 50000
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 1000
    threaded: true
    reward_signals:
      curiosity:
        strength: 0.2
        gamma: 0.99
        encoding_size: 256
        learning_rate: 1e-5
      extrinsic:
        gamma: 0.995
        strength: 0.9
      gail:
        gamma: 0.99
        strength: 0.5
        encoding_size: 256
        learning_rate: 0.003
        use_actions: true
        use_vail: false
        demo_path: Assets/Demonstrations/demo.demo
    behavioral_cloning:
      demo_path: Assets/Demonstrations/demo.demo
      strength: 0.5
      steps: 150000

TensorBoard shows the following graphs: [two screenshots, Capture and Capture2, attached to the original issue]

Question: something happens around 771k steps and I would like to understand where it comes from (my guess is that something goes wrong in the SAC processing, but where should I look?). Could you please give some hints, in particular on how to instrument the SAC code to investigate further?

Thanks in advance!


ervteng commented 3 years ago

Hey @Dastyn, ugh, NaNs can be pretty crappy to debug. The first thing I'd check is whether there are any logs on the C# side saying that there are NaN observations. Also check that the observations and rewards are all of reasonable size (around -1 to 1, with no huge positive or negative values).
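
Something along these lines in CollectObservations would flag a bad value at the source. This is just a sketch: the class and field names are made up, only the Agent/VectorSensor API is real:

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class RagentDiagnostics : Agent  // hypothetical name, for illustration
{
    float m_SomeStateValue;  // stand-in for a real observation source

    public override void CollectObservations(VectorSensor sensor)
    {
        AddChecked(sensor, m_SomeStateValue, nameof(m_SomeStateValue));
        // ... same for the remaining vector observations
    }

    // Wraps AddObservation with a finiteness/range check.
    static void AddChecked(VectorSensor sensor, float value, string name)
    {
        if (float.IsNaN(value) || float.IsInfinity(value) || Mathf.Abs(value) > 1f)
        {
            Debug.LogWarning($"Suspicious observation {name} = {value}");
        }
        sensor.AddObservation(value);
    }
}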

Also, from your plots, is the Curiosity or the GAIL reward hitting NaN first? I suspect it's coming from one of those modules. Since Curiosity doesn't work too well with SAC anyway, turning it off might help.
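
Trying that just means deleting the curiosity block from the reward_signals section of your config, leaving (values copied unchanged from the original):

    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 0.9
      gail:
        gamma: 0.99
        strength: 0.5
        encoding_size: 256
        learning_rate: 0.003
        use_actions: true
        use_vail: false
        demo_path: Assets/Demonstrations/demo.demo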

Dastyn commented 3 years ago

Hi @ervteng, thanks for your reply.

Wrt NaN values in observations: I've already verified that no error-prone values are generated by the behaviours, observations, or rewards. All state observations are Mathf.Clamp'ed to make sure they stay within [-1f, 1f] (although, by design, they already fall in that interval).
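
Concretely, the clamping follows this pattern (a sketch with placeholder fields; only Mathf.Clamp and the VectorSensor API are real):

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class RagentObservations : Agent  // placeholder class name
{
    Vector3 m_Velocity;          // placeholder for the real state
    const float k_MaxSpeed = 5f; // placeholder normalization constant

    public override void CollectObservations(VectorSensor sensor)
    {
        // Normalize, then clamp to [-1, 1] before adding the observation.
        sensor.AddObservation(Mathf.Clamp(m_Velocity.x / k_MaxSpeed, -1f, 1f));
        // ... same pattern for the other eight vector observations
    }
}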

About the turning point of the NaN: impossible to say at this point; the NaN appears somewhere between two summary points (1000 steps apart) for both measurements.

I'll take your advice and turn Curiosity off in the next tries. Thanks again.

Dastyn commented 3 years ago

Definitely difficult to investigate. Closing the issue, then.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.