Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Running multiple headless simulations on same node makes environment stop working #5337

Closed Tobblatzius closed 3 years ago

Tobblatzius commented 3 years ago

I run Unity simulations with ML-Agents on an HPC cluster. I build my simulations in headless mode. When I run two simulations on the same node, one of them always crashes within a few minutes with the error message shown below. If the two simulations run on different nodes, there is no problem. The two simulations are nowhere near maxing out the RAM, CPU, or GPU usage of the node they run on. Each node has 2 x 16-core Intel(R) Xeon(R) Gold 6226R CPUs @ 2.90GHz (32 cores in total) and 8 x Nvidia Tesla T4 GPUs with 16 GB RAM. I use ml-agents version 0.23.0. I am not sure how to proceed.
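For reference, each job launches the trainer against its own build roughly like this (a sketch; the config file name and exact flags are placeholders, not my exact commands):

mlagents-learn config.yaml --env=builds/dynamics/0/ecosim.x86_64 --run-id=dynamics_0
mlagents-learn config.yaml --env=builds/dynamics/1/ecosim.x86_64 --run-id=dynamics_1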


2021-04-26 18:05:51 INFO [learn.py:275] run_seed set to 2363

                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

 Version information:
  ml-agents: 0.23.0,
  ml-agents-envs: 0.23.0,
  Communicator API: 1.3.0,
  PyTorch: 1.7.1
Found path: /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics/0/ecosim.x86_64
2021-04-26 18:05:55 INFO [environment.py:110] Connected to Unity environment with package version 1.0.6 and communication version 1.0.0
2021-04-26 18:05:57 INFO [environment.py:271] Connected new brain:
Deer?team=0
2021-04-26 18:05:57 INFO [environment.py:271] Connected new brain:
Wolf?team=0
2021-04-26 18:05:57 WARNING [stats.py:189] events.out.tfevents.1619446059.alvis1-03.79557.0 was left over from a previous run. Deleting.
2021-04-26 18:05:57 INFO [stats.py:145] Hyperparameters for behavior name Wolf: 
    trainer_type:   ppo
    hyperparameters:    
      batch_size:   256
      buffer_size:  10240
      learning_rate:    0.0
      beta: 0.005
      epsilon:  0.2
      lambd:    0.95
      num_epoch:    5
      learning_rate_schedule:   linear
    network_settings:   
      normalize:    False
      hidden_units: 256
      num_layers:   2
      vis_encode_type:  simple
      memory:   None
    reward_signals: 
      extrinsic:    
        gamma:  0.99
        strength:   1.0
    init_path:  ../../../ecosimdata/results/e2_large_noext/Wolf
    keep_checkpoints:   5
    checkpoint_interval:    500000
    max_steps:  1000000000
    time_horizon:   1024
    summary_freq:   10
    threaded:   True
    self_play:  None
    behavioral_cloning: None
    framework:  pytorch
2021-04-26 18:06:00 INFO [torch_model_saver.py:96] Starting training from step 0 and saving to results/dynamics_0/Wolf.
2021-04-26 18:06:00 WARNING [stats.py:189] events.out.tfevents.1619446061.alvis1-03.79557.1 was left over from a previous run. Deleting.
2021-04-26 18:06:00 INFO [stats.py:145] Hyperparameters for behavior name Deer: 
    trainer_type:   ppo
    hyperparameters:    
      batch_size:   256
      buffer_size:  10240
      learning_rate:    0.0
      beta: 0.005
      epsilon:  0.2
      lambd:    0.95
      num_epoch:    5
      learning_rate_schedule:   linear
    network_settings:   
      normalize:    False
      hidden_units: 256
      num_layers:   2
      vis_encode_type:  simple
      memory:   None
    reward_signals: 
      extrinsic:    
        gamma:  0.99
        strength:   1.0
    init_path:  ../../../ecosimdata/results/e2_large_noext/Deer
    keep_checkpoints:   5
    checkpoint_interval:    500000
    max_steps:  1000000000
    time_horizon:   1024
    summary_freq:   10
    threaded:   True
    self_play:  None
    behavioral_cloning: None
    framework:  pytorch
2021-04-26 18:06:00 INFO [torch_model_saver.py:96] Starting training from step 0 and saving to results/dynamics_0/Deer.
2021-04-26 18:06:51 INFO [stats.py:139] Deer. Step: 30. Time Elapsed: 59.961 s. Mean Reward: -0.017. Std of Reward: 0.000. Training.
2021-04-26 18:07:42 INFO [stats.py:139] Deer. Step: 70. Time Elapsed: 110.586 s. Mean Reward: -0.023. Std of Reward: 0.000. Training.
2021-04-26 18:08:20 INFO [stats.py:139] Deer. Step: 130. Time Elapsed: 148.659 s. Mean Reward: -0.029. Std of Reward: 0.000. Training.
2021-04-26 18:08:34 INFO [stats.py:139] Deer. Step: 220. Time Elapsed: 162.839 s. Mean Reward: -0.060. Std of Reward: 0.000. Training.
2021-04-26 18:09:06 INFO [stats.py:139] Deer. Step: 310. Time Elapsed: 194.257 s. Mean Reward: 1.561. Std of Reward: 0.000. Training.
2021-04-26 18:09:10 INFO [stats.py:139] Deer. Step: 420. Time Elapsed: 199.001 s. Mean Reward: -0.060. Std of Reward: 0.000. Training.
2021-04-26 18:09:45 INFO [stats.py:139] Deer. Step: 530. Time Elapsed: 233.468 s. Mean Reward: -0.044. Std of Reward: 0.000. Training.
2021-04-26 18:11:18 INFO [stats.py:139] Deer. Step: 660. Time Elapsed: 326.877 s. Mean Reward: 0.576. Std of Reward: 0.000. Training.
2021-04-26 18:11:58 INFO [stats.py:139] Deer. Step: 720. Time Elapsed: 366.412 s. Mean Reward: -0.027. Std of Reward: 0.000. Training.
2021-04-26 18:12:01 INFO [stats.py:139] Deer. Step: 890. Time Elapsed: 369.392 s. Mean Reward: 0.558. Std of Reward: 0.000. Training.
2021-04-26 18:12:13 INFO [stats.py:139] Deer. Step: 970. Time Elapsed: 381.272 s. Mean Reward: -0.021. Std of Reward: 0.000. Training.
2021-04-26 18:12:40 INFO [stats.py:139] Deer. Step: 1140. Time Elapsed: 408.681 s. Mean Reward: -0.080. Std of Reward: 0.000. Training.
2021-04-26 18:12:55 INFO [stats.py:139] Deer. Step: 1330. Time Elapsed: 423.476 s. Mean Reward: -0.090. Std of Reward: 0.000. Training.
2021-04-26 18:13:40 INFO [stats.py:139] Deer. Step: 1520. Time Elapsed: 468.364 s. Mean Reward: -0.067. Std of Reward: 0.000. Training.
2021-04-26 18:13:58 INFO [stats.py:139] Deer. Step: 1730. Time Elapsed: 486.872 s. Mean Reward: 0.786. Std of Reward: 0.000. Training.
2021-04-26 18:14:55 INFO [stats.py:139] Deer. Step: 1940. Time Elapsed: 543.806 s. Mean Reward: -0.118. Std of Reward: 0.000. Training.
2021-04-26 18:15:21 INFO [stats.py:139] Deer. Step: 2160. Time Elapsed: 569.275 s. Mean Reward: -0.091. Std of Reward: 0.000. Training.
2021-04-26 18:15:47 INFO [stats.py:139] Deer. Step: 2300. Time Elapsed: 595.904 s. Mean Reward: 0.674. Std of Reward: 0.000. Training.
2021-04-26 18:17:19 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-04-26 18:17:20 INFO [environment.py:407] Environment shut down with return code 0.
2021-04-26 18:17:20 INFO [model_serialization.py:104] Converting to results/dynamics_0/Wolf/Wolf-0.onnx
2021-04-26 18:17:20 INFO [model_serialization.py:116] Exported results/dynamics_0/Wolf/Wolf-0.onnx
2021-04-26 18:17:20 INFO [torch_model_saver.py:116] Copied results/dynamics_0/Wolf/Wolf-0.onnx to results/dynamics_0/Wolf.onnx.
2021-04-26 18:17:20 INFO [model_serialization.py:104] Converting to results/dynamics_0/Deer/Deer-2380.onnx
2021-04-26 18:17:20 INFO [model_serialization.py:116] Exported results/dynamics_0/Deer/Deer-2380.onnx
2021-04-26 18:17:20 INFO [torch_model_saver.py:116] Copied results/dynamics_0/Deer/Deer-2380.onnx to results/dynamics_0/Deer.onnx.
2021-04-26 18:17:20 INFO [trainer_controller.py:85] Saved Model
Traceback (most recent call last):
  File "/cephyr/users/tobiaka/Alvis/.local/bin/mlagents-learn", line 8, in <module>
    sys.exit(main())
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 280, in main
    run_cli(parse_command_line())
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 276, in run_cli
    run_training(run_seed, options)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 153, in run_training
    tc.start_learning(env_manager)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 234, in advance
    new_step_infos = env_manager.get_steps()
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/env_manager.py", line 113, in get_steps
    new_step_infos = self._step()
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/subprocess_env_manager.py", line 276, in _step
    raise env_exception
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
     The environment does not need user interaction to launch
     The Agents' Behavior Parameters > Behavior Type is set to "Default"
     The environment and the Python interface have compatible versions.
vincentpierre commented 3 years ago

It looks like the environment stopped responding. I see two possibilities: either one of the two executables never launched (something in the environment prevents two instances from running at the same time on the same node), or one of the environments crashed somehow (ran out of resources, for example). Since the error seems to come from inside the Unity environment, can you look into the executable logs? They might give some hints as to why this happens.

Tobblatzius commented 3 years ago

I will check the logs, although I can't access them right now because the compute server is down. Do you mean the log in run_logs that ML-Agents puts in the results folder during a run? Do you have any idea what could cause one environment to prevent two from running at the same time? The second possibility seems less likely, given the large amount of resources available and the fact that the environment runs fine when it is the only Unity simulation on the node (other jobs can run alongside it without problems, just not other Unity jobs).

Tobblatzius commented 3 years ago

This is the output I get on standard output and in results/simulation/Player-0.log when I start two simulations on the same node. When they are started simultaneously, both seem to fail to start.

For the first job

Player-0.log

Mono path[0] = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Managed'
Mono config path = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/MonoBleedingEdge/etc'
Preloaded 'lib_burst_generated.so'
Preloaded 'libgrpc_csharp_ext.x64.so'
Initialize engine version: 2020.1.9f1 (145f5172610f)
[Subsystems] Discovering subsystems at path /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/UnitySubsystems
Forcing GfxDevice: Null
GfxDevice: creating device client; threaded=0
NullGfxDevice:
    Version:  NULL 1.0 [1.0]
    Renderer: Null Device
    Vendor:   Unity Technologies
Begin MonoManager ReloadAssembly
- Completed reload, in  1.909 seconds
ERROR: Shader Sprites/Default shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Sprites/Mask shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader GUI/Text Shader shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Legacy Shaders/VertexLit shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard (Specular setup) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Legacy Shaders/Particles/Alpha Blended Premultiply shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader FX/Water (Basic) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
UnloadTime: 0.759724 ms
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
WARNING: The communication API versions between Unity and python differ at the minor version level. Python API: 1.3.0, Unity API: 1.0.0 Python Library Version: 0.23.0 .
This means that some features may not work unless you upgrade the package with the lower version.Please find the versions that work best together from our release page.
https://github.com/Unity-Technologies/ml-agents/releases
Setting up 16 worker threads for Enlighten.
  Thread -> id: 2aec5c7fb700 -> priority: 1 
  Thread -> id: 2aec5c9fc700 -> priority: 1 
  Thread -> id: 2aec5cbfd700 -> priority: 1 
  Thread -> id: 2aec5cdfe700 -> priority: 1 
  Thread -> id: 2aec5cfff700 -> priority: 1 
  Thread -> id: 2aec5d200700 -> priority: 1 
  Thread -> id: 2aec5d401700 -> priority: 1 
  Thread -> id: 2aec5d602700 -> priority: 1 
  Thread -> id: 2aec5d803700 -> priority: 1 
  Thread -> id: 2aec5da04700 -> priority: 1 
  Thread -> id: 2aec5dc05700 -> priority: 1 
  Thread -> id: 2aec5de06700 -> priority: 1 
  Thread -> id: 2aec5e007700 -> priority: 1 
  Thread -> id: 2aec5e208700 -> priority: 1 
  Thread -> id: 2aec5e409700 -> priority: 1 
  Thread -> id: 2aec5e60a700 -> priority: 1 

Standard output:

2021-05-04 10:01:32.186033: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-05-04 10:02:31 INFO [learn.py:275] run_seed set to 5767

                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

 Version information:
  ml-agents: 0.23.0,
  ml-agents-envs: 0.23.0,
  Communicator API: 1.3.0,
  PyTorch: 1.7.1
Found path: /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim.x86_64
2021-05-04 10:02:35 INFO [environment.py:110] Connected to Unity environment with package version 1.0.6 and communication version 1.0.0
2021-05-04 10:02:35 INFO [environment.py:271] Connected new brain:
Deer?team=0
2021-05-04 10:02:35 INFO [environment.py:271] Connected new brain:
Wolf?team=0
2021-05-04 10:02:35 INFO [stats.py:145] Hyperparameters for behavior name Deer: 
    trainer_type:   ppo
    hyperparameters:    
      batch_size:   256
      buffer_size:  10240
      learning_rate:    0.0003
      beta: 0.005
      epsilon:  0.2
      lambd:    0.95
      num_epoch:    5
      learning_rate_schedule:   linear
    network_settings:   
      normalize:    False
      hidden_units: 256
      num_layers:   2
      vis_encode_type:  simple
      memory:   None
    reward_signals: 
      extrinsic:    
        gamma:  0.99
        strength:   1.0
    init_path:  ../../../ecosimdata/results/e1_large/Deer
    keep_checkpoints:   5
    checkpoint_interval:    500000
    max_steps:  1000000000
    time_horizon:   1024
    summary_freq:   10
    threaded:   True
    self_play:  None
    behavioral_cloning: None
    framework:  pytorch
2021-05-04 10:02:38 INFO [torch_model_saver.py:96] Starting training from step 0 and saving to results/dynamics_1/Deer.
2021-05-04 10:02:38 INFO [stats.py:145] Hyperparameters for behavior name Wolf: 
    trainer_type:   ppo
    hyperparameters:    
      batch_size:   256
      buffer_size:  10240
      learning_rate:    0.0003
      beta: 0.005
      epsilon:  0.2
      lambd:    0.95
      num_epoch:    5
      learning_rate_schedule:   linear
    network_settings:   
      normalize:    False
      hidden_units: 256
      num_layers:   2
      vis_encode_type:  simple
      memory:   None
    reward_signals: 
      extrinsic:    
        gamma:  0.99
        strength:   1.0
    init_path:  ../../../ecosimdata/results/e1_large/Wolf
    keep_checkpoints:   5
    checkpoint_interval:    500000
    max_steps:  1000000000
    time_horizon:   1024
    summary_freq:   10
    threaded:   True
    self_play:  None
    behavioral_cloning: None
    framework:  pytorch
2021-05-04 10:02:38 INFO [torch_model_saver.py:96] Starting training from step 0 and saving to results/dynamics_1/Wolf.
2021-05-04 10:02:38 ERROR [_server.py:445] Exception calling application: Ran out of input
Traceback (most recent call last):
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/rpc_communicator.py", line 30, in Exchange
    return self.child_conn.recv()
  File "/apps/Alvis/software/Compiler/GCCcore/10.2.0/Python/3.8.6/lib/python3.8/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
EOFError: Ran out of input
2021-05-04 10:02:38 ERROR [_server.py:445] Exception calling application: invalid load key, '\x04'.
Traceback (most recent call last):
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/rpc_communicator.py", line 30, in Exchange
    return self.child_conn.recv()
  File "/apps/Alvis/software/Compiler/GCCcore/10.2.0/Python/3.8.6/lib/python3.8/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
_pickle.UnpicklingError: invalid load key, '\x04'.
2021-05-04 10:03:38 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-05-04 10:03:38 INFO [environment.py:407] Environment shut down with return code 0.
2021-05-04 10:03:38 INFO [model_serialization.py:104] Converting to results/dynamics_1/Deer/Deer-0.onnx
2021-05-04 10:03:38 INFO [model_serialization.py:116] Exported results/dynamics_1/Deer/Deer-0.onnx
2021-05-04 10:03:38 INFO [torch_model_saver.py:116] Copied results/dynamics_1/Deer/Deer-0.onnx to results/dynamics_1/Deer.onnx.
2021-05-04 10:03:39 INFO [model_serialization.py:104] Converting to results/dynamics_1/Wolf/Wolf-0.onnx
2021-05-04 10:03:39 INFO [model_serialization.py:116] Exported results/dynamics_1/Wolf/Wolf-0.onnx
2021-05-04 10:03:39 INFO [torch_model_saver.py:116] Copied results/dynamics_1/Wolf/Wolf-0.onnx to results/dynamics_1/Wolf.onnx.
2021-05-04 10:03:39 INFO [trainer_controller.py:85] Saved Model
Traceback (most recent call last):
  File "/cephyr/users/tobiaka/Alvis/.local/bin/mlagents-learn", line 8, in <module>
    sys.exit(main())
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 280, in main
    run_cli(parse_command_line())
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 276, in run_cli
    run_training(run_seed, options)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 153, in run_training
    tc.start_learning(env_manager)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 234, in advance
    new_step_infos = env_manager.get_steps()
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/env_manager.py", line 113, in get_steps
    new_step_infos = self._step()
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/subprocess_env_manager.py", line 276, in _step
    raise env_exception
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
     The environment does not need user interaction to launch
     The Agents' Behavior Parameters > Behavior Type is set to "Default"
     The environment and the Python interface have compatible versions.

Second job: Player-0.log

Mono path[0] = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Managed'
Mono config path = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/MonoBleedingEdge/etc'
Preloaded 'lib_burst_generated.so'
Preloaded 'libgrpc_csharp_ext.x64.so'
Initialize engine version: 2020.1.9f1 (145f5172610f)
[Subsystems] Discovering subsystems at path /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/UnitySubsystems
Forcing GfxDevice: Null
GfxDevice: creating device client; threaded=0
NullGfxDevice:
    Version:  NULL 1.0 [1.0]
    Renderer: Null Device
    Vendor:   Unity Technologies
Begin MonoManager ReloadAssembly
- Completed reload, in  2.447 seconds
ERROR: Shader Sprites/Default shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Sprites/Mask shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader GUI/Text Shader shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Legacy Shaders/VertexLit shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard (Specular setup) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Legacy Shaders/Particles/Alpha Blended Premultiply shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader FX/Water (Basic) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
UnloadTime: 0.971674 ms
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Couldn't connect to trainer on port 5005 using API version 1.0.0. Will perform inference instead.

Standard output:

2021-05-04 10:01:39.159955: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-05-04 10:02:31 INFO [learn.py:275] run_seed set to 6525

                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

 Version information:
  ml-agents: 0.23.0,
  ml-agents-envs: 0.23.0,
  Communicator API: 1.3.0,
  PyTorch: 1.7.1
Found path: /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim.x86_64
2021-05-04 10:03:31 INFO [environment.py:409] Environment timed out shutting down. Killing...
2021-05-04 10:03:31 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-05-04 10:03:31 INFO [trainer_controller.py:85] Saved Model
Traceback (most recent call last):
  File "/cephyr/users/tobiaka/Alvis/.local/bin/mlagents-learn", line 8, in <module>
    sys.exit(main())
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 280, in main
    run_cli(parse_command_line())
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 276, in run_cli
    run_training(run_seed, options)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 153, in run_training
    tc.start_learning(env_manager)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 174, in start_learning
    self._reset_env(env_manager)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 109, in _reset_env
    env_manager.reset(config=new_config)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/env_manager.py", line 67, in reset
    self.first_step_infos = self._reset_env(config)
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/subprocess_env_manager.py", line 299, in _reset_env
    ew.previous_step = EnvironmentStep(ew.recv().payload, ew.worker_id, {}, {})
  File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/subprocess_env_manager.py", line 95, in recv
    raise env_exception
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
     The environment does not need user interaction to launch
     The Agents' Behavior Parameters > Behavior Type is set to "Default"
     The environment and the Python interface have compatible versions.

What is interesting is the last line in the player log for the second simulation: "Couldn't connect to trainer on port 5005 using API version 1.0.0. Will perform inference instead." This is probably the cause of the error. The second job can't connect to the trainer and falls back to inference only, and this somehow also makes the first job fail.
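One way to confirm the collision on the node (a generic diagnostic sketch, nothing ML-Agents-specific) would be to check whether the first trainer is already listening on the default port when the second job starts:

# show which process, if any, is listening on the default trainer port 5005
ss -tlnp | grep ':5005'
# or, if lsof is available on the node
lsof -i :5005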

vincentpierre commented 3 years ago

I do not know what is causing this. It could be that both executables are trying to communicate on the same port and there is a collision. You should try playing with the --base-port argument and make it different for the two jobs.
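For example, something along these lines (the config file and env paths are placeholders; only --base-port is the relevant difference), leaving a gap between the two port ranges in case a run spawns more than one environment worker:

mlagents-learn config.yaml --env=builds/dynamics/0/ecosim.x86_64 --run-id=dynamics_0 --base-port=5005
mlagents-learn config.yaml --env=builds/dynamics/1/ecosim.x86_64 --run-id=dynamics_1 --base-port=5010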

Tobblatzius commented 3 years ago

This actually seems to solve the issue, great! I specified an individual --base-port for each job I submitted. Thanks @vincentpierre!
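In case it helps anyone else running this on a cluster, here is a small sketch of how the port can be derived automatically per job (assuming a Slurm-style scheduler with array jobs; the offset of 10 and the paths are illustrative, not my exact script):

#!/bin/bash
#SBATCH --array=0-1
# Each array task gets its own port range; 5005 is the ML-Agents default base port
PORT=$((5005 + 10 * SLURM_ARRAY_TASK_ID))
mlagents-learn config.yaml \
    --env=builds/dynamics/${SLURM_ARRAY_TASK_ID}/ecosim.x86_64 \
    --run-id=dynamics_${SLURM_ARRAY_TASK_ID} \
    --base-port=${PORT}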

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.