Closed Tobblatzius closed 3 years ago
I run Unity simulations with ML-Agents on an HPC cluster. I build my simulations in headless mode. When I run two simulations on the same node, one of them always crashes within a few minutes, giving me the timeout error shown in the logs further down. If the two simulations are on different nodes, this is not a problem. The two simulations are not close to maxing out the RAM, CPU, or GPU usage of the node they run on. Each node has 2 x 16-core Intel(R) Xeon(R) Gold 6226R CPUs @ 2.90GHz (32 cores in total) and 8 x Nvidia Tesla T4 GPUs with 16 GB of memory each. I use mlagents version 0.23.0. I am not sure how to proceed.
It looks like the environment stopped responding. I see two possibilities: either one of the two executables never launched (something in the environment prevents two instances from running at the same time on the same node), or one of the environments somehow crashed (ran out of resources, for example). Since the error seems to come from inside the Unity environment, can you look into the executable logs? They might give some hints as to why this happens.
I will check the logs, though I can't access them right now because the compute server is down. Do you mean the logs in run_logs that mlagents puts in the results folder during a run? Do you have any idea what could cause one environment to prevent a second one from running at the same time? The second possibility seems less likely, given the large amount of resources available and the fact that the environment runs fine when it is the only Unity simulation running on the node (other jobs can run alongside it without problems, just not other Unity jobs).
This is the output I get on standard output and in results/simulation/Player-0.log when I start two simulations on the same node. When they are started simultaneously, both seem to fail to start.
First job: Player-0.log
Mono path[0] = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Managed'
Mono config path = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/MonoBleedingEdge/etc'
Preloaded 'lib_burst_generated.so'
Preloaded 'libgrpc_csharp_ext.x64.so'
Initialize engine version: 2020.1.9f1 (145f5172610f)
[Subsystems] Discovering subsystems at path /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/UnitySubsystems
Forcing GfxDevice: Null
GfxDevice: creating device client; threaded=0
NullGfxDevice:
Version: NULL 1.0 [1.0]
Renderer: Null Device
Vendor: Unity Technologies
Begin MonoManager ReloadAssembly
- Completed reload, in 1.909 seconds
ERROR: Shader Sprites/Default shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Sprites/Mask shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader GUI/Text Shader shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Legacy Shaders/VertexLit shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard (Specular setup) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Legacy Shaders/Particles/Alpha Blended Premultiply shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader FX/Water (Basic) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
UnloadTime: 0.759724 ms
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim_Data/Mono/libSystem.dylib
WARNING: The communication API versions between Unity and python differ at the minor version level. Python API: 1.3.0, Unity API: 1.0.0 Python Library Version: 0.23.0 .
This means that some features may not work unless you upgrade the package with the lower version.Please find the versions that work best together from our release page.
https://github.com/Unity-Technologies/ml-agents/releases
Setting up 16 worker threads for Enlighten.
Thread -> id: 2aec5c7fb700 -> priority: 1
Thread -> id: 2aec5c9fc700 -> priority: 1
Thread -> id: 2aec5cbfd700 -> priority: 1
Thread -> id: 2aec5cdfe700 -> priority: 1
Thread -> id: 2aec5cfff700 -> priority: 1
Thread -> id: 2aec5d200700 -> priority: 1
Thread -> id: 2aec5d401700 -> priority: 1
Thread -> id: 2aec5d602700 -> priority: 1
Thread -> id: 2aec5d803700 -> priority: 1
Thread -> id: 2aec5da04700 -> priority: 1
Thread -> id: 2aec5dc05700 -> priority: 1
Thread -> id: 2aec5de06700 -> priority: 1
Thread -> id: 2aec5e007700 -> priority: 1
Thread -> id: 2aec5e208700 -> priority: 1
Thread -> id: 2aec5e409700 -> priority: 1
Thread -> id: 2aec5e60a700 -> priority: 1
Standard output:
2021-05-04 10:01:32.186033: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-05-04 10:02:31 INFO [learn.py:275] run_seed set to 5767
▄▄▄▓▓▓▓
╓▓▓▓▓▓▓█▓▓▓▓▓
,▄▄▄m▀▀▀' ,▓▓▓▀▓▓▄ ▓▓▓ ▓▓▌
▄▓▓▓▀' ▄▓▓▀ ▓▓▓ ▄▄ ▄▄ ,▄▄ ▄▄▄▄ ,▄▄ ▄▓▓▌▄ ▄▄▄ ,▄▄
▄▓▓▓▀ ▄▓▓▀ ▐▓▓▌ ▓▓▌ ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌ ╒▓▓▌
▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓ ▓▀ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▄ ▓▓▌
▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄ ▓▓ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▐▓▓
^█▓▓▓ ▀▓▓▄ ▐▓▓▌ ▓▓▓▓▄▓▓▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▓▄ ▓▓▓▓`
'▀▓▓▓▄ ^▓▓▓ ▓▓▓ └▀▀▀▀ ▀▀ ^▀▀ `▀▀ `▀▀ '▀▀ ▐▓▓▌
▀▀▀▀▓▄▄▄ ▓▓▓▓▓▓, ▓▓▓▓▀
`▀█▓▓▓▓▓▓▓▓▓▌
¬`▀▀▀█▓
Version information:
ml-agents: 0.23.0,
ml-agents-envs: 0.23.0,
Communicator API: 1.3.0,
PyTorch: 1.7.1
Found path: /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/1/ecosim.x86_64
2021-05-04 10:02:35 INFO [environment.py:110] Connected to Unity environment with package version 1.0.6 and communication version 1.0.0
2021-05-04 10:02:35 INFO [environment.py:271] Connected new brain:
Deer?team=0
2021-05-04 10:02:35 INFO [environment.py:271] Connected new brain:
Wolf?team=0
2021-05-04 10:02:35 INFO [stats.py:145] Hyperparameters for behavior name Deer:
trainer_type: ppo
hyperparameters:
  batch_size: 256
  buffer_size: 10240
  learning_rate: 0.0003
  beta: 0.005
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 5
  learning_rate_schedule: linear
network_settings:
  normalize: False
  hidden_units: 256
  num_layers: 2
  vis_encode_type: simple
  memory: None
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
init_path: ../../../ecosimdata/results/e1_large/Deer
keep_checkpoints: 5
checkpoint_interval: 500000
max_steps: 1000000000
time_horizon: 1024
summary_freq: 10
threaded: True
self_play: None
behavioral_cloning: None
framework: pytorch
2021-05-04 10:02:38 INFO [torch_model_saver.py:96] Starting training from step 0 and saving to results/dynamics_1/Deer.
2021-05-04 10:02:38 INFO [stats.py:145] Hyperparameters for behavior name Wolf:
trainer_type: ppo
hyperparameters:
  batch_size: 256
  buffer_size: 10240
  learning_rate: 0.0003
  beta: 0.005
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 5
  learning_rate_schedule: linear
network_settings:
  normalize: False
  hidden_units: 256
  num_layers: 2
  vis_encode_type: simple
  memory: None
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
init_path: ../../../ecosimdata/results/e1_large/Wolf
keep_checkpoints: 5
checkpoint_interval: 500000
max_steps: 1000000000
time_horizon: 1024
summary_freq: 10
threaded: True
self_play: None
behavioral_cloning: None
framework: pytorch
2021-05-04 10:02:38 INFO [torch_model_saver.py:96] Starting training from step 0 and saving to results/dynamics_1/Wolf.
2021-05-04 10:02:38 ERROR [_server.py:445] Exception calling application: Ran out of input
Traceback (most recent call last):
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/rpc_communicator.py", line 30, in Exchange
return self.child_conn.recv()
File "/apps/Alvis/software/Compiler/GCCcore/10.2.0/Python/3.8.6/lib/python3.8/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
EOFError: Ran out of input
2021-05-04 10:02:38 ERROR [_server.py:445] Exception calling application: invalid load key, '\x04'.
Traceback (most recent call last):
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/rpc_communicator.py", line 30, in Exchange
return self.child_conn.recv()
File "/apps/Alvis/software/Compiler/GCCcore/10.2.0/Python/3.8.6/lib/python3.8/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
_pickle.UnpicklingError: invalid load key, '\x04'.
2021-05-04 10:03:38 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-05-04 10:03:38 INFO [environment.py:407] Environment shut down with return code 0.
2021-05-04 10:03:38 INFO [model_serialization.py:104] Converting to results/dynamics_1/Deer/Deer-0.onnx
2021-05-04 10:03:38 INFO [model_serialization.py:116] Exported results/dynamics_1/Deer/Deer-0.onnx
2021-05-04 10:03:38 INFO [torch_model_saver.py:116] Copied results/dynamics_1/Deer/Deer-0.onnx to results/dynamics_1/Deer.onnx.
2021-05-04 10:03:39 INFO [model_serialization.py:104] Converting to results/dynamics_1/Wolf/Wolf-0.onnx
2021-05-04 10:03:39 INFO [model_serialization.py:116] Exported results/dynamics_1/Wolf/Wolf-0.onnx
2021-05-04 10:03:39 INFO [torch_model_saver.py:116] Copied results/dynamics_1/Wolf/Wolf-0.onnx to results/dynamics_1/Wolf.onnx.
2021-05-04 10:03:39 INFO [trainer_controller.py:85] Saved Model
Traceback (most recent call last):
File "/cephyr/users/tobiaka/Alvis/.local/bin/mlagents-learn", line 8, in <module>
sys.exit(main())
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 280, in main
run_cli(parse_command_line())
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 276, in run_cli
run_training(run_seed, options)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 153, in run_training
tc.start_learning(env_manager)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 176, in start_learning
n_steps = self.advance(env_manager)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 234, in advance
new_step_infos = env_manager.get_steps()
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/env_manager.py", line 113, in get_steps
new_step_infos = self._step()
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/subprocess_env_manager.py", line 276, in _step
raise env_exception
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.
Second job: Player-0.log
Mono path[0] = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Managed'
Mono config path = '/cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/MonoBleedingEdge/etc'
Preloaded 'lib_burst_generated.so'
Preloaded 'libgrpc_csharp_ext.x64.so'
Initialize engine version: 2020.1.9f1 (145f5172610f)
[Subsystems] Discovering subsystems at path /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/UnitySubsystems
Forcing GfxDevice: Null
GfxDevice: creating device client; threaded=0
NullGfxDevice:
Version: NULL 1.0 [1.0]
Renderer: Null Device
Vendor: Unity Technologies
Begin MonoManager ReloadAssembly
- Completed reload, in 2.447 seconds
ERROR: Shader Sprites/Default shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Sprites/Mask shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader GUI/Text Shader shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader Legacy Shaders/VertexLit shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard (Specular setup) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard (Specular setup)' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Legacy Shaders/Particles/Alpha Blended Premultiply shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
ERROR: Shader FX/Water (Basic) shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
ERROR: Shader Standard shader is not supported on this GPU (none of subshaders/fallbacks are suitable)
WARNING: Shader Unsupported: 'Standard' - All subshaders removed
WARNING: Shader Did you use #pragma only_renderers and omit this platform?
WARNING: Shader If subshaders removal was intentional, you may have forgotten turning Fallback off?
UnloadTime: 0.971674 ms
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libcoreclr.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib.so
Fallback handler could not load library /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim_Data/Mono/libSystem.dylib
Couldn't connect to trainer on port 5005 using API version 1.0.0. Will perform inference instead.
Standard output:
2021-05-04 10:01:39.159955: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-05-04 10:02:31 INFO [learn.py:275] run_seed set to 6525
▄▄▄▓▓▓▓
╓▓▓▓▓▓▓█▓▓▓▓▓
,▄▄▄m▀▀▀' ,▓▓▓▀▓▓▄ ▓▓▓ ▓▓▌
▄▓▓▓▀' ▄▓▓▀ ▓▓▓ ▄▄ ▄▄ ,▄▄ ▄▄▄▄ ,▄▄ ▄▓▓▌▄ ▄▄▄ ,▄▄
▄▓▓▓▀ ▄▓▓▀ ▐▓▓▌ ▓▓▌ ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌ ╒▓▓▌
▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓ ▓▀ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▄ ▓▓▌
▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄ ▓▓ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▐▓▓
^█▓▓▓ ▀▓▓▄ ▐▓▓▌ ▓▓▓▓▄▓▓▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▓▄ ▓▓▓▓`
'▀▓▓▓▄ ^▓▓▓ ▓▓▓ └▀▀▀▀ ▀▀ ^▀▀ `▀▀ `▀▀ '▀▀ ▐▓▓▌
▀▀▀▀▓▄▄▄ ▓▓▓▓▓▓, ▓▓▓▓▀
`▀█▓▓▓▓▓▓▓▓▓▌
¬`▀▀▀█▓
Version information:
ml-agents: 0.23.0,
ml-agents-envs: 0.23.0,
Communicator API: 1.3.0,
PyTorch: 1.7.1
Found path: /cephyr/users/tobiaka/Alvis/ecosim-predator-prey/builds/dynamics_e1/2/ecosim.x86_64
2021-05-04 10:03:31 INFO [environment.py:409] Environment timed out shutting down. Killing...
2021-05-04 10:03:31 INFO [subprocess_env_manager.py:186] UnityEnvironment worker 0: environment stopping.
2021-05-04 10:03:31 INFO [trainer_controller.py:85] Saved Model
Traceback (most recent call last):
File "/cephyr/users/tobiaka/Alvis/.local/bin/mlagents-learn", line 8, in <module>
sys.exit(main())
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 280, in main
run_cli(parse_command_line())
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 276, in run_cli
run_training(run_seed, options)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/learn.py", line 153, in run_training
tc.start_learning(env_manager)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 174, in start_learning
self._reset_env(env_manager)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents_envs/timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/trainer_controller.py", line 109, in _reset_env
env_manager.reset(config=new_config)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/env_manager.py", line 67, in reset
self.first_step_infos = self._reset_env(config)
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/subprocess_env_manager.py", line 299, in _reset_env
ew.previous_step = EnvironmentStep(ew.recv().payload, ew.worker_id, {}, {})
File "/cephyr/users/tobiaka/Alvis/.local/lib/python3.8/site-packages/mlagents/trainers/subprocess_env_manager.py", line 95, in recv
raise env_exception
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.
What is interesting is the last line in the player log for the second simulation: "Couldn't connect to trainer on port 5005 using API version 1.0.0. Will perform inference instead." This is probably the cause of the error. The second job can't connect to its trainer, falls back to inference-only mode, and this somehow also makes the first job fail.
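A quick way to confirm a clash on that port, for anyone hitting something similar, would be to check what is already bound to it on the node before starting the second job (hypothetical commands, not something I have verified on this cluster):
ss -ltnp | grep 5005    # list listening TCP sockets and filter for the default ML-Agents port
lsof -i :5005           # alternative, if lsof is available on the node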
I do not know what is causing this. It could be that both executables are trying to communicate on the same port and there is a collision going on. You should try playing with the --base-port argument and make it different on both jobs.
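For example, something along these lines (just a sketch; the trainer config path, run IDs and exact port numbers are placeholders rather than your actual setup):
# Job 1: trainer and executable use the default port range starting at 5005
mlagents-learn config/ppo_config.yaml --env builds/dynamics_e1/1/ecosim.x86_64 --run-id dynamics_1 --base-port 5005
# Job 2: shifted base port so the two runs cannot collide on the same node
mlagents-learn config/ppo_config.yaml --env builds/dynamics_e1/2/ecosim.x86_64 --run-id dynamics_2 --base-port 5105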
This actually seems to solve the issue, great! I specified an individual --base-port for each job I submitted. Thanks @vincentpierre!
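For anyone running this through Slurm job arrays, the port can also be derived from the array index so it never needs to be set by hand. A sketch under that assumption (SLURM_ARRAY_TASK_ID is the standard Slurm variable; the offset, config path and build layout are placeholders):
#!/bin/bash
#SBATCH --array=1-2                      # one array task per simulation
# Give each array task its own port range; 50 ports of headroom is arbitrary,
# it just needs to exceed the number of Unity workers started per run.
BASE_PORT=$((5005 + 50 * SLURM_ARRAY_TASK_ID))
mlagents-learn config/ppo_config.yaml \
    --env builds/dynamics_e1/${SLURM_ARRAY_TASK_ID}/ecosim.x86_64 \
    --run-id dynamics_${SLURM_ARRAY_TASK_ID} \
    --base-port ${BASE_PORT}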