Closed qiwu57kevin closed 4 years ago
Hi,
It would help greatly if I had a way to reproduce this error. Please fill out the To Reproduce section of the template.
The error seems to come from the tensorboard summary writer trying to write something not allowed when writing the hyperparameters. Can you provide the config yaml you used?
I have updated and attached the config I used.
I am unable to reproduce the bug. The steps to reproduce seem to amount to running 3DBall in a new virtual environment with the provided config, which works for me. My guess is that the error you are getting is due to the summary writers being unable to deserialize the hyperparameters. I have never seen this error, and I need more information. From the error, it seems there is a "Gesture" behavior in the Unity scene, but it is not mentioned in the steps to reproduce.
The fact that it worked before and now does not suggests that something might have gone wrong during installation. I would recommend installing from scratch again, or making sure no files have been modified (git status).
Can you provide more details to help me reproduce the error?
The "Gesture" behavior is my own training behavior. I initially thought the error came from my own environment, but when I later tested the example environment, the same thing happened. I have updated the error message to the one generated by the 3DBall behavior. I will try to reinstall everything from scratch and see what happens.
I think it would help us if you told us what version of Python you are using in your virtual environment and the sha of the git commit you are on. The demo environments all work on my machine, so the error is probably an installation issue.
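For anyone gathering the details requested above, here is a minimal sketch that collects the Python version and the installed package versions in one place. The function name environment_report and the default package list are assumptions made for illustration, and importlib.metadata needs Python 3.8 or newer (the 3.7 interpreter discussed in this thread would need pip3 freeze instead).

```python
import sys
from importlib import metadata  # part of the standard library from Python 3.8


def environment_report(packages=("mlagents", "mlagents-envs", "tensorflow")):
    """Return the interpreter version and the installed version of each package.

    The name and the default package list are illustrative, not part of ML-Agents.
    """
    report = {
        "python": sys.version.split()[0],
        # True when the interpreter runs inside a venv-style virtual environment.
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting this output next to the git commit sha should give most of the information asked for here.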
The Python version I am using for my virtual environment is 3.7.6. I installed mlagents 0.18.0 and mlagents-envs 0.18.0 using pip install mlagents. The pip version is 20.1.1. I didn't clone the git repo and install from there. I followed every step as instructed, but I still couldn't get it to work. I have reinstalled a few times and the same thing happens. Do you think this could be a problem with the Python version?
I did a fresh install with Python 3.7.6 and pip 20.1.1, and installed mlagents==0.18.0 with pip3. I was able to train 3DBall without problems. Are you sure you used a clean, new virtual environment?
Can you post the output of the command pip3 freeze in this issue?
Here is what I get from pip3 freeze:
(mlagents-env) D:\ML-Agents\mlagents-env\Scripts>pip3 freeze
WARNING: Could not generate requirement for distribution -ip 19.2.3 (d:\ml-agents\mlagents-env\lib\site-packages): Parse error at "'-ip==19.'": Expected W:(abcd...)
absl-py==0.9.0
astunparse==1.6.3
attrs==19.3.0
cachetools==4.1.1
cattrs==1.0.0
certifi==2020.6.20
chardet==3.0.4
cloudpickle==1.5.0
gast==0.3.3
google-auth==1.19.2
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.30.0
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
Keras-Preprocessing==1.1.2
Markdown==3.2.2
mlagents==0.18.0
mlagents-envs==0.18.0
numpy==1.19.1
oauthlib==3.1.0
opt-einsum==3.3.0
pi==0.1.2
Pillow==7.2.0
protobuf==3.12.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pypiwin32==223
pywin32==228
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.4.1
six==1.15.0
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow==2.2.0
tensorflow-estimator==2.2.0
termcolor==1.1.0
urllib3==1.25.9
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0
Okay, I got the same pip configuration as you, but it runs for me. On the bright side, I think I found something. The fact that you got the message
2020-07-21 18:20:54 WARNING [stats.py:235] Could not write text summary for Tensorboard.
means that the try/catch here failed and the string that was returned was "".
I have no idea why you get an error there and I don't, but any error in this try/catch would produce the error you are seeing.
I will try to make some fixes, but without knowing what the original error is, I am not sure I can do much.
If you could install from git by cloning the repo and reproduce the error, I can guide you through adding some debug statements that will help.
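To illustrate the failure mode described above, here is a minimal sketch that does not use TensorFlow at all. FakeSummaryWriter, build_text_summary and failing_build are hypothetical stand-ins, not the actual code in stats.py; the sketch only assumes that a helper swallows an exception and returns "" where the writer expects an object with a .value attribute.

```python
class FakeSummaryWriter:
    """Hypothetical stand-in for a TensorBoard summary writer."""

    def __init__(self):
        self.values = []

    def add_summary(self, summary, step):
        # Like the real writer, this assumes `summary` exposes `.value`.
        for value in summary.value:
            self.values.append((step, value))


def build_text_summary(build):
    """Return a summary object, or "" if building it fails (the fragile pattern)."""
    try:
        return build()
    except Exception:
        print("Could not write text summary for Tensorboard.")
        return ""


def failing_build():
    # Stand-in for whatever goes wrong inside the try block.
    raise RuntimeError("original error, hidden by the broad except")


writer = FakeSummaryWriter()
summary = build_text_summary(failing_build)

try:
    # The empty string reaches the writer and fails far from the real cause.
    writer.add_summary(summary, 0)
except AttributeError as err:
    error_message = str(err)
    print(error_message)
```

The point of the sketch is that the broad except hides the original exception, so the only visible symptom is the later AttributeError. Logging the caught exception, or skipping add_summary when the helper returns "", would make the root cause easier to find.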
I have tried installing mlagents and mlagents-envs by cloning the repo. The pip3 freeze output, including the commit sha, is shown below:
(mlagents-env) D:\ML-Agents\ml-agents\ml-agents>pip3 freeze
WARNING: Could not generate requirement for distribution -ip 19.2.3 (d:\ml-agents\mlagents-env\lib\site-packages): Parse error at "'-ip==19.'": Expected W:(abcd...)
absl-py==0.9.0
astunparse==1.6.3
attrs==19.3.0
cachetools==4.1.1
cattrs==1.0.0
certifi==2020.6.20
chardet==3.0.4
cloudpickle==1.5.0
gast==0.3.3
google-auth==1.19.2
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.30.0
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
Keras-Preprocessing==1.1.2
Markdown==3.2.2
-e git+https://github.com/Unity-Technologies/ml-agents.git@8327ddcb2a65ccb0a76ce6390811212d1daebb6e#egg=mlagents&subdirectory=ml-agents
-e git+https://github.com/Unity-Technologies/ml-agents.git@8327ddcb2a65ccb0a76ce6390811212d1daebb6e#egg=mlagents_envs&subdirectory=ml-agents-envs
numpy==1.19.1
oauthlib==3.1.0
opt-einsum==3.3.0
pi==0.1.2
Pillow==7.2.0
protobuf==3.12.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pypiwin32==223
pywin32==228
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.4.1
six==1.15.0
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow==2.2.0
tensorflow-estimator==2.2.0
termcolor==1.1.0
urllib3==1.25.9
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0
The error still occurs. The message is:
(mlagents-env) D:\ML-Agents\ml-agents\config>mlagents-learn ppo\3DBall.yaml --run-id=3DBall_testrun
2020-07-23 16:30:43.316650: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
▄▄▄▓▓▓▓
╓▓▓▓▓▓▓█▓▓▓▓▓
,▄▄▄m▀▀▀' ,▓▓▓▀▓▓▄ ▓▓▓ ▓▓▌
▄▓▓▓▀' ▄▓▓▀ ▓▓▓ ▄▄ ▄▄ ,▄▄ ▄▄▄▄ ,▄▄ ▄▓▓▌▄ ▄▄▄ ,▄▄
▄▓▓▓▀ ▄▓▓▀ ▐▓▓▌ ▓▓▌ ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌ ╒▓▓▌
▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓ ▓▀ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▄ ▓▓▌
▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄ ▓▓ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▐▓▓
^█▓▓▓ ▀▓▓▄ ▐▓▓▌ ▓▓▓▓▄▓▓▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▓▄ ▓▓▓▓`
'▀▓▓▓▄ ^▓▓▓ ▓▓▓ └▀▀▀▀ ▀▀ ^▀▀ `▀▀ `▀▀ '▀▀ ▐▓▓▌
▀▀▀▀▓▄▄▄ ▓▓▓▓▓▓, ▓▓▓▓▀
`▀█▓▓▓▓▓▓▓▓▓▌
¬`▀▀▀█▓
Version information:
ml-agents: 0.18.0,
ml-agents-envs: 0.18.0,
Communicator API: 1.0.0,
TensorFlow: 2.2.0
2020-07-23 16:30:45.848914: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-07-23 16:30:47 INFO [environment.py:199] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
2020-07-23 16:31:09 INFO [environment.py:108] Connected to Unity environment with package version 1.0.3 and communication version 1.0.0
2020-07-23 16:31:10 INFO [environment.py:265] Connected new brain:
3DBall?team=0
2020-07-23 16:31:10.033423: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-23 16:31:10.046351: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2c74563a0a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-23 16:31:10.054787: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-07-23 16:31:10.059908: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-23 16:31:10.228271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 238.66GiB/s
2020-07-23 16:31:10.237983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 16:31:10.247545: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 16:31:10.256215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 16:31:10.261961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 16:31:10.271281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 16:31:10.278897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 16:31:10.290463: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 16:31:10.301647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 16:31:10 WARNING [stats.py:235] Could not write text summary for Tensorboard.
2020-07-23 16:31:10 INFO [trainer_controller.py:76] Saved Model
Traceback (most recent call last):
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 130, in _create_trainer_and_manager
trainer = self.trainers[brain_name]
KeyError: '3DBall'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\ML-Agents\mlagents-env\Scripts\mlagents-learn-script.py", line 33, in <module>
sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 283, in main
run_cli(parse_command_line())
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 279, in run_cli
run_training(run_seed, options)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 158, in run_training
tc.start_learning(env_manager)
File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 181, in start_learning
self._create_trainers_and_managers(env_manager, new_behavior_ids)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 168, in _create_trainers_and_managers
self._create_trainer_and_manager(env_manager, behavior_id)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 132, in _create_trainer_and_manager
trainer = self.trainer_factory.generate(brain_name)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_util.py", line 52, in generate
self.multi_gpu,
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_util.py", line 101, in initialize_trainer
trainer_artifact_path,
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\ppo\trainer.py", line 48, in __init__
brain_name, trainer_settings, training, artifact_path, reward_buff_cap
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer\rl_trainer.py", line 38, in __init__
StatsPropertyType.HYPERPARAMETERS, self.trainer_settings.as_dict()
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\stats.py", line 321, in add_property
writer.add_property(self.category, property_type, value)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\stats.py", line 216, in add_property
self.summary_writers[category].add_summary(text, 0)
File "d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\summary\writer\writer.py", line 127, in add_summary
for value in summary.value:
AttributeError: 'str' object has no attribute 'value'
Ok, this was to be expected. Can you check out master and pull? I made changes on master that should fix the issue and will print more detailed logs about why this is happening. Make sure to run pip3 install -e . as the -e flag allows changes made in the repo to be reflected in your installed packages.
I have checked out master and pulled. However, another error occurred:
(mlagents-env) D:\ML-Agents\ml-agents\config>mlagents-learn ppo/3DBall.yaml --run-id=3DBall_test --force
2020-07-23 18:20:23.054329: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
▄▄▄▓▓▓▓
╓▓▓▓▓▓▓█▓▓▓▓▓
,▄▄▄m▀▀▀' ,▓▓▓▀▓▓▄ ▓▓▓ ▓▓▌
▄▓▓▓▀' ▄▓▓▀ ▓▓▓ ▄▄ ▄▄ ,▄▄ ▄▄▄▄ ,▄▄ ▄▓▓▌▄ ▄▄▄ ,▄▄
▄▓▓▓▀ ▄▓▓▀ ▐▓▓▌ ▓▓▌ ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌ ╒▓▓▌
▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓ ▓▀ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▄ ▓▓▌
▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄ ▓▓ ▓▓▌ ▐▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▌ ▐▓▓▐▓▓
^█▓▓▓ ▀▓▓▄ ▐▓▓▌ ▓▓▓▓▄▓▓▓▓ ▐▓▓ ▓▓▓ ▓▓▓ ▓▓▓▄ ▓▓▓▓`
'▀▓▓▓▄ ^▓▓▓ ▓▓▓ └▀▀▀▀ ▀▀ ^▀▀ `▀▀ `▀▀ '▀▀ ▐▓▓▌
▀▀▀▀▓▄▄▄ ▓▓▓▓▓▓, ▓▓▓▓▀
`▀█▓▓▓▓▓▓▓▓▓▌
¬`▀▀▀█▓
Version information:
ml-agents: 0.19.0.dev0,
ml-agents-envs: 0.19.0.dev0,
Communicator API: 1.0.0,
TensorFlow: 2.2.0
2020-07-23 18:20:25.646658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-07-23 18:20:27 INFO [environment.py:199] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
2020-07-23 18:20:29 INFO [environment.py:108] Connected to Unity environment with package version 1.2.0-preview and communication version 1.0.0
2020-07-23 18:20:29 INFO [environment.py:265] Connected new brain:
3DBall?team=0
2020-07-23 18:20:29.467836: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-23 18:20:29.479564: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2c24bb370f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-23 18:20:29.485485: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-07-23 18:20:29.490874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-23 18:20:29.654236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 238.66GiB/s
2020-07-23 18:20:29.662758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:20:29.671851: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:20:29.680138: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:20:29.686172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:20:29.694907: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:20:29.702204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:20:29.711580: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:20:29.720371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 18:20:29 WARNING [stats.py:239] Could not write Hyperparameters summary for Tensorboard: {'trainer_type': 'ppo', 'hyperparameters': {'batch_size': 64, 'buffer_size': 12000, 'learning_rate': 0.0003, 'beta': 0.001, 'epsilon': 0.2, 'lambd': 0.99, 'num_epoch': 3, 'learning_rate_schedule': 'linear'}, 'network_settings': {'normalize': True, 'hidden_units': 128, 'num_layers': 2, 'vis_encode_type': 'simple', 'memory': None}, 'reward_signals': {'extrinsic': {'gamma': 0.99, 'strength': 1.0}}, 'init_path': None, 'keep_checkpoints': 5, 'checkpoint_interval': 500000, 'max_steps': 500000, 'time_horizon': 1000, 'summary_freq': 12000, 'threaded': True, 'self_play': None, 'behavioral_cloning': None}
2020-07-23 18:20:29 INFO [stats.py:131] Hyperparameters for behavior name 3DBall:
trainer_type: ppo
hyperparameters:
batch_size: 64
buffer_size: 12000
learning_rate: 0.0003
beta: 0.001
epsilon: 0.2
lambd: 0.99
num_epoch: 3
learning_rate_schedule: linear
network_settings:
normalize: True
hidden_units: 128
num_layers: 2
vis_encode_type: simple
memory: None
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
init_path: None
keep_checkpoints: 5
checkpoint_interval: 500000
max_steps: 500000
time_horizon: 1000
summary_freq: 12000
threaded: True
self_play: None
behavioral_cloning: None
2020-07-23 18:20:29.729412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 238.66GiB/s
2020-07-23 18:20:29.737810: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:20:29.742400: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:20:29.746056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:20:29.750547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:20:29.754533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:20:29.759745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:20:29.763789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:20:29.771611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
Traceback (most recent call last):
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 177, in start_learning
self._reset_env(env_manager)
File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 115, in _reset_env
self._register_new_behaviors(env_manager, env_manager.first_step_infos)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 271, in _register_new_behaviors
self._create_trainers_and_managers(env_manager, new_behavior_ids)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 169, in _create_trainers_and_managers
self._create_trainer_and_manager(env_manager, behavior_id)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 143, in _create_trainer_and_manager
parsed_behavior_id, env_manager.training_behaviors[name_behavior_id]
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\ppo\trainer.py", line 205, in create_policy
create_tf_graph=False, # We will create the TF graph in the Optimizer
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\policy\tf_policy.py", line 89, in __init__
config=tf_utils.generate_session_config(), graph=self.graph
File "d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\client\session.py", line 1586, in __init__
super(Session, self).__init__(target, graph, config=config)
File "d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\client\session.py", line 701, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\ML-Agents\mlagents-env\Scripts\mlagents-learn-script.py", line 33, in <module>
sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 284, in main
run_cli(parse_command_line())
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 280, in run_cli
run_training(run_seed, options)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 159, in run_training
tc.start_learning(env_manager)
File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 204, in start_learning
self._save_models()
File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
return func(*args, **kwargs)
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 75, in _save_models
self.trainers[brain_name].save_model()
File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer\rl_trainer.py", line 133, in save_model
policy = list(self.policies.values())[0]
IndexError: list index out of range
Okay, I think the two issues are related. In the try/catch that was fixed on master, the culprit was probably the call to generate_session_config(). You can see that the method fails this time as well:
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
This means it is a CUDA issue. The installation of ml-agents does not do any GPU setup, so my guess is that you have a CUDA configuration somewhere that causes this issue.
You can try setting export CUDA_VISIBLE_DEVICES=-1 in the terminal to force TensorFlow to use the CPU, or you can try reinstalling CUDA. Have you used CUDA in other projects?
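As a side note on the CPU workaround: export is shell syntax for bash-like terminals, and the prompts in this thread are from Windows, where cmd.exe uses set CUDA_VISIBLE_DEVICES=-1 instead. The same effect can be sketched from Python, which avoids the shell difference. This only helps in a script you launch yourself (it does not change how mlagents-learn is started), and the TensorFlow import is guarded because it may not be installed.

```python
import os

# Hide every CUDA device. This has to happen before TensorFlow is imported,
# because the variable is read when the CUDA runtime initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow is not installed in this environment

if tf is not None:
    # Lists the GPUs TensorFlow can still see; with the variable set to -1
    # the list should be empty, so sessions fall back to the CPU.
    print(tf.config.list_physical_devices("GPU"))
```

If training works with the GPUs hidden, that points at the driver or CUDA installation rather than at ML-Agents itself.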
I just noticed the CUDA error! Yes, it was a driver issue. I updated the GPU driver to the latest version and training now runs successfully!
Thank you so much for the help Vincent!
Describe the bug I am using mlagents 0.18.0. After I set everything up and started training with the provided example environments, it keeps giving me a KeyError from the trainer and an AttributeError from TensorFlow. I used the same setup on the same desktop about 2 days ago and everything worked well, but it does not work on my current device.
To Reproduce Steps to reproduce the behavior:
Console logs / stack traces
Environment (please complete the following information):