KeyError and AttributeError using MLAgents

qiwu57kevin commented 4 years ago

Describe the bug I am using mlagents 0.18.0. While I setup everything and started training with provided example environments, it keeps giving me KeyError from trainer and AttributeError from tensorflow. I used the same setup from the same desktop about 2 days and everything works well, but it couldn't work in my current device.

To Reproduce Steps to reproduce the behavior:

I am using Unity editor to train 3DBall example environment. The shell script is: _mlagents-learn ppo/3DBall.yaml --run-id=3DBalltest It is started from a virtual python environment.

The config file I used is below, which is included in the git repo:

behaviors:
3DBall:
trainer_type: ppo
hyperparameters:
  batch_size: 64
  buffer_size: 12000
  learning_rate: 0.0003
  beta: 0.001
  epsilon: 0.2
  lambd: 0.99
  num_epoch: 3
  learning_rate_schedule: linear
network_settings:
  normalize: true
  hidden_units: 128
  num_layers: 2
  vis_encode_type: simple
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
keep_checkpoints: 5
max_steps: 500000
time_horizon: 1000
summary_freq: 12000
threaded: true

Console logs / stack traces

(mlagents-env) D:\ML-Agents\ml-agents\config>mlagents-learn ppo/3DBall.yaml --run-id=3DBall_test --force
2020-07-21 18:20:45.848482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term

                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

 Version information:
  ml-agents: 0.18.0,
  ml-agents-envs: 0.18.0,
  Communicator API: 1.0.0,
  TensorFlow: 2.2.0
2020-07-21 18:20:48.950830: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-07-21 18:20:50 INFO [environment.py:199] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
2020-07-21 18:20:53 INFO [environment.py:108] Connected to Unity environment with package version 1.0.3 and communication version 1.0.0
2020-07-21 18:20:53 INFO [environment.py:265] Connected new brain:
3DBall?team=0
2020-07-21 18:20:53.860642: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-21 18:20:53.872630: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2483e493c30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-21 18:20:53.878502: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-21 18:20:53.996742: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-21 18:20:54.180968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 238.66GiB/s
2020-07-21 18:20:54.188751: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-21 18:20:54.225479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-21 18:20:54.244696: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-21 18:20:54.263115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-21 18:20:54.288772: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-21 18:20:54.310554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-21 18:20:54.529632: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-21 18:20:54.537809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-21 18:20:54 WARNING [stats.py:235] Could not write text summary for Tensorboard.
2020-07-21 18:20:54 INFO [trainer_controller.py:76] Saved Model
Traceback (most recent call last):
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 130, in _create_trainer_and_manager
    trainer = self.trainers[brain_name]
KeyError: '3DBall'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\ML-Agents\mlagents-env\Scripts\mlagents-learn-script.py", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 283, in main
    run_cli(parse_command_line())
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 279, in run_cli
    run_training(run_seed, options)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 158, in run_training
    tc.start_learning(env_manager)
  File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 181, in start_learning
    self._create_trainers_and_managers(env_manager, new_behavior_ids)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 168, in _create_trainers_and_managers
    self._create_trainer_and_manager(env_manager, behavior_id)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 132, in _create_trainer_and_manager
    trainer = self.trainer_factory.generate(brain_name)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_util.py", line 52, in generate
    self.multi_gpu,
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_util.py", line 101, in initialize_trainer
    trainer_artifact_path,
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\ppo\trainer.py", line 48, in __init__
    brain_name, trainer_settings, training, artifact_path, reward_buff_cap
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer\rl_trainer.py", line 38, in __init__
    StatsPropertyType.HYPERPARAMETERS, self.trainer_settings.as_dict()
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\stats.py", line 321, in add_property
    writer.add_property(self.category, property_type, value)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\stats.py", line 216, in add_property
    self.summary_writers[category].add_summary(text, 0)
  File "d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\summary\writer\writer.py", line 127, in add_summary
    for value in summary.value:
AttributeError: 'str' object has no attribute 'value'

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

Unity Version: Unity 2019.4.4f1
OS + version: Windows 10
ML-Agents version: v0.18.0
TensorFlow version: 2.2.0
Environment: 3DBall (actually every env)

NOTE: We are unable to help reproduce bugs with custom environments. Please attempt to reproduce your issue with one of the example environments, or provide a minimal patch to one of the environments needed to reproduce the issue.

vincentpierre commented 4 years ago

Hi, It would help greatly if I had a way to reproduce this error. Please fill out the To Reproduce section of the template. The error seems to come from the tensorboard summary writer trying to write something not allowed when writing the hyperparameters. Can you provide the config yaml you used?

qiwu57kevin commented 4 years ago

I have updated and attached the config I used.

vincentpierre commented 4 years ago

I am unable to reproduce the bug. The steps to reproduce seem to be about running 3DBall in a new virtual environment with the provided config (which works for me). The error you are getting (I am guessing) is due the summary writers being unable to deserialize the hyperparameters. I have never seen this error, and I need more information. From the error, it seems there is a "Gesture" behavior in the Unity scene, but it is not mentioned in the steps to reproduce. The fact that you had it work before and now it does not, indicates that you might have something gone wrong when installing. I would recommend installing from scratch again or make sure no files have been modified (git status) Can you provide more details to help me reproduce the error?

qiwu57kevin commented 4 years ago

The "Gesture" behavior is my own training behavior. I thought initially the error comes from my own environment, but later when I tested it on the example environment, the same thing happened. I have updated the error message generated by the 3DBall behavior. I will try to reinstall everything from scratch and see what happens.

vincentpierre commented 4 years ago

I think it would help us if you told us what version of Python you are using in your virtual environment and the sha of the git commit you are on. The demo environments all work on my machine, so the error is probably an installation issue.

qiwu57kevin commented 4 years ago

The python version I am using for my virtual environment is 3.7.6. I installed mlagents 0.18.0 and mlagents-envs 0.18.0 using pip install mlagents. The pip version is 20.1.1. I didn't clone the git repo and install from there. I followed every step as instructed but it just couldn't get me work. I have reinstalled for a few times and the same thing happens. Do you think this can be a problem with python version?

vincentpierre commented 4 years ago

I tried to do a fresh install with Python 3.7.6, pip 20.1.1 and installed pip3 mlagents==0.18.0. I was able to train 3DBall without problem. Are you sure you used a clean new virtual environment ? can you post the result of the command pip3 freeze into this issue?

qiwu57kevin commented 4 years ago

Here is what I get from pip3 freeze:

(mlagents-env) D:\ML-Agents\mlagents-env\Scripts>pip3 freeze
WARNING: Could not generate requirement for distribution -ip 19.2.3 (d:\ml-agents\mlagents-env\lib\site-packages): Parse error at "'-ip==19.'": Expected W:(abcd...)
absl-py==0.9.0
astunparse==1.6.3
attrs==19.3.0
cachetools==4.1.1
cattrs==1.0.0
certifi==2020.6.20
chardet==3.0.4
cloudpickle==1.5.0
gast==0.3.3
google-auth==1.19.2
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.30.0
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
Keras-Preprocessing==1.1.2
Markdown==3.2.2
mlagents==0.18.0
mlagents-envs==0.18.0
numpy==1.19.1
oauthlib==3.1.0
opt-einsum==3.3.0
pi==0.1.2
Pillow==7.2.0
protobuf==3.12.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pypiwin32==223
pywin32==228
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.4.1
six==1.15.0
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow==2.2.0
tensorflow-estimator==2.2.0
termcolor==1.1.0
urllib3==1.25.9
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0

vincentpierre commented 4 years ago

Okay, I got the same pip configuration as you but It runs for me. On the bright side, I think I got something. The fact that you got the message

2020-07-21 18:20:54 WARNING [stats.py:235] Could not write text summary for Tensorboard.

means that the try/catch here failed and the string that was returned was a "". I have no idea why you have an error there and I don't BUT any error in this try / catch would give the error you are seeing. I will try to make some fixes but without knowing what the original error is, I am not sure I can do much. If you could try to do the git installation by cloning the repo and reproduce the error, I can guide you to make some debug statements that will help.

qiwu57kevin commented 4 years ago

I have tried to install mlagents and mlagents-env by cloning the repo. The pip3 freeze message including the commit sha is shown below:

(mlagents-env) D:\ML-Agents\ml-agents\ml-agents>pip3 freeze
WARNING: Could not generate requirement for distribution -ip 19.2.3 (d:\ml-agents\mlagents-env\lib\site-packages): Parse error at "'-ip==19.'": Expected W:(abcd...)
absl-py==0.9.0
astunparse==1.6.3
attrs==19.3.0
cachetools==4.1.1
cattrs==1.0.0
certifi==2020.6.20
chardet==3.0.4
cloudpickle==1.5.0
gast==0.3.3
google-auth==1.19.2
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.30.0
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
Keras-Preprocessing==1.1.2
Markdown==3.2.2
-e git+https://github.com/Unity-Technologies/ml-agents.git@8327ddcb2a65ccb0a76ce6390811212d1daebb6e#egg=mlagents&subdirectory=ml-agents
-e git+https://github.com/Unity-Technologies/ml-agents.git@8327ddcb2a65ccb0a76ce6390811212d1daebb6e#egg=mlagents_envs&subdirectory=ml-agents-envs
numpy==1.19.1
oauthlib==3.1.0
opt-einsum==3.3.0
pi==0.1.2
Pillow==7.2.0
protobuf==3.12.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pypiwin32==223
pywin32==228
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.4.1
six==1.15.0
tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow==2.2.0
tensorflow-estimator==2.2.0
termcolor==1.1.0
urllib3==1.25.9
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0

And the error still exists, the message is:

(mlagents-env) D:\ML-Agents\ml-agents\config>mlagents-learn ppo\3DBall.yaml --run-id=3DBall_testrun
2020-07-23 16:30:43.316650: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term

                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

 Version information:
  ml-agents: 0.18.0,
  ml-agents-envs: 0.18.0,
  Communicator API: 1.0.0,
  TensorFlow: 2.2.0
2020-07-23 16:30:45.848914: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-07-23 16:30:47 INFO [environment.py:199] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
2020-07-23 16:31:09 INFO [environment.py:108] Connected to Unity environment with package version 1.0.3 and communication version 1.0.0
2020-07-23 16:31:10 INFO [environment.py:265] Connected new brain:
3DBall?team=0
2020-07-23 16:31:10.033423: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-23 16:31:10.046351: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2c74563a0a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-23 16:31:10.054787: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-23 16:31:10.059908: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-23 16:31:10.228271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 238.66GiB/s
2020-07-23 16:31:10.237983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 16:31:10.247545: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 16:31:10.256215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 16:31:10.261961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 16:31:10.271281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 16:31:10.278897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 16:31:10.290463: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 16:31:10.301647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 16:31:10 WARNING [stats.py:235] Could not write text summary for Tensorboard.
2020-07-23 16:31:10 INFO [trainer_controller.py:76] Saved Model
Traceback (most recent call last):
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 130, in _create_trainer_and_manager
    trainer = self.trainers[brain_name]
KeyError: '3DBall'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\ML-Agents\mlagents-env\Scripts\mlagents-learn-script.py", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 283, in main
    run_cli(parse_command_line())
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 279, in run_cli
    run_training(run_seed, options)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 158, in run_training
    tc.start_learning(env_manager)
  File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 181, in start_learning
    self._create_trainers_and_managers(env_manager, new_behavior_ids)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 168, in _create_trainers_and_managers
    self._create_trainer_and_manager(env_manager, behavior_id)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 132, in _create_trainer_and_manager
    trainer = self.trainer_factory.generate(brain_name)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_util.py", line 52, in generate
    self.multi_gpu,
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_util.py", line 101, in initialize_trainer
    trainer_artifact_path,
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\ppo\trainer.py", line 48, in __init__
    brain_name, trainer_settings, training, artifact_path, reward_buff_cap
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer\rl_trainer.py", line 38, in __init__
    StatsPropertyType.HYPERPARAMETERS, self.trainer_settings.as_dict()
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\stats.py", line 321, in add_property
    writer.add_property(self.category, property_type, value)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\stats.py", line 216, in add_property
    self.summary_writers[category].add_summary(text, 0)
  File "d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\summary\writer\writer.py", line 127, in add_summary
    for value in summary.value:
AttributeError: 'str' object has no attribute 'value'

vincentpierre commented 4 years ago

Ok, this was to be expected. Can you checkout master and pull? I made changes on master that should fix the issue and will print some more details logs about why this is happening. Make sure to run pip3 install -e . as the -e will allow changes made on the repo to be reflected in your packages

qiwu57kevin commented 4 years ago

I have checkout master and pull. However, during handling, another error occured:

(mlagents-env) D:\ML-Agents\ml-agents\config>mlagents-learn ppo/3DBall.yaml --run-id=3DBall_test --force
2020-07-23 18:20:23.054329: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term

                        ▄▄▄▓▓▓▓
                   ╓▓▓▓▓▓▓█▓▓▓▓▓
              ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
          ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
        ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
        ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
          ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
            '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
               ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
                   `▀█▓▓▓▓▓▓▓▓▓▌
                        ¬`▀▀▀█▓

 Version information:
  ml-agents: 0.19.0.dev0,
  ml-agents-envs: 0.19.0.dev0,
  Communicator API: 1.0.0,
  TensorFlow: 2.2.0
2020-07-23 18:20:25.646658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-07-23 18:20:27 INFO [environment.py:199] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
2020-07-23 18:20:29 INFO [environment.py:108] Connected to Unity environment with package version 1.2.0-preview and communication version 1.0.0
2020-07-23 18:20:29 INFO [environment.py:265] Connected new brain:
3DBall?team=0
2020-07-23 18:20:29.467836: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-23 18:20:29.479564: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2c24bb370f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-23 18:20:29.485485: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-23 18:20:29.490874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-23 18:20:29.654236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 238.66GiB/s
2020-07-23 18:20:29.662758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:20:29.671851: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:20:29.680138: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:20:29.686172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:20:29.694907: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:20:29.702204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:20:29.711580: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:20:29.720371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 18:20:29 WARNING [stats.py:239] Could not write Hyperparameters summary for Tensorboard: {'trainer_type': 'ppo', 'hyperparameters': {'batch_size': 64, 'buffer_size': 12000, 'learning_rate': 0.0003, 'beta': 0.001, 'epsilon': 0.2, 'lambd': 0.99, 'num_epoch': 3, 'learning_rate_schedule': 'linear'}, 'network_settings': {'normalize': True, 'hidden_units': 128, 'num_layers': 2, 'vis_encode_type': 'simple', 'memory': None}, 'reward_signals': {'extrinsic': {'gamma': 0.99, 'strength': 1.0}}, 'init_path': None, 'keep_checkpoints': 5, 'checkpoint_interval': 500000, 'max_steps': 500000, 'time_horizon': 1000, 'summary_freq': 12000, 'threaded': True, 'self_play': None, 'behavioral_cloning': None}
2020-07-23 18:20:29 INFO [stats.py:131] Hyperparameters for behavior name 3DBall:
        trainer_type:   ppo
        hyperparameters:
          batch_size:   64
          buffer_size:  12000
          learning_rate:        0.0003
          beta: 0.001
          epsilon:      0.2
          lambd:        0.99
          num_epoch:    3
          learning_rate_schedule:       linear
        network_settings:
          normalize:    True
          hidden_units: 128
          num_layers:   2
          vis_encode_type:      simple
          memory:       None
        reward_signals:
          extrinsic:
            gamma:      0.99
            strength:   1.0
        init_path:      None
        keep_checkpoints:       5
        checkpoint_interval:    500000
        max_steps:      500000
        time_horizon:   1000
        summary_freq:   12000
        threaded:       True
        self_play:      None
        behavioral_cloning:     None
2020-07-23 18:20:29.729412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 238.66GiB/s
2020-07-23 18:20:29.737810: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:20:29.742400: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:20:29.746056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:20:29.750547: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:20:29.754533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:20:29.759745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:20:29.763789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:20:29.771611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 177, in start_learning
    self._reset_env(env_manager)
  File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 115, in _reset_env
    self._register_new_behaviors(env_manager, env_manager.first_step_infos)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 271, in _register_new_behaviors
    self._create_trainers_and_managers(env_manager, new_behavior_ids)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 169, in _create_trainers_and_managers
    self._create_trainer_and_manager(env_manager, behavior_id)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 143, in _create_trainer_and_manager
    parsed_behavior_id, env_manager.training_behaviors[name_behavior_id]
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\ppo\trainer.py", line 205, in create_policy
    create_tf_graph=False,  # We will create the TF graph in the Optimizer
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\policy\tf_policy.py", line 89, in __init__
    config=tf_utils.generate_session_config(), graph=self.graph
  File "d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\client\session.py", line 1586, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "d:\ml-agents\mlagents-env\lib\site-packages\tensorflow\python\client\session.py", line 701, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\ML-Agents\mlagents-env\Scripts\mlagents-learn-script.py", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 284, in main
    run_cli(parse_command_line())
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 280, in run_cli
    run_training(run_seed, options)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\learn.py", line 159, in run_training
    tc.start_learning(env_manager)
  File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 204, in start_learning
    self._save_models()
  File "d:\ml-agents\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 75, in _save_models
    self.trainers[brain_name].save_model()
  File "d:\ml-agents\ml-agents\ml-agents\mlagents\trainers\trainer\rl_trainer.py", line 133, in save_model
    policy = list(self.policies.values())[0]
IndexError: list index out of range

vincentpierre commented 4 years ago

Okay, I think the two issues are related. In the try/catch that was fixed on master, the culprit was probably the call to generate_session_config(). You can see the method fails as well this time :

tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Which means this is a CUDA issue. The installation of ml-agents does not do any GPU setup, so my guess is that you have a CUDA configuration somewhere that causes this issue. You can try setting export CUDA_VISIBLE_DEVICES=-1 in the terminal to force TensorFlow to use the CPU or you can try to reinstall CUDA. Have you used CUDA in other projects ?

qiwu57kevin commented 4 years ago

I just noticed the CUDA error! Yes, it is the driver issue. I have updated the GPU driver to the latest and it can now get trained successfully!

Thank you so much for the help Vincent!

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Unity-Technologies / ml-agents

KeyError and AttributeError using MLAgents #4250