Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents
Other
17.19k stars 4.16k forks source link

--multi-gpu causing failure to save model #2564

Closed sterlingcrispin closed 5 years ago

sterlingcrispin commented 5 years ago

Describe the bug Training with --multi-gpu causes a failure to save the model

To Reproduce Steps to reproduce the behavior:

  1. Build the crawler dynamic environment.
  2. Train it with mlagents-learn config/trainer_config.yaml --env="envs/CrawlerTest/UnityEnvironment.exe" --run-id=CrawlTest9 --multi-gpu --train
  3. Interrupt training with Ctrl+C

Screenshots `(ml-agents-0.9) C:\code\ml-agents-0.9.3>mlagents-learn config/trainer_config.yaml --env="envs/CrawlerTest/UnityEnvironment.exe" --run-id=CrawlTest9 --multi-gpu --train

                    ▄▄▄▓▓▓▓
               ╓▓▓▓▓▓▓█▓▓▓▓▓
          ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
        ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
      ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
    ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
    ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
      ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
        '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
           ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
               `▀█▓▓▓▓▓▓▓▓▓▌
                    ¬`▀▀▀█▓

INFO:mlagents.trainers:{'--base-port': '5005', '--curriculum': 'None', '--debug': False, '--docker-target-name': 'None', '--env': 'envs/CrawlerTest/UnityEnvironment.exe', '--help': False, '--keep-checkpoints': '5', '--lesson': '0', '--load': False, '--multi-gpu': True, '--no-graphics': False, '--num-envs': '1', '--num-runs': '1', '--run-id': 'CrawlTest9', '--sampler': 'None', '--save-freq': '50000', '--seed': '-1', '--slow': False, '--train': True, '': 'config/trainer_config.yaml'} INFO:mlagents.envs: 'Academy' started successfully! Unity Academy name: Academy Number of Brains: 1 Number of Training Brains : 1 Reset Parameters :

Unity brain name: CrawlerDynamicLearning Number of Visual Observations (per agent): 0 Vector Observation space size (per agent): 126 Number of stacked Vector Observation: 1 Vector Action space type: continuous Vector Action space size (per agent): [20] Vector Action descriptions: , , , , , , , , , , , , , , , , , , , 2019-09-15 16:52:27.695371: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 2019-09-15 16:52:28.424136: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:09:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 16:52:28.656036: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:43:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 16:52:28.886639: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:44:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 16:52:28.899020: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:30.682325: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:30.687212: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:30.692512: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:30.696421: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:30.700478: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:30.706373: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:31.116157: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:31.523711: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) 2019-09-15 16:52:31.932025: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:31.937238: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:31.944421: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:31.948231: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:31.953732: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:31.957566: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:31.963048: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:31.975327: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:31.986450: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) 2019-09-15 16:52:32.005864: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:32.010303: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:32.016621: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:32.021502: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:32.025453: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:32.030767: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:32.034855: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:32.045949: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:32.056889: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) INFO:mlagents.envs:Hyperparameters for the PPOTrainer of brain CrawlerDynamicLearning: trainer: ppo batch_size: 2024 beta: 0.005 buffer_size: 20240 epsilon: 0.2 hidden_units: 512 lambd: 0.95 learning_rate: 0.0003 max_steps: 1e6 memory_size: 256 normalize: True num_epoch: 3 num_layers: 3 time_horizon: 1000 sequence_length: 64 summary_freq: 3000 use_recurrent: False vis_encode_type: simple reward_signals: extrinsic: strength: 1.0 gamma: 0.995 summary_path: ./summaries/CrawlTest9_CrawlerDynamicLearning model_path: ./models/CrawlTest9-0/CrawlerDynamicLearning keep_checkpoints: 5 2019-09-15 16:52:35.868470: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:35.875198: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:35.882435: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:35.888822: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:35.893658: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:35.899985: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:35.904712: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:35.918153: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:35.929854: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) INFO:mlagents.trainers: CrawlTest9: CrawlerDynamicLearning: Step: 3000. Time Elapsed: 42.024 s Mean Reward: 2.771. Std of Reward: 25.870. Training. INFO:mlagents.trainers: CrawlTest9: CrawlerDynamicLearning: Step: 6000. Time Elapsed: 88.588 s Mean Reward: -3.250. Std of Reward: 22.548. Training. UnityEnvironment worker: environment stopping. INFO:mlagents.envs:Learning was interrupted. Please wait while the graph is generated. INFO:mlagents.envs:Saved Model INFO:mlagents.trainers:List of nodes to export for brain :CrawlerDynamicLearning You need to supply the name of a node to --output_node_names. Converting ./models/CrawlTest9-0/CrawlerDynamicLearning/frozen_graph_def.pb to ./models/CrawlTest9-0/CrawlerDynamicLearning.nn Traceback (most recent call last): File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "C:\Users\admin\AppData\Local\conda\conda\envs\ml-agents-0.9\Scripts\mlagents-learn.exe__main__.py", line 9, in File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\learn.py", line 337, in main run_training(0, run_seed, options, Queue()) File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\learn.py", line 132, in run_training tc.start_learning(env) File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer_controller.py", line 219, in start_learning self._export_graph() File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer_controller.py", line 129, in _export_graph self.trainers[brain_name].export_model() File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer.py", line 152, in export_model self.policy.export_model() File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\tf_policy.py", line 234, in export_model tf2bc.convert(self.model_path + "/frozen_graph_def.pb", self.model_path + ".nn") File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\tensorflow_to_barracuda.py", line 1537, in convert f = open(source_file, "rb") FileNotFoundError: [Errno 2] No such file or directory: './models/CrawlTest9-0/CrawlerDynamicLearning/frozen_graph_def.pb'`

Environment (please complete the following information):

NOTE: We are unable to help reproduce bugs with custom environments. Please attempt to reproduce your issue with one of the example environments, or provide a minimal patch to one of the environments needed to reproduce the issue.

awjuliani commented 5 years ago

Thanks for bringing this issue to our attention. @harper-u3d do you know why this might be happening?

harperj commented 5 years ago

@awjuliani I am not familiar with this problem. This seems like the key issue:

You need to supply the name of a node to --output_node_names.

harperj commented 5 years ago

I've created a PR for a potential fix here: https://github.com/Unity-Technologies/ml-agents/pull/2573

harperj commented 5 years ago

@sterlingcrispin this change is merged into our develop branch and should make it into the next release. Feel free to give it a try.

chriselion commented 5 years ago

Closing this issue since it's fixed on develop and in version 0.10.0

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.