Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents
Other
16.93k stars 4.14k forks source link

no-graphics and mutli-gpu together causes model saving failure #2563

Closed sterlingcrispin closed 5 years ago

sterlingcrispin commented 5 years ago

Describe the bug I know this seems weird because the whole advantage of flagging -multi-gpu would be to gain performance on visual observations but I noticed if --no-graphics and --multi-gpu are turned on the model never saves

To Reproduce Steps to reproduce the behavior (tested on windows):

  1. build an environment
  2. run training with --no-graphics --multi-gpu
  3. let some steps go by and interrupt with ctrl-c
  4. failure to save model

Screenshots

`(ml-agents-0.9) C:\code\Unity projects\machineLearning\SpotWalking-0.5.0>mlagents-learn config/trainer_config.yaml --env="envs/CrawlerTest/Unity Environment.exe" --run-id=CrawlTest6 --no-graphics --train --multi-gpu

                    ▄▄▄▓▓▓▓
               ╓▓▓▓▓▓▓█▓▓▓▓▓
          ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
        ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
      ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
    ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
    ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
      ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
        '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
           ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
               `▀█▓▓▓▓▓▓▓▓▓▌
                    ¬`▀▀▀█▓

INFO:mlagents.trainers:{'--base-port': '5005', '--curriculum': 'None', '--debug': False, '--docker-target-name': 'None', '--env': 'envs/CrawlerTest/Unity Environment.exe', '--help': False, '--keep-checkpoints': '5', '--lesson': '0', '--load': False, '--multi-gpu': True, '--no-graphics': True, '--num-envs': '1', '--num-runs': '1', '--run-id': 'CrawlTest6', '--sampler': 'None', '--save-freq': '50000', '--seed': '-1', '--slow': False, '--train': True, '': 'config/trainer_config.yaml'} INFO:mlagents.envs: 'Academy' started successfully! Unity Academy name: Academy Number of Brains: 1 Number of Training Brains : 1 Reset Parameters :

Unity brain name: CrawlerDynamicLearning Number of Visual Observations (per agent): 0 Vector Observation space size (per agent): 126 Number of stacked Vector Observation: 1 Vector Action space type: continuous Vector Action space size (per agent): [20] Vector Action descriptions: , , , , , , , , , , , , , , , , , , , 2019-09-15 15:57:45.321659: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 2019-09-15 15:57:46.068788: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:09:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 15:57:46.290381: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:43:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 15:57:46.518177: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:44:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 15:57:46.529870: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 15:57:48.343105: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 15:57:48.348055: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 15:57:48.352044: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 15:57:48.357770: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 15:57:48.362298: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 15:57:48.368410: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 15:57:48.762680: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 15:57:49.147679: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) 2019-09-15 15:57:49.549166: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 15:57:49.554490: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 15:57:49.561599: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 15:57:49.566284: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 15:57:49.572102: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 15:57:49.576768: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 15:57:49.583012: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 15:57:49.596411: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 15:57:49.610391: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) 2019-09-15 15:57:49.631881: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 15:57:49.636746: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 15:57:49.643708: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 15:57:49.648168: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 15:57:49.654113: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 15:57:49.659667: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 15:57:49.664312: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 15:57:49.676243: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 15:57:49.686494: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) INFO:mlagents.envs:Hyperparameters for the PPOTrainer of brain CrawlerDynamicLearning: trainer: ppo batch_size: 2024 beta: 0.005 buffer_size: 20240 epsilon: 0.2 hidden_units: 512 lambd: 0.95 learning_rate: 0.0003 max_steps: 1e6 memory_size: 256 normalize: True num_epoch: 3 num_layers: 3 time_horizon: 1000 sequence_length: 64 summary_freq: 3000 use_recurrent: False vis_encode_type: simple reward_signals: extrinsic: strength: 1.0 gamma: 0.995 summary_path: ./summaries/CrawlTest6_CrawlerDynamicLearning model_path: ./models/CrawlTest6-0/CrawlerDynamicLearning keep_checkpoints: 5 2019-09-15 15:57:53.514177: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 15:57:53.519387: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 15:57:53.526092: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 15:57:53.531494: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 15:57:53.535944: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 15:57:53.541002: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 15:57:53.545060: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 15:57:53.555915: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 15:57:53.566523: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) INFO:mlagents.trainers: CrawlTest6: CrawlerDynamicLearning: Step: 3000. Time Elapsed: 45.407 s Mean Reward: 11.605. Std of Reward: 22.726. Training. INFO:mlagents.trainers: CrawlTest6: CrawlerDynamicLearning: Step: 6000. Time Elapsed: 89.113 s Mean Reward: 12.040. Std of Reward: 28.580. Training. INFO:mlagents.trainers: CrawlTest6: CrawlerDynamicLearning: Step: 9000. Time Elapsed: 133.571 s Mean Reward: 17.884. Std of Reward: 32.556. Training. INFO:mlagents.trainers: CrawlTest6: CrawlerDynamicLearning: Step: 12000. Time Elapsed: 177.089 s Mean Reward: 15.658. Std of Reward: 35.743. Training. UnityEnvironment worker: environment stopping. INFO:mlagents.envs:Learning was interrupted. Please wait while the graph is generated. INFO:mlagents.envs:Saved Model INFO:mlagents.trainers:List of nodes to export for brain :CrawlerDynamicLearning You need to supply the name of a node to --output_node_names. Converting ./models/CrawlTest6-0/CrawlerDynamicLearning/frozen_graph_def.pb to ./models/CrawlTest6-0/CrawlerDynamicLearning.nn Traceback (most recent call last): File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "C:\Users\admin\AppData\Local\conda\conda\envs\ml-agents-0.9\Scripts\mlagents-learn.exe__main__.py", line 9, in File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\learn.py", line 337, in main run_training(0, run_seed, options, Queue()) File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\learn.py", line 132, in run_training tc.start_learning(env) File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer_controller.py", line 219, in start_learning self._export_graph() File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer_controller.py", line 129, in _export_graph self.trainers[brain_name].export_model() File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer.py", line 152, in export_model self.policy.export_model() File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\tf_policy.py", line 234, in export_model tf2bc.convert(self.model_path + "/frozen_graph_def.pb", self.model_path + ".nn") File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\tensorflow_to_barracuda.py", line 1537, in convert f = open(source_file, "rb") FileNotFoundError: [Errno 2] No such file or directory: './models/CrawlTest6-0/CrawlerDynamicLearning/frozen_graph_def.pb' `

Environment (please complete the following information):

sterlingcrispin commented 5 years ago

Actually it seems like multi-gpu is failing on my system closing this to open a clearer issue

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.