Closed sterlingcrispin closed 5 years ago
Thanks for bringing this issue to our attention. @harper-u3d do you know why this might be happening?
@awjuliani I am not familiar with this problem. This seems like the key issue:
You need to supply the name of a node to --output_node_names.
I've created a PR for a potential fix here: https://github.com/Unity-Technologies/ml-agents/pull/2573
@sterlingcrispin this change is merged into our develop
branch and should make it into the next release. Feel free to give it a try.
Closing this issue since it's fixed on develop and in version 0.10.0
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Describe the bug Training with --multi-gpu causes a failure to save the model
To Reproduce Steps to reproduce the behavior:
mlagents-learn config/trainer_config.yaml --env="envs/CrawlerTest/UnityEnvironment.exe" --run-id=CrawlTest9 --multi-gpu --train
Screenshots `(ml-agents-0.9) C:\code\ml-agents-0.9.3>mlagents-learn config/trainer_config.yaml --env="envs/CrawlerTest/UnityEnvironment.exe" --run-id=CrawlTest9 --multi-gpu --train
INFO:mlagents.trainers:{'--base-port': '5005', '--curriculum': 'None', '--debug': False, '--docker-target-name': 'None', '--env': 'envs/CrawlerTest/UnityEnvironment.exe', '--help': False, '--keep-checkpoints': '5', '--lesson': '0', '--load': False, '--multi-gpu': True, '--no-graphics': False, '--num-envs': '1', '--num-runs': '1', '--run-id': 'CrawlTest9', '--sampler': 'None', '--save-freq': '50000', '--seed': '-1', '--slow': False, '--train': True, '': 'config/trainer_config.yaml'}
INFO:mlagents.envs:
'Academy' started successfully!
Unity Academy name: Academy
Number of Brains: 1
Number of Training Brains : 1
Reset Parameters :
Unity brain name: CrawlerDynamicLearning Number of Visual Observations (per agent): 0 Vector Observation space size (per agent): 126 Number of stacked Vector Observation: 1 Vector Action space type: continuous Vector Action space size (per agent): [20] Vector Action descriptions: , , , , , , , , , , , , , , , , , , , 2019-09-15 16:52:27.695371: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 2019-09-15 16:52:28.424136: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:09:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 16:52:28.656036: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:43:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 16:52:28.886639: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1344] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683 pciBusID: 0000:44:00.0 totalMemory: 11.00GiB freeMemory: 9.08GiB 2019-09-15 16:52:28.899020: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:30.682325: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:30.687212: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:30.692512: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:30.696421: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:30.700478: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:30.706373: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:31.116157: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:31.523711: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) 2019-09-15 16:52:31.932025: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:31.937238: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:31.944421: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:31.948231: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:31.953732: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:31.957566: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:31.963048: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:31.975327: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:31.986450: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) 2019-09-15 16:52:32.005864: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:32.010303: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:32.016621: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:32.021502: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:32.025453: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:32.030767: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:32.034855: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:32.045949: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:32.056889: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) INFO:mlagents.envs:Hyperparameters for the PPOTrainer of brain CrawlerDynamicLearning: trainer: ppo batch_size: 2024 beta: 0.005 buffer_size: 20240 epsilon: 0.2 hidden_units: 512 lambd: 0.95 learning_rate: 0.0003 max_steps: 1e6 memory_size: 256 normalize: True num_epoch: 3 num_layers: 3 time_horizon: 1000 sequence_length: 64 summary_freq: 3000 use_recurrent: False vis_encode_type: simple reward_signals: extrinsic: strength: 1.0 gamma: 0.995 summary_path: ./summaries/CrawlTest9_CrawlerDynamicLearning model_path: ./models/CrawlTest9-0/CrawlerDynamicLearning keep_checkpoints: 5 2019-09-15 16:52:35.868470: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2 2019-09-15 16:52:35.875198: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-15 16:52:35.882435: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:917] 0 1 2 2019-09-15 16:52:35.888822: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 0: N N N 2019-09-15 16:52:35.893658: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 1: N N N 2019-09-15 16:52:35.899985: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:930] 2: N N N 2019-09-15 16:52:35.904712: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8794 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-09-15 16:52:35.918153: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8794 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:43:00.0, compute capability: 6.1) 2019-09-15 16:52:35.929854: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 8794 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:44:00.0, compute capability: 6.1) INFO:mlagents.trainers: CrawlTest9: CrawlerDynamicLearning: Step: 3000. Time Elapsed: 42.024 s Mean Reward: 2.771. Std of Reward: 25.870. Training. INFO:mlagents.trainers: CrawlTest9: CrawlerDynamicLearning: Step: 6000. Time Elapsed: 88.588 s Mean Reward: -3.250. Std of Reward: 22.548. Training. UnityEnvironment worker: environment stopping. INFO:mlagents.envs:Learning was interrupted. Please wait while the graph is generated. INFO:mlagents.envs:Saved Model INFO:mlagents.trainers:List of nodes to export for brain :CrawlerDynamicLearning You need to supply the name of a node to --output_node_names. Converting ./models/CrawlTest9-0/CrawlerDynamicLearning/frozen_graph_def.pb to ./models/CrawlTest9-0/CrawlerDynamicLearning.nn Traceback (most recent call last): File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "C:\Users\admin\AppData\Local\conda\conda\envs\ml-agents-0.9\Scripts\mlagents-learn.exe__main__.py", line 9, in
File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\learn.py", line 337, in main
run_training(0, run_seed, options, Queue())
File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\learn.py", line 132, in run_training
tc.start_learning(env)
File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer_controller.py", line 219, in start_learning
self._export_graph()
File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer_controller.py", line 129, in _export_graph
self.trainers[brain_name].export_model()
File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\trainer.py", line 152, in export_model
self.policy.export_model()
File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\tf_policy.py", line 234, in export_model
tf2bc.convert(self.model_path + "/frozen_graph_def.pb", self.model_path + ".nn")
File "c:\users\admin\appdata\local\conda\conda\envs\ml-agents-0.9\lib\site-packages\mlagents\trainers\tensorflow_to_barracuda.py", line 1537, in convert
f = open(source_file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: './models/CrawlTest9-0/CrawlerDynamicLearning/frozen_graph_def.pb' `
Environment (please complete the following information):
NOTE: We are unable to help reproduce bugs with custom environments. Please attempt to reproduce your issue with one of the example environments, or provide a minimal patch to one of the environments needed to reproduce the issue.