ARCC-RACE / deepracer-for-dummies

A quick way to get up and running with a local DeepRacer training environment
66 stars · 28 forks

Allocator (GPU_0_bfc) ran out of memory trying to allocate 520.77MiB. #65

Open shimin-happy opened 4 years ago

shimin-happy commented 4 years ago

I have a Dell XPS 15 with a GTX 1060 GPU, and I always get this error when I train. I have reduced the batch size to 16, but that does not help. Has anybody had similar experiences?
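(For reference: the trace below goes through TensorFlow 1.x's `session.py`, so GPU memory behaviour is controlled through the session config. The snippet is only a sketch of the relevant knobs, assuming you can patch the session creation inside the training container — `allow_growth` and the memory fraction are standard TF 1.x options, not deepracer-for-dummies settings.)

```python
import tensorflow as tf  # TensorFlow 1.x, as in the traceback

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of grabbing it all up front.
config.gpu_options.allow_growth = True
# Or hard-cap the fraction of GPU memory TensorFlow may claim.
config.gpu_options.per_process_gpu_memory_fraction = 0.8

sess = tf.Session(config=config)
```

This does not shrink the model's actual footprint — an allocation that genuinely needs ~520 MiB on a card that is also driving the laptop display can still OOM — but it stops TensorFlow from pre-claiming nearly all GPU memory at startup.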

```
2020-01-16 21:16:39.457655: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 520.77MiB.  Current allocation summary follows.
2020-01-16 21:16:39.457938: W tensorflow/core/common_runtime/bfc_allocator.cc:275] **__**____**__*_____***
2020-01-16 21:16:39.457976: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at conv_ops.cc:693 : Resource exhausted: OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
{"simapp_exception": {"version": "1.0", "date": "2020-01-16 21:16:39.881604", "function": "training_worker", "message": "An error occured while training: OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{node main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 4, 4], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_level/agent/main/online/network_0/observation/Conv2d_0/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, main_level/agent/main/online/network_1/observation/Conv2d_0/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D', defined at:
  File "training_worker.py", line 252, in <module>
    main()
  File "training_worker.py", line 247, in main
    memory_backend_params=memory_backend_params
  File "training_worker.py", line 68, in training_worker
    graph_manager.create_graph(task_parameters)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/graph_manager.py", line 144, in create_graph
    self.level_managers, self.environments = self._create_graph(task_parameters)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/graph_managers/basic_rl_graph_manager.py", line 59, in _create_graph
    level_manager = LevelManager(agents=agent, environment=env, name="main_level")
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 88, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 152, in build
    [agent.set_environment_parameters(spaces) for agent in self.agents.values()]
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/level_manager.py", line 152, in <listcomp>
    [agent.set_environment_parameters(spaces) for agent in self.agents.values()]
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/agent.py", line 301, in set_environment_parameters
    self.init_environment_dependent_modules()
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/agent.py", line 346, in init_environment_dependent_modules
    self.networks = self.create_networks()
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/agents/agent.py", line 319, in create_networks
    worker_device=self.worker_device)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/network_wrapper.py", line 93, in __init__
    network_is_trainable=True)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 74, in construct
    return construct_on_device()
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 59, in construct_on_device
    return GeneralTensorFlowNetwork(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 126, in __init__
    network_is_local, network_is_trainable)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/architecture.py", line 105, in __init__
    self.get_model()
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/general_network.py", line 261, in get_model
    input_placeholder, embedding = input_embedder(self.inputs[input_name])
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/embedders/embedder.py", line 88, in __call__
    self._build_module()
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/embedders/embedder.py", line 115, in _build_module
    is_training=self.is_training)
  File "/usr/local/lib/python3.6/dist-packages/rl_coach/architectures/tensorflow_components/layers.py", line 78, in __call__
    strides=self.strides, data_format='channels_last', name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/convolutional.py", line 417, in conv2d
    return layer.apply(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 828, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 364, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 769, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 869, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 521, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 205, in __call__
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{node main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 4, 4], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_level/agent/main/online/network_0/observation/Conv2d_0/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, main_level/agent/main/online/network_1/observation/Conv2d_0/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Job failed!.", "exceptionType": "training_worker.exceptions", "eventType": "system_error", "errorCode": "503"}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1292, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3772,32,29,39] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node main_level/agent/main/online/network_1/observation/Conv2d_0/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 4, 4], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](main_level/agent/main/online/network_0/observation/Conv2d_0/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, main_level/agent/main/online/network_1/observation/Conv2d_0/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```
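A sanity check on the numbers (my own arithmetic, not taken from the log): the tensor TensorFlow failed to allocate, shape [3772, 32, 29, 39] in float32, works out to exactly the 520.77 MiB in the message. Its leading dimension is 3772, not the configured batch size, which would explain why dropping the batch size to 16 didn't help — my reading (not confirmed) is that this allocation covers the whole training buffer at once:

```python
from functools import reduce
from operator import mul

shape = [3772, 32, 29, 39]        # from the OOM message
n_elements = reduce(mul, shape)   # number of float32 values in the tensor
size_bytes = n_elements * 4       # float32 = 4 bytes per element
size_mib = size_bytes / 2**20

print(f"{size_mib:.2f} MiB")      # -> 520.77 MiB, matching the log
```

So the lever that matters here is whatever controls that leading dimension (e.g. the number of experiences collected per training iteration), rather than the minibatch size alone.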