germain-hug / Deep-RL-Keras

Keras Implementation of popular Deep RL Algorithms (A3C, DDQN, DDPG, Dueling DDQN)

ResourceExhaustedError #12

Open OversightAI opened 5 years ago

OversightAI commented 5 years ago

Is there a workaround for the ResourceExhaustedError?

This is what happens when I run main.py with a custom env:

Traceback (most recent call last):
  File "main.py", line 125, in <module>
    main()
  File "main.py", line 103, in main
    stats = algo.train(env, args, summary_writer)
  File "[...]\Deep-RL-Keras\A2C\a2c.py", line 100, in train
    self.train_models(states, actions, rewards, done)
  File "[...]\Deep-RL-Keras\A2C\a2c.py", line 67, in train_models
    self.c_opt([states, discounted_rewards])
  File "[...]\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "[...]\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "[...]\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "[...]\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[177581,177581] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[{{node sub_17}} = Sub[T=DT_FLOAT, _class=["loc:@gradients_1/sub_17_grad/Reshape_1"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Placeholder_2_0_1, dense_6/BiasAdd)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
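
For reference, this is roughly what that hint translates to in plain TF1 (an untested sketch; the training here goes through Keras backend functions, so I am not sure how to wire it into those calls):

import tensorflow as tf

# Sketch only: ask TensorFlow to report which tensors were allocated when an
# OOM occurs. train_op / feeds stand in for whatever is actually being run.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # sess.run(train_op, feed_dict=feeds, options=run_options)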
germain-hug commented 5 years ago

Hi, I am not familiar with this error, but it does seem like you are dealing with very large tensors ([177581, 177581]). Have you tried narrowing down where this tensor comes from? Playing with the batch size and the input size should also help.
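
As a rough illustration (an untested sketch, assuming the crash happens in the critic update of A2C/a2c.py), you could print the shapes just before the call that fails:

import numpy as np

# Hypothetical debugging snippet placed right before the crashing call in
# train_models(); the variable names mirror the traceback.
states = np.asarray(states)
discounted_rewards = np.asarray(discounted_rewards)
print("states shape:", states.shape)                          # e.g. (N, state_dim)
print("discounted_rewards shape:", discounted_rewards.shape)  # e.g. (N,) or (N, 1)

self.c_opt([states, discounted_rewards])

If the critic output has shape (N, 1) while the discounted rewards are fed with shape (N,), a subtraction between the two broadcasts to (N, N), which would explain an OOM on a [177581, 177581] tensor. In that case, making both sides the same rank (for example flattening the critic output, or reshaping the rewards to (N, 1) and adjusting the placeholder accordingly) should remove it. This is only a guess based on the sub_17 node in the trace.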

OversightAI commented 5 years ago

Hi, thanks for the response. What do you mean by input size? I tried using a lower batch size, but the error still occurs. I will have a closer look at the project later on.

germain-hug commented 5 years ago

Hi, apologies for the late reply! It seems like the size of your environment / state is very large and causes the network to produce some very large tensors at some point. You could try checking where this large tensor comes from, and optionally changing some network parameters.
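
For example (a rough, untested sketch, assuming a gym-style custom env with image-like observations; the class name and factor are made up for illustration), you could shrink the observations before they reach the network:

import gym
import numpy as np

class DownsampleObservation(gym.ObservationWrapper):
    # Hypothetical wrapper that keeps every `factor`-th element along the
    # first two axes of the observation, reducing the input size fed to the
    # network. Adapt to whatever your custom environment actually returns.
    def __init__(self, env, factor=4):
        super(DownsampleObservation, self).__init__(env)
        self.factor = factor
        low = self.observation(env.observation_space.low)
        high = self.observation(env.observation_space.high)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        return np.asarray(obs, dtype=np.float32)[::self.factor, ::self.factor]

# Usage (MyCustomEnv is a placeholder for your environment):
# env = DownsampleObservation(MyCustomEnv(), factor=4)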