IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Saver fails to restore agent's checkpoint #465

Closed nicolas-cerardi closed 3 years ago

nicolas-cerardi commented 3 years ago

Hi, I'm using rl coach through AWS Sagemaker, and I'm running into an issue that I struggle to understand.

I'm performing RL using AWS Sagemaker for the learning and AWS Robomaker for the environment, like DeepRacer, which also uses rl coach. In fact, on the learning side the code differs only slightly from the DeepRacer code, but the environment is completely different.

What happens:

The agent raises an exception with the message: Failed to restore agent's checkpoint: 'main_level/agent/main/online/global_step'

The traceback points to a bug happening in this rl coach module:

File "/someverylongpath/rl_coach/architectures/tensorflow_components/savers.py", line 93, in <dictcomp>
    for ph, v in zip(self._variable_placeholders, self._variables)
KeyError: 'main_level/agent/main/online/global_step'

I think this indicates that, in the function from_arrays, variables and self._variables do not contain the same set of variables...
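To illustrate the failure mode, here is a minimal, self-contained sketch (with hypothetical names, not the actual rl_coach savers.py code) of how a dict comprehension that zips the saver's own variable list against a dict of restored arrays raises exactly this KeyError when the checkpoint is missing one of the expected variables:

```python
class FakeVariable:
    """Stand-in for a TF variable; only the name matters here."""
    def __init__(self, name):
        self.name = name


def restore_feed_dict(placeholders, variables, restored_arrays):
    # restored_arrays maps variable name -> restored value.
    # If a variable expected by the graph is absent from the
    # checkpoint, the lookup below raises KeyError for that name.
    return {
        ph: restored_arrays[v.name]
        for ph, v in zip(placeholders, variables)
    }


variables = [
    FakeVariable("main_level/agent/main/online/weights"),
    FakeVariable("main_level/agent/main/online/global_step"),
]
placeholders = ["ph_weights", "ph_global_step"]

# Checkpoint written by a graph that had no global_step variable:
arrays = {"main_level/agent/main/online/weights": [0.1, 0.2]}

try:
    restore_feed_dict(placeholders, variables, arrays)
except KeyError as exc:
    print("Failed to restore agent's checkpoint:", exc)
```

Under this (assumed) reading, the error means the set of names stored in the checkpoint and the set of variables the current graph expects have diverged, e.g. because global_step was added to the graph after the checkpoint was written.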

So, there are a few things I don't understand about this problem, and since I'm not used to rl coach, I think your point of view would be valuable.

  1. How can variables and self._variables be different?
  2. Why does it fail only the second time? (Does the graph manager change the computational graph?)

A few more details:

nicolas-cerardi commented 3 years ago

Deactivating the patch solves the issue.