Exploration access to environment for forward simulation

IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms

https://intellabs.github.io/coach/

Apache License 2.0

2.32k stars, 460 forks

Exploration access to environment for forward simulation #237

Open redknightlois opened 5 years ago

redknightlois commented 5 years ago

Hi,

I stumbled upon the following potential improvement. I am hacking around it right now, but it would be great to have a proper solution. MCTS and other forward-simulation techniques need access to clones of the environment to execute rollouts. There is currently no way to pass the instantiated environment to the Exploration Policies so that they can perform the forward search.

For the purpose of illustration, this is the hack:

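# Ensure the graph, including its environments, has been created.
graph_manager.verify_graph_was_created()
env = graph_manager.environments[0]
# set_environment is assumed to be a method added to the custom exploration policy for this hack.
graph_manager.top_level_manager.agents['agent'].exploration_policy.set_environment(env)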

Being able to pass the instantiated environment, as suggested in https://github.com/NervanaSystems/coach/issues/212, would be a potential workaround, although not a solution.
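To make the request more concrete, here is a minimal, self-contained sketch of what forward simulation on environment clones could look like. It does not use Coach at all: `ToyChainEnv`, `RolloutExploration`, and `get_action` are hypothetical names made up for illustration, `set_environment` simply mirrors the hack above, and `copy.deepcopy` is only one assumed way of cloning (a real environment may need its own save/restore mechanism).

```python
import copy
import random


class ToyChainEnv:
    """Hypothetical stand-in environment: walk along a line, reward at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def step(self, action):
        # Action 1 moves right, any other action moves left (clamped to the chain).
        delta = 1 if action == 1 else -1
        self.position = max(0, min(self.length, self.position + delta))
        done = self.position == self.length
        reward = 1.0 if done else 0.0
        return self.position, reward, done


class RolloutExploration:
    """Hypothetical exploration policy that scores actions with random rollouts on clones."""

    def __init__(self, num_actions, rollouts=20, depth=10, seed=0):
        self.num_actions = num_actions
        self.rollouts = rollouts
        self.depth = depth
        self.rng = random.Random(seed)
        self.env = None

    def set_environment(self, env):
        # Mirrors the hack: the policy keeps a reference to the live environment.
        self.env = env

    def _rollout_return(self, first_action):
        # Clone first, so the live environment is never stepped by the search.
        sim = copy.deepcopy(self.env)
        _, total, done = sim.step(first_action)
        for _ in range(self.depth - 1):
            if done:
                break
            _, reward, done = sim.step(self.rng.randrange(self.num_actions))
            total += reward
        return total

    def get_action(self):
        # Average return of several rollouts per candidate first action.
        scores = [
            sum(self._rollout_return(a) for _ in range(self.rollouts)) / self.rollouts
            for a in range(self.num_actions)
        ]
        return max(range(self.num_actions), key=lambda a: scores[a])
```

With something like this, calling `policy.set_environment(env)` followed by `policy.get_action()` runs the rollouts only on copies, so `env` itself stays in its current state. A full MCTS would replace the flat random rollouts with a search tree, but it would need the same cloning access.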

gal-leibovich commented 5 years ago

@guyk1971

gal-leibovich commented 5 years ago

We have purposefully encapsulated the environment and have hidden it from the agent. All the interaction between the two is managed through the level manager. The goal was to allow for more complex scenarios than standard RL, such as Hierarchical Reinforcement Learning, self-play, multi-agent RL, etc.

@guyk1971 is also looking into adding MCTS support to Coach. He might have more insights to share here. If we can limit the agent's access to the environment, that might be preferred (from a software encapsulation perspective, and in order to increase the framework's robustness).
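As a rough illustration of what limited access could mean, the sketch below has a mediator hand out a handle that can only produce clones, so a search procedure never touches the live environment. This is not how Coach is implemented: `CounterEnv`, `SimulatorHandle`, `ToyLevelManager`, and the `simulator()` method are all hypothetical names, and `copy.deepcopy` is an assumed cloning mechanism.

```python
import copy


class CounterEnv:
    """Hypothetical toy environment whose state is a single counter."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action
        return self.state


class SimulatorHandle:
    """Hypothetical restricted handle: exposes clones, never the live environment."""

    def __init__(self, env):
        self._env = env

    def clone(self):
        # Each call returns an independent copy that is safe to step in a rollout.
        return copy.deepcopy(self._env)


class ToyLevelManager:
    """Hypothetical mediator, loosely inspired by the level manager described above."""

    def __init__(self, env):
        self._env = env

    def simulator(self):
        # Agents and exploration policies would receive only this handle.
        return SimulatorHandle(self._env)

    def step(self, action):
        # Real interaction with the environment still goes through the manager.
        return self._env.step(action)
```

Under these assumptions, a policy could call `manager.simulator().clone()` as often as it needs for rollouts, while every real step still goes through `manager.step(...)`.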

I think that in #212, passing the instantiated environment refers to initializing the environment outside of Coach. So it still wouldn't be available to the agent or to the exploration policy. But I might be wrong.

redknightlois commented 5 years ago

@galleibo-intel @guyk1971 I stumbled across the Go-Explore paper (https://arxiv.org/pdf/1901.10995.pdf). You should seriously take a look at it in the context of supporting entirely different scenarios. The workflow is so different from anything available in Coach that if you devise a way to make it work, everything else should be pretty easy to implement on top of it.

EDIT: Along the same lines, the POET paper can give some further hints about operators on training workflows. https://arxiv.org/abs/1901.01753

gal-leibovich commented 5 years ago

Thanks @redknightlois. At the moment we do not have plans for large-scale architectural changes to the framework.