alex-petrenko / landmark-exploration

Attempt to develop a new RL algorithm for hard exploration problems
MIT License

Research: graph pruning and reset policy #25

Open alex-petrenko opened 5 years ago

alex-petrenko commented 5 years ago

This mostly applies to training on a single environment. We need to learn an exploration policy (one that knows how to expand the graph), but we also need to make progress, so we should not reset the graph too often.
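For concreteness, the cadence could look something like the sketch below. This is purely hypothetical (the class and names are not from this repo); it just illustrates "keep the graph most of the time, reset it rarely":

```python
# Hypothetical sketch, not the repo's actual API: keep the landmark graph
# across episodes so training can make progress, clearing it only on a
# fixed cadence so a stale graph doesn't lock in early mistakes.
class GraphResetSchedule:
    def __init__(self, reset_interval=50):
        self.reset_interval = reset_interval

    def should_reset(self, episode_idx):
        # Reset rarely: every `reset_interval` episodes, never on episode 0.
        return episode_idx > 0 and episode_idx % self.reset_interval == 0
```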

gautams3 commented 5 years ago

When should we reset the graph? Is this for a kidnapped-robot situation?

alex-petrenko commented 5 years ago

OK, let me explain. There are two types of problems you might want to solve with this kind of algorithm.

  1. You want to learn a generic exploration behavior and have access to a large number of similar environments sampled from the same distribution. In this setting, you want to reset your graph after every episode: every environment you sample is different, so the map from the previous episode is useless.

  2. The other type of problem is when you have a single big (unchanging) environment that you want to explore (like Montezuma's Revenge). In this case, you mostly want to preserve your graph across episodes, so that you can use it to navigate to the exploration frontier and start exploring right away. But in our approach the distance metric (which defines the graph) is trained together with the exploration policy, so as the metric improves you might want to rebuild parts of your graph from scratch (see the sketch after this list).
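To make the distinction concrete, here is a minimal sketch of how the two settings could drive the reset decision at the start of each episode. All names here (`ResetMode`, `graph.clear`, `graph.rebuild`, `agent.distance_net`) are hypothetical, not this repo's API:

```python
from enum import Enum, auto

class ResetMode(Enum):
    EVERY_EPISODE = auto()  # setting 1: fresh environment each episode
    PERSISTENT = auto()     # setting 2: one big unchanging environment

def on_episode_start(graph, agent, mode, episode_idx, rebuild_interval=100):
    if mode is ResetMode.EVERY_EPISODE:
        # The map from the previous episode is useless in a new environment.
        graph.clear()
    elif episode_idx > 0 and episode_idx % rebuild_interval == 0:
        # Occasionally re-derive the graph with the improved distance metric.
        graph.rebuild(agent.distance_net)
```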

A specific example: when you first start exploring, your policy is mostly random, and since the distance metric is trained on your own experience, observations labeled "far away" may actually be quite close in the environment (simply because a random policy struggles to reach anything). Then, as you train the policy to reach these landmarks, they become easily reachable, so you might want to rebuild the graph from scratch using the new distance metric. Also, if you sometimes start with an empty graph, you might learn a general exploration behavior that becomes useful later, when you discover other parts of the environment.
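For illustration, a "rebuild from scratch" could look roughly like the following sketch, assuming a distance network that scores pairs of observations. `distance_net.predict` and the threshold are assumptions, not code from this repo:

```python
import itertools

def rebuild_graph(landmark_obs, distance_net, edge_threshold=0.5):
    # Discard all old edges and re-estimate them pairwise with the current
    # (retrained) distance metric: landmarks an early, undertrained metric
    # judged "far away" may now be close, and vice versa.
    edges = []
    for i, j in itertools.combinations(range(len(landmark_obs)), 2):
        d = distance_net.predict(landmark_obs[i], landmark_obs[j])
        if d < edge_threshold:  # connect landmarks the metric deems reachable
            edges.append((i, j, d))
    return edges
```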

Overall, this is an open question; I don't really know what the best strategy is. We should discuss and experiment.