Problem:
The majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios.
Off-policy training is hard here because the changing behavior of the lower-level policy creates a non-stationary problem for the higher-level policy: old off-policy experience may exhibit different transitions conditioned on the same goals (Section 3.3).
Innovation/Contribution:
We propose to use off-policy experience for both higher- and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the effective action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge.
Also see Section 3.3.
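The off-policy correction can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the candidate scheme (original goal, observed state displacement, plus Gaussian samples around it) follows the paper's description, but the function names, the noise scale, and keeping the goal fixed across the segment (rather than applying the goal-transition function each step) are my simplifying assumptions.

```python
import numpy as np

def relabel_goal(states, actions, orig_goal, lower_policy, n_samples=8, rng=None):
    """Relabel a stored high-level action (goal) with the candidate goal that
    maximizes the likelihood of the low-level actions actually taken.

    states:  (c+1, s_dim) low-level states over one high-level step
    actions: (c, a_dim)   low-level actions taken over that step
    orig_goal: (s_dim,)   goal originally stored in the replay buffer
    lower_policy(s, g) -> deterministic low-level action (e.g. TD3 actor)
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = states[-1] - states[0]  # observed displacement over the segment
    # Candidates: original goal, the displacement, and Gaussian samples
    # centered on the displacement (noise scale 0.5 is an assumption).
    candidates = [np.asarray(orig_goal, dtype=float), delta]
    candidates += list(delta + rng.normal(scale=0.5, size=(n_samples, delta.shape[0])))

    def log_prob(goal):
        # With a deterministic policy plus Gaussian action noise, the
        # log-likelihood is proportional to the negative squared error
        # between stored actions and what the current policy would do.
        errs = [a - lower_policy(s, goal) for s, a in zip(states[:-1], actions)]
        return -0.5 * sum(np.sum(e * e) for e in errs)

    return max(candidates, key=log_prob)
```

Because the lower-level policy keeps changing during training, the stored goal may no longer induce the stored actions; relabeling picks a goal under which the current lower-level policy would most plausibly have produced them, so the high-level transition stays valid for off-policy updates.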
Conclusion:
Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.
Comments:
This is one of the important HRL papers that achieved SOTA results. The authors proposed a new algorithm called "maximum likelihood-based action relabeling". It uses DDPG/TD3 as the baseline algorithm and works mostly on continuous action spaces.
But some of the older papers referenced use discrete action spaces, and I think extending to …
Link: arxiv