Problem:
The majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios.
Off-policy training is hard here because the changing behavior of the lower-level policy creates a non-stationary problem for the higher-level policy: old off-policy experience may exhibit different transitions conditioned on the same goals (Section 3.3).
Innovation/Contribution:
We propose to use off-policy experience for both higher- and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the effective action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge.
Also see Section 3.3.
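The off-policy correction can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the candidate scheme (original goal, observed state displacement, plus Gaussian samples around it) follows the paper's description, but the function names, the noise scale, and keeping the goal fixed across the segment (rather than applying the goal-transition function each step) are my simplifying assumptions.

```python
import numpy as np

def relabel_goal(states, actions, orig_goal, lower_policy, n_samples=8, rng=None):
    """Relabel a stored high-level action (goal) with the candidate goal that
    maximizes the likelihood of the low-level actions actually taken.

    states:  (c+1, s_dim) low-level states over one high-level step
    actions: (c, a_dim)   low-level actions taken over that step
    orig_goal: (s_dim,)   goal originally stored in the replay buffer
    lower_policy(s, g) -> deterministic low-level action (e.g. TD3 actor)
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = states[-1] - states[0]  # observed displacement over the segment
    # Candidates: original goal, the displacement, and Gaussian samples
    # centered on the displacement (noise scale 0.5 is an assumption).
    candidates = [np.asarray(orig_goal, dtype=float), delta]
    candidates += list(delta + rng.normal(scale=0.5, size=(n_samples, delta.shape[0])))

    def log_prob(goal):
        # With a deterministic policy plus Gaussian action noise, the
        # log-likelihood is proportional to the negative squared error
        # between stored actions and what the current policy would do.
        errs = [a - lower_policy(s, goal) for s, a in zip(states[:-1], actions)]
        return -0.5 * sum(np.sum(e * e) for e in errs)

    return max(candidates, key=log_prob)
```

Because the lower-level policy keeps changing during training, the stored goal may no longer induce the stored actions; relabeling picks a goal under which the current lower-level policy would most plausibly have produced them, so the high-level transition stays valid for off-policy updates.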
Conclusion:
Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.
Comments:
This is one of the important HRL papers that achieved SOTA results. The authors proposed a new algorithm called "maximum likelihood-based action relabeling". It uses DDPG/TD3 as the baseline algorithm and works mostly on continuous action spaces.
But some of the older papers referenced use discrete action spaces, and I think extending to …
Link: arxiv