QiXuanWang / LearningFromTheBest

This project lists the best books, courses, tutorials, and methods for learning certain knowledge

ADVANTAGE-WEIGHTED REGRESSION: SIMPLE AND SCALABLE OFF-POLICY REINFORCEMENT LEARNING #48

Open QiXuanWang opened 3 years ago

QiXuanWang commented 3 years ago

Link: https://openreview.net/pdf?id=ToWi1RjuEr8 Link2: https://openreview.net/forum?id=H1gdF34FvS

Author: Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine

Submitted to ICLR 2021 / ICLR 2020. The reviews included a lot of discussion of its novelty.

Problem:

Arguably the simplest reinforcement learning methods are policy gradient algorithms (Sutton et al., 2000), which directly differentiate the expected return and perform gradient ascent. Unfortunately, these methods can be notoriously unstable and are typically on-policy, often requiring a substantial number of samples to learn effective behaviors.
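
As a toy illustration of the on-policy, high-variance update the paragraph describes, here is a minimal REINFORCE-style sketch on a 3-armed bandit; the reward values, learning rate, and step count are all illustrative assumptions, not from the paper:

```python
# Minimal policy-gradient (REINFORCE) sketch on a toy bandit.
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0, 0.5])   # hypothetical mean reward per action
logits = np.zeros(3)                        # policy parameters theta
lr = 0.1

for step in range(500):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax policy pi_theta(a)
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + rng.normal()      # noisy on-policy reward sample
    # gradient of log pi(a) w.r.t. the logits for a softmax policy:
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * r * grad_logp            # gradient ascent on expected return

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # concentrates on the best arm, but single-sample estimates are high-variance
```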

Innovation:

In this work, we propose advantage-weighted regression (AWR), a simple off-policy algorithm for model-free RL. Each iteration of the AWR algorithm simply consists of two supervised regression steps: one for training a value function baseline via regression onto cumulative rewards, and another for training the policy via weighted regression.

AWR is also able to learn from fully off-policy datasets, demonstrating comparable performance to state-of-the-art off-policy methods. While AWR is effective for a diverse suite of tasks, it is not yet as sample efficient as the most efficient off-policy algorithms. We believe that exploring techniques for improving sample efficiency and performance on fully off-policy learning can open opportunities to deploy these methods in real world domains.
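
To make the two regression steps concrete, here is a minimal tabular sketch of one AWR iteration; the buffer contents, β, and learning rates are illustrative assumptions, not the paper's actual setup:

```python
# One AWR iteration, assuming a replay buffer of (state, action, return)
# tuples, a tabular value function V, and a tabular softmax policy.
import numpy as np

n_states, n_actions, beta = 4, 3, 0.5
rng = np.random.default_rng(1)

# hypothetical off-policy buffer: states, actions, monte-carlo returns R_{s,a}
S = rng.integers(0, n_states, size=256)
A = rng.integers(0, n_actions, size=256)
R = rng.normal(size=256)

V = np.zeros(n_states)
logits = np.zeros((n_states, n_actions))

# Step 1: value regression onto cumulative rewards, min_V E[(R - V(s))^2]
for s, r in zip(S, R):
    V[s] += 0.05 * (r - V[s])               # SGD step on the squared error

# Step 2: advantage-weighted policy regression,
# max_pi E[ log pi(a|s) * exp((R - V(s)) / beta) ]
for s, a, r in zip(S, A, R):
    w = np.exp((r - V[s]) / beta)           # exponentiated advantage weight
    probs = np.exp(logits[s] - logits[s].max())
    probs /= probs.sum()
    grad = -probs
    grad[a] += 1.0                          # gradient of log pi(a|s)
    logits[s] += 0.01 * w * grad            # weighted log-likelihood ascent
```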

Then the M-step projects π∗ onto the space of parameterized policies by solving a supervised regression problem:

$$\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{s \sim d_{\pi_k}(s)} \, \mathbb{E}_{a \sim \pi_k(a|s)} \left[ \log \pi(a|s) \, \exp\left( \tfrac{1}{\beta} \mathcal{R}_{s,a} \right) \right]$$

The RWR update can be interpreted as fitting a new policy πk+1 to samples from the current policy πk, where the likelihood of each action is weighted by the exponentiated return for that action. AWR instead weights each action by its exponentiated advantage against the learned value baseline, using samples from a replay buffer D:

$$\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}} \, \mathbb{E}_{a \sim \mathcal{D}} \left[ \log \pi(a|s) \, \exp\left( \tfrac{1}{\beta} \left( \mathcal{R}_{s,a} - V(s) \right) \right) \right]$$
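
A quick numerical sketch of how the temperature β shapes these exponentiated weights; the advantage values are made up, and clipping extreme weights to a maximum (20 here, an illustrative choice) is a common practical stabilizer for this kind of update:

```python
# Effect of the temperature beta on exponentiated-advantage weights.
import numpy as np

A = np.array([-1.0, 0.0, 0.5, 2.0])        # hypothetical advantages R - V(s)
for beta in (0.1, 1.0):
    w = np.exp(A / beta)
    w = np.minimum(w, 20.0)                 # clip extreme weights for stability
    print(beta, w / w.sum())                # small beta -> mass on high-advantage actions
```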

Some good reviews:

-- Even though this paper does a good job of running a range of experiments, the selection of some benchmarks seems arbitrary. For example, for discrete action spaces the paper uses LunarLander, which is rarely used in other papers, making it very difficult to draw conclusions from these results. The common suite of 49 Atari 2600 games should have been used for comparison. The same is true of the experiments in Section 5.3, as those tasks are not well known.

-- The proposed method does not outperform previous off-policy methods on the MuJoCo tasks (Table 1). Since the main claim of this paper is a new off-policy method, outperforming previous off-policy methods is a fair expectation. The current results are not convincing enough.

-- There are significant overlaps between this paper and "Fitted Q-iteration by Advantage Weighted Regression", "Model-Free Preference-Based Reinforcement Learning", and "Reinforcement learning by reward-weighted regression for operational space control", which makes the contribution of this paper very incremental.

-- The authors used only 5 seeds for the MuJoCo experiments. Given how sensitive MuJoCo results are to different starting points, the experiments should have been run with at least 10 different seeds.