This paper was published on 27 Feb 2019, so it is earlier than the UCB MBRL benchmark paper (#5), and it is cited by the "When to Trust Your Model: Model-Based Policy Optimization" paper (#9).
Problem:
Model-based reinforcement learning (RL) is considered a promising approach to reducing the sample complexity that hinders model-free RL. However, the theoretical understanding of such methods has been rather limited.
Innovation:
This paper introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees. We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and then maximizes the lower bound jointly over the policy and the model.
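Concretely (my paraphrase of the paper's construction, with $V^{\pi, M}$ the expected return of policy $\pi$ under dynamics $M$, $M^\star$ the true dynamics, $\widehat{M}$ the learned model, and $D$ the discrepancy bound), the lower bound and the joint update look like:

```latex
% Lower bound on the true return, valid for policies \pi
% in a neighborhood of the reference policy \pi_k:
V^{\pi, M^\star} \;\ge\; V^{\pi, \widehat{M}} - D_{\pi_k}\bigl(\widehat{M}, \pi\bigr)

% Meta-algorithm: maximize this lower bound jointly over policy and model:
(\pi_{k+1}, M_{k+1}) \;=\; \operatorname*{arg\,max}_{\pi,\, M}\; V^{\pi, M} - D_{\pi_k}(M, \pi)
```

Since the discrepancy vanishes at the true dynamics ($D_{\pi_k}(M^\star, \pi_k) = 0$), the bound is tight at the current iterate, so each maximization step can only raise a valid lower bound on the true return; this is what yields the monotone-improvement guarantee.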
Comment:
Review 1: The paper proposes a framework for designing model-based RL algorithms. The framework is based on OFU (optimism in the face of uncertainty), and within this framework the authors develop an algorithm (a variant of SLBO) that achieves SOTA performance on MuJoCo tasks.
Response 2: Indeed, our framework can capture all parameterized models (including linear models and even tabular MDPs); however, our focus is on non-linear models. The distinction from previous papers is that ours is the first framework that can show monotone improvement and handle uncertainty quantification (via a discrepancy bound) for non-linear models.
Link: OpenReview
Code: https://github.com/roosephu/slbo
The SLBO algorithm roughly alternates between (i) fitting the dynamics model on all real trajectories collected so far and (ii) improving the policy (via TRPO in the paper) on rollouts from the learned model.
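As an illustration only (not the authors' code: the 1-D linear system, least-squares model fit, and grid-search policy step are my simplifications, whereas the paper uses neural dynamics models with a multi-step loss and TRPO), here is a minimal runnable sketch of the SLBO-style outer loop: collect real data, refit the model, then optimize the policy inside the learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE, B_TRUE = 0.9, 0.5  # unknown true linear dynamics: s' = A*s + B*u

def rollout_real(k, horizon=20):
    """Collect (s, u, s') transitions from the real system with policy u = -k*s."""
    s, data = 1.0, []
    for _ in range(horizon):
        u = -k * s + 0.01 * rng.standard_normal()  # small exploration noise
        s_next = A_TRUE * s + B_TRUE * u
        data.append((s, u, s_next))
        s = s_next
    return data

def fit_model(data):
    """Least-squares fit of (a, b) in s' ~ a*s + b*u from real transitions."""
    X = np.array([[s, u] for s, u, _ in data])
    y = np.array([s_next for _, _, s_next in data])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return a, b

def model_return(k, a, b, horizon=20):
    """Return of policy u = -k*s rolled out inside the learned model (reward = -s^2)."""
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        s = a * s + b * (-k * s)
        ret += -s * s
    return ret

# SLBO-style outer loop: collect real data -> refit model on all data so far
# -> improve the policy against the learned model (grid search stands in for TRPO).
k, data = 0.0, []
for _ in range(3):
    data += rollout_real(k)
    a_hat, b_hat = fit_model(data)
    candidates = np.linspace(0.0, 3.0, 61)
    k = max(candidates, key=lambda c: model_return(c, a_hat, b_hat))
```

Because the toy system is deterministic, the least-squares fit recovers the true dynamics, and the policy step drives the closed-loop gain `a - b*k` toward zero; the real algorithm replaces both steps with gradient-based model fitting and TRPO, interleaved for several inner iterations per batch of real data.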