Comment:
Published in May 2020. This is a tutorial paper on offline RL.
Problem:
Offline RL: reinforcement learning algorithms that utilize previously collected data, without
additional online data collection.
However, the fact that reinforcement learning algorithms provide a fundamentally online learning
paradigm is also one of the biggest obstacles to their widespread adoption.
Summary:
RL algorithm category:
1.1 Policy Gradient
1.2 Approximate dynamic programming (Yu: value-function methods, e.g., DQN)
1.3 Actor-Critic
1.4 Model-based reinforcement learning (Yu: why is it a separate category?)
Offline RL
Q-learning algorithms, actor-critic algorithms that utilize Q-functions, and many model-based reinforcement learning algorithms are off-policy algorithms. However, off-policy algorithms still often employ additional interaction (i.e., online data collection) during the learning process. Therefore, the term “fully off-policy” is sometimes used to indicate that no additional online data collection is performed.
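A minimal sketch of the "fully off-policy" setting above: tabular Q-learning run over a fixed transition buffer, with no environment interaction at all. The tiny MDP and the logged transitions here are hypothetical, invented only for illustration.

```python
import numpy as np

# Tabular Q-learning over a fixed dataset: "fully off-policy" in the sense
# above, since no new transitions are collected during learning.
# The toy MDP (4 states, 2 actions) and logged transitions are made up.
n_states, n_actions = 4, 2

# Hypothetical (s, a, r, s', done) transitions from an unknown behavior policy.
dataset = [
    (0, 0, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (2, 0, 1.0, 3, True),
    (0, 1, 0.0, 2, False),
    (2, 0, 1.0, 3, True),
]

Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.99, 0.5

for _ in range(200):  # repeated sweeps over the static buffer
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])  # standard Q-learning update
```

Swapping the table for a neural network and the full sweeps for minibatch sampling gives the familiar deep variant; the point is only that the update never queries the environment.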
2.4 What Makes Offline Reinforcement Learning Difficult?
A more subtle but practically more important challenge is making and answering counterfactual queries. Counterfactual queries are, intuitively, “what if” questions. The fundamental challenge with making such counterfactual queries is distributional shift: while our function
approximator (policy, value function, or model) might be trained under one distribution, it will be evaluated on a different distribution.
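A toy illustration of this distributional-shift failure mode (all numbers are made up): a Q-function is only ever fit on the actions that appear in the dataset, but greedy action selection takes a max over all actions, including ones whose estimated values are pure extrapolation.

```python
import numpy as np

# Counterfactual query under distributional shift: Q-values are trained only
# on in-dataset actions, but argmax is taken over every action.
n_actions = 5
true_q = np.array([1.0, 0.2, 0.2, 0.2, 0.2])  # action 0 is truly best

q_hat = np.zeros(n_actions)
q_hat[0], q_hat[1] = 1.0, 0.2                  # fit on observed actions 0 and 1
q_hat[2], q_hat[3], q_hat[4] = 3.1, -2.0, 0.5  # never-trained extrapolations

greedy = int(np.argmax(q_hat))  # selects action 2
# The chosen action was never seen in the data, and its estimated value (3.1)
# exceeds the best true value (1.0): the "what if" answer is fiction.
```

This is exactly the train/evaluate distribution mismatch described above: the estimator is queried where it has no training data, and the max operator preferentially seeks out such errors.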
3 Offline Evaluation and Reinforcement Learning via Importance Sampling
Yu: Is PPO one of these?
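A sketch of the ordinary per-trajectory importance-sampling estimator for offline evaluation: weight each trajectory's return by the product of per-step probability ratios between the target and behavior policies. The policies and trajectories below are hypothetical, and the behavior-policy probabilities are assumed to be logged.

```python
import numpy as np

# Per-trajectory (ordinary) importance sampling for off-policy evaluation.
# Policies are tables of action probabilities over a 3-state, 2-action toy MDP.
behavior = np.array([[0.5, 0.5]] * 3)  # beta(a|s), the data-collection policy
target   = np.array([[0.9, 0.1]] * 3)  # pi(a|s), the policy to evaluate

# Each trajectory: (state, action, reward) tuples collected under `behavior`.
trajectories = [
    [(0, 0, 1.0), (1, 0, 1.0)],
    [(0, 1, 0.0), (2, 1, 0.0)],
]

def is_estimate(trajs, pi, beta, gamma=1.0):
    vals = []
    for traj in trajs:
        ratio, ret, disc = 1.0, 0.0, 1.0
        for s, a, r in traj:
            ratio *= pi[s, a] / beta[s, a]  # cumulative importance weight
            ret += disc * r
            disc *= gamma
        vals.append(ratio * ret)
    return float(np.mean(vals))

estimate = is_estimate(trajectories, target, behavior)
```

The product of ratios makes the estimator unbiased but high-variance as horizons grow, which is why the paper's Section 3 discusses variance-reduction variants (e.g., weighted and per-decision importance sampling).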
4 Offline Reinforcement Learning via Dynamic Programming
4.2 Distributional Shift in Offline Reinforcement Learning via Dynamic Programming
Yu: SAC is one of these.
5 Offline Model-Based Reinforcement Learning
5.1 Model Exploitation and Distribution Shift
5.3 Challenges and Open Problems
model-based reinforcement learning appears to be a natural fit for the offline RL problem
setting ...
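The model-exploitation problem of Section 5.1 can be sketched with a toy fit (true dynamics, dataset range, and model class are all invented here): a dynamics model trained on a narrow slice of states looks accurate in-distribution, but a planner that steers outside the data support sees compounding error.

```python
import numpy as np

# Model exploitation sketch: a linear dynamics model fit on a narrow state
# range is accurate there but badly wrong off the data support.
def true_step(s, a):
    # Unknown nonlinear dynamics (hypothetical).
    return s + a - 0.1 * s**2

# Offline data only covers states near 0, where the quadratic term is tiny.
S = np.linspace(-0.5, 0.5, 50)
A = np.zeros_like(S)
S_next = true_step(S, A)

# Least-squares fit of a linear model s' ~ w * s (good on this narrow support).
w = (S @ S_next) / (S @ S)

in_dist_err  = abs(w * 0.3 - true_step(0.3, 0.0))  # tiny error near the data
out_dist_err = abs(w * 5.0 - true_step(5.0, 0.0))  # large error far from it
```

A planner maximizing predicted return under the learned model will happily visit those far-from-data states where the model's optimism is unearned, which is why offline model-based methods typically penalize or truncate rollouts outside the data distribution.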
7 Discussion and Perspectives
As a result, the standard off-policy training methods in these two categories (importance sampling and dynamic programming) have generally proven unsuitable for the kinds of complex domains typically studied in modern deep reinforcement learning.
Key challenge in offline RL: distributional shift due to differences between the learned policy and the behavior policy.
It is also still an open theoretical question whether model-based RL methods can, even in theory, improve over model-free dynamic programming algorithms.
Link: https://arxiv.org/pdf/2005.01643v1.pdf