QiXuanWang / LearningFromTheBest

This project lists the best books, courses, tutorials, and methods for learning certain topics.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor By: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine #24


QiXuanWang commented 4 years ago

Link: arXiv. Follow-up paper: Soft Actor-Critic Algorithms and Applications.

Ref1: https://spinningup.openai.com/en/latest/algorithms/sac.html
Ref2: https://bair.berkeley.edu/blog/2018/12/14/sac/
Ref3: https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665

This was published in January 2018, almost at the same time as TD3, and it is the state-of-the-art (SOTA) algorithm for a lot of continuous control problems. Continuous because it inherits the DDPG approach. Has it stayed SOTA for model-free RL (MFRL) until now? Apparently some model-based RL (MBRL) algorithms proposed in 2019 achieved better performance, as in #9.

Problem:

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning.

Innovation: we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework.
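
For context, the maximum entropy framework augments the usual expected return with the policy's entropy at every visited state, weighted by a temperature α that trades off reward against exploration. My transcription of the objective, following the paper's notation:

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```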

Comment:

OpenAI Spinning Up:

Quick Facts:

- SAC is an off-policy algorithm.
- The version of SAC implemented there can only be used for environments with continuous action spaces (see the policy sketch below).
- An alternate version of SAC, which slightly changes the policy update rule, can be implemented to handle discrete action spaces.
- The Spinning Up implementation of SAC does not support parallelization.
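
A minimal sketch of how the continuous action space is usually handled in SAC implementations: a squashed Gaussian policy that draws a reparameterized sample and applies a tanh change-of-variables correction to the log-probability. This is my own PyTorch illustration, not the Spinning Up code; the class name, layer sizes, and clamping bounds are illustrative.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class SquashedGaussianActor(nn.Module):
    """Illustrative SAC-style actor: Gaussian policy squashed by tanh into [-1, 1]."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)       # mean of the Gaussian
        self.log_std = nn.Linear(hidden, act_dim)  # state-dependent log std

    def forward(self, obs):
        h = self.net(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = Normal(mu, log_std.exp())
        u = dist.rsample()                 # reparameterized sample, so gradients flow
        a = torch.tanh(u)                  # squash into the continuous action range
        # Change-of-variables correction for the tanh squashing.
        log_prob = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, log_prob
```

As I understand it, the reparameterized sample is what makes this update work for continuous actions; a discrete-action variant can instead take an expectation over all actions directly, which is the "slightly changed policy update rule" mentioned above.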

Towards Data Science:

If a random variable can be any Real Number with equal probability then it has very high entropy, as it is very unpredictable. Why do we want our policy to have high entropy? We want a high entropy in our policy to explicitly encourage exploration, to encourage the policy to assign equal probabilities to actions that have the same or nearly equal Q-values, and also to ensure that it does not collapse into repeatedly selecting a particular action that could exploit some inconsistency in the approximated Q function. Therefore, SAC overcomes the brittleness problem by encouraging the policy network to explore and not assign a very high probability to any one part of the range of actions.
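
To make that concrete, here is a toy comparison of a uniform versus a peaked distribution, followed by the kind of actor loss SAC minimizes, where the α·log π term is exactly the entropy bonus that keeps the policy from collapsing onto one action. This is an illustration under my own assumptions: `actor`, `q1`, `q2`, and `alpha` are hypothetical objects (e.g. as in the sketch above), not names from the paper or any library.

```python
import torch
from torch.distributions import Categorical

# A spread-out distribution is unpredictable (high entropy); a peaked one is not.
uniform = Categorical(probs=torch.tensor([0.25, 0.25, 0.25, 0.25]))
peaked  = Categorical(probs=torch.tensor([0.97, 0.01, 0.01, 0.01]))
print(uniform.entropy(), peaked.entropy())  # ~1.39 vs ~0.17 nats

def actor_loss(obs):
    # `actor`, `q1`, `q2`, `alpha` are assumed to be defined elsewhere (hypothetical).
    a, logp = actor(obs)                      # squashed Gaussian sample + its log-prob
    q = torch.min(q1(obs, a), q2(obs, a))     # clipped double-Q estimate
    # Minimizing (alpha * logp - q) maximizes Q plus alpha * entropy, so probability
    # mass stays spread over all near-optimal actions instead of collapsing onto one.
    return (alpha * logp - q).mean()
```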