
Addressing Function Approximation Error in Actor-Critic Methods By: Scott Fujimoto, Herke van Hoof, David Meger #25

Open QiXuanWang opened 4 years ago

QiXuanWang commented 4 years ago

Link: Arxiv

This was first published in February 2018, at almost the same time as the first SAC publication, and was updated in October 2018. TD3 is very simple to implement, but its limitation is that it only works for continuous action spaces, since it is an enhanced version of DDPG.

There are many reference implementations of TD3.

Ref1: https://spinningup.openai.com/en/latest/algorithms/td3.html
Ref2: https://towardsdatascience.com/td3-learning-to-run-with-ai-40dfc512f93

Problem: In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies.
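One standard way to see where this bias comes from (my own illustration, not taken from the paper): if the learned value equals the true value plus zero-mean approximation noise, the max operator turns that noise into a systematic positive bias, because the max of noisy estimates is at least the noisy estimate at the true maximizer:

```latex
% \hat{Q} is the noisy estimate, \epsilon_a is zero-mean approximation error
\hat{Q}(s,a) = Q(s,a) + \epsilon_a, \qquad
\mathbb{E}_{\epsilon}\!\Big[\max_a \hat{Q}(s,a)\Big] \;\ge\; \max_a \mathbb{E}_{\epsilon}\!\big[\hat{Q}(s,a)\big] = \max_a Q(s,a).
```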

Innovation:

We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
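The core of the proposed fix is the clipped double-Q target combined with target policy smoothing. Roughly, with target critics Q_{θ'_1}, Q_{θ'_2} and target actor π_{φ'} (notation as in the paper), the shared critic target is:

```latex
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\!\big(s',\; \pi_{\phi'}(s') + \epsilon\big),
\qquad \epsilon \sim \operatorname{clip}\!\big(\mathcal{N}(0,\sigma),\, -c,\, c\big)
```

Taking the minimum over the two critics counters overestimation, and the clipped noise added to the target action smooths the value estimate over nearby actions.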

Comment: From OpenAI Spinning Up:

Quick Facts:

- TD3 is an off-policy algorithm.
- TD3 can only be used for environments with continuous action spaces.
- The Spinning Up implementation of TD3 does not support parallelization.

A major contribution is the use of so-called Clipped Double Q-Learning for actor-critic: the authors use two value networks for the critic, take the minimum of their target estimates, and update the single actor network (and the target networks) at a lower frequency than the critics.
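A minimal PyTorch sketch of how these three tricks fit together in one update step. This is not the authors' reference implementation; the network sizes, names, and hyperparameters here are illustrative placeholders.

```python
# Sketch of the three TD3 tricks: clipped double-Q, target policy smoothing,
# and delayed actor/target updates. Dimensions and hyperparameters are assumptions.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, max_action = 3, 1, 1.0

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor = mlp(obs_dim, act_dim)
critic1, critic2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
actor_t, critic1_t, critic2_t = map(copy.deepcopy, (actor, critic1, critic2))

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(
    list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)

gamma, tau, policy_noise, noise_clip, policy_delay = 0.99, 0.005, 0.2, 0.5, 2

def update(batch, step):
    s, a, r, s2, done = batch
    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (torch.tanh(actor_t(s2)) * max_action + noise).clamp(-max_action, max_action)
        # Clipped double-Q: take the minimum of the two target critics.
        q1_t = critic1_t(torch.cat([s2, a2], dim=-1))
        q2_t = critic2_t(torch.cat([s2, a2], dim=-1))
        y = r + gamma * (1 - done) * torch.min(q1_t, q2_t)

    # Both critics regress toward the same clipped target.
    q1 = critic1(torch.cat([s, a], dim=-1))
    q2 = critic2(torch.cat([s, a], dim=-1))
    critic_loss = ((q1 - y) ** 2).mean() + ((q2 - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed update: refresh the actor and all target networks only every
    # `policy_delay` critic updates.
    if step % policy_delay == 0:
        pi = torch.tanh(actor(s)) * max_action
        actor_loss = -critic1(torch.cat([s, pi], dim=-1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```

`update` would be called in a training loop on minibatches (s, a, r, s', done) sampled from a replay buffer, exactly as in DDPG; only the target computation and the delayed actor/target schedule differ.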