JaCoderX opened this issue 5 years ago
Implementing SAC would resolve this I think.
I opened this issue partly to share my experience and to record the current limitations. Incorrect value function estimation still looks like an open RL research issue.
But for sure, working on SAC would be an amazing way forward regardless.
> Incorrect value function estimation still looks like an open RL research issue.
If I remember correctly, the 'Distributional Q-learning' approach from researchers at Google Brain addresses this issue: https://www.youtube.com/watch?v=ba_l8IKoMvU BTW: spot the listener leaving at ~39:50 ))
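For context, a minimal sketch of the core distributional idea (C51-style), with illustrative names and shapes only, not btgym code: the value head outputs a categorical distribution over a fixed support of return "atoms" instead of a single expected return, and the scalar Q-value is the expectation of that distribution.

```python
import numpy as np

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
support = np.linspace(V_MIN, V_MAX, N_ATOMS)  # fixed return "atoms"

def q_values_from_logits(atom_logits):
    """atom_logits: array of shape (n_actions, N_ATOMS), raw per-action network outputs."""
    # Softmax over atoms -> a categorical return distribution per action.
    probs = np.exp(atom_logits - atom_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # The scalar Q-value is the expectation of that distribution.
    return (probs * support).sum(axis=-1)
```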
> Implementing SAC would resolve this I think.
I'm playing with the idea of giving an SAC implementation for btgym a try. It might be a bit of a stretch of my RL skill set, but it can be an interesting challenge in itself.
The benefits of working with well-established RL frameworks are clear. So, a few questions that come to mind in that regard:
@JacobHanouna,
> How can external RL algorithm repos be integrated with btgym? And is it even feasible under the current RL algorithmic part of btgym?
Just throw out the embedded algorithms and use btgym as a standalone gym-API environment; some refactoring may be necessary (e.g. btgym uses its own spaces, but those can easily be rolled back to standard gym spaces) to use it with frameworks like RLlib;
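For illustration only, here is a rough sketch of that kind of wrapping; the class, shapes, and observation handling are made up, not actual btgym code, and would need to be adapted to the concrete environment config:

```python
import gym
import numpy as np

class StandardSpacesWrapper(gym.Wrapper):
    """Hypothetical wrapper: exposes plain gym spaces so external agents
    (e.g. RLlib trainers) can consume the environment directly."""

    def __init__(self, env, obs_shape, n_actions):
        super().__init__(env)
        # Assumed shapes/sizes -- adjust to the wrapped environment.
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=obs_shape, dtype=np.float32)
        self.action_space = gym.spaces.Discrete(n_actions)

    def _to_array(self, obs):
        # Example only: cast/reshape whatever the wrapped env returns.
        return np.asarray(obs, dtype=np.float32).reshape(self.observation_space.shape)

    def reset(self, **kwargs):
        return self._to_array(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._to_array(obs), reward, done, info
```

A wrapped instance like this could then be registered with RLlib through its usual env-creator callback.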
> How can we wrap the current btgym-specific algorithms to become framework-agnostic?

Those have been intentionally adapted to the domain, while general implementations already exist elsewhere.
Seems this problem is deeper than I thought: https://bair.berkeley.edu/blog/2019/12/05/bear/
@Kismuz, I believe I have encountered a framework (A3C) limitation. While training a few of my recent models I noticed strange behavior. For the first part of training everything seems to work fine, as indicated by the TensorBoard metrics (total reward and value function increase while entropy decreases). After a couple of thousand steps the total reward and value function metrics no longer correlate. At first in a modest way (the value function continues to increase while the total reward hovers in place), but then what happens can be described as a policy breakdown (both metrics crash, entropy shoots up, and the agent's actions seem almost random).
I searched online to try to identify the problem. I now believe that the issue I'm experiencing is a well-known limitation of value function overestimation, as described in the following paper (along with a way to mitigate the problem):
Addressing Function Approximation Error in Actor-Critic Methods
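For concreteness, a rough sketch of the mitigation that paper (TD3) proposes, clipped double-Q learning: keep two independent critics and build the TD target from the minimum of their estimates to curb overestimation. The function names below are placeholders for whatever critic/policy networks one uses, not btgym code.

```python
import numpy as np

def clipped_double_q_target(reward, next_obs, done,
                            q1_target, q2_target, policy_target, gamma=0.99):
    """TD target: y = r + gamma * (1 - done) * min(Q1'(s', a'), Q2'(s', a'))."""
    next_action = policy_target(next_obs)      # a' from the target policy
    q1 = q1_target(next_obs, next_action)
    q2 = q2_target(next_obs, next_action)
    min_q = np.minimum(q1, q2)                 # pessimistic of the two critics
    return reward + gamma * (1.0 - done) * min_q
```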
The above solution also seems to be used in more advanced actor-critic frameworks (which is really interesting by itself): Soft Actor-Critic Algorithms and Applications
And a new paper (still unpublished) seems to take the solution one step further: Dynamically Balanced Value Estimates for Actor-Critic Methods