JaCoderX opened this issue 5 years ago
Implementing SAC would resolve this I think.
I opened this issue partly to share my experience and to record the current limitations. Incorrect value function estimation still looks like an open RL research issue.
But for sure, working on SAC would be an amazing way forward regardless.
> Incorrect value function estimation still looks like an open RL research issue.
If I remember correctly, the 'Distributional Q-learning' approach from researchers at Google Brain addresses this issue: https://www.youtube.com/watch?v=ba_l8IKoMvU BTW: spot the listener leaving at ~39:50 ))
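For context, a minimal sketch of the core distributional idea (C51-style), with illustrative names and shapes only, not btgym code: the value head outputs a categorical distribution over a fixed support of return "atoms" instead of a single expected return, and the scalar Q-value is the expectation of that distribution.

```python
import numpy as np

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
support = np.linspace(V_MIN, V_MAX, N_ATOMS)  # fixed return "atoms"

def q_values_from_logits(atom_logits):
    """atom_logits: array of shape (n_actions, N_ATOMS), raw per-action network outputs."""
    # Softmax over atoms -> a categorical return distribution per action.
    probs = np.exp(atom_logits - atom_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # The scalar Q-value is the expectation of that distribution.
    return (probs * support).sum(axis=-1)
```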
> Implementing SAC would resolve this I think.
I'm playing with the idea of giving an SAC implementation for btgym a try. It might be a bit of a stretch of my RL skill set, but it can be an interesting challenge in itself.
The benefits of working with well-established RL frameworks are clear. So, a few questions that come to mind in that regard:
@JacobHanouna,
> How can external RL algorithm repos be integrated with btgym? And is it even feasible under the current RL algorithmic part of btgym?
Just throw out the embedded algorithms and use btgym as a standalone gym-API environment; some refactoring may be necessary (e.g. btgym uses its own spaces, but those can easily be rolled back to standard gym spaces) to use it with frameworks like RLlib;
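For illustration only, here is a rough sketch of that kind of wrapping; the class, shapes, and observation handling are made up, not actual btgym code, and would need to be adapted to the concrete environment config:

```python
import gym
import numpy as np

class StandardSpacesWrapper(gym.Wrapper):
    """Hypothetical wrapper: exposes plain gym spaces so external agents
    (e.g. RLlib trainers) can consume the environment directly."""

    def __init__(self, env, obs_shape, n_actions):
        super().__init__(env)
        # Assumed shapes/sizes -- adjust to the wrapped environment.
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=obs_shape, dtype=np.float32)
        self.action_space = gym.spaces.Discrete(n_actions)

    def _to_array(self, obs):
        # Example only: cast/reshape whatever the wrapped env returns.
        return np.asarray(obs, dtype=np.float32).reshape(self.observation_space.shape)

    def reset(self, **kwargs):
        return self._to_array(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._to_array(obs), reward, done, info
```

A wrapped instance like this could then be registered with RLlib through its usual env-creator callback.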
> How can we wrap the current btgym-specific algorithms to become framework-agnostic?

Those have been intentionally adapted to the domain, while general implementations already exist elsewhere.
Seems this problem is deeper than I thought: https://bair.berkeley.edu/blog/2019/12/05/bear/
@Kismuz, I believe I have encountered a framework (A3C) limitation. While training a few of my recent models I noticed strange behavior. For the first part of training everything seems to work fine, as indicated by the TensorBoard metrics (total reward and value function increase while entropy decreases). After a couple of thousand steps the total reward and value function metrics no longer correlate. At first in a modest way (the value function continues to increase while the total reward hovers in place), but then what happens can be described as a policy breakdown (both metrics crash, entropy shoots up, and the agent's actions seem almost random).
I searched online to try to identify the problem. I now believe that the issue I'm experiencing is a well-known limitation of value function overestimation, as described in the following paper (along with a way to mitigate the problem):
Addressing Function Approximation Error in Actor-Critic Methods
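For concreteness, a rough sketch of the mitigation that paper (TD3) proposes, clipped double-Q learning: keep two independent critics and build the TD target from the minimum of their estimates to curb overestimation. The function names below are placeholders for whatever critic/policy networks one uses, not btgym code.

```python
import numpy as np

def clipped_double_q_target(reward, next_obs, done,
                            q1_target, q2_target, policy_target, gamma=0.99):
    """TD target: y = r + gamma * (1 - done) * min(Q1'(s', a'), Q2'(s', a'))."""
    next_action = policy_target(next_obs)      # a' from the target policy
    q1 = q1_target(next_obs, next_action)
    q2 = q2_target(next_obs, next_action)
    min_q = np.minimum(q1, q2)                 # pessimistic of the two critics
    return reward + gamma * (1.0 - done) * min_q
```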
The above solution also seems to be used in more advanced actor-critic frameworks (which is really interesting by itself): Soft Actor-Critic Algorithms and Applications
And a new paper (still unpublished) seems to take the solution one step further: Dynamically Balanced Value Estimates for Actor-Critic Methods