Closed frederikschubert closed 3 years ago
This still needs some tests and I will try to do a rudimentary performance comparison.
I just ran a quick test on Pendulum-v1 with TD3 (turquoise) and SAC (red), both with 5-step TD. Additionally, I ran SAC with `NStepEntropyRegularizer` (blue, red) and `EntropyRegularizer` (turquoise, pink), both with 5-step TD and regularization weights of 0.2 (the default in many SAC implementations) and 0.2/5 = 0.04.
At least after this single sample run per variant, the n-step entropy regularization seems to perform better, and the entropies also behave as expected, i.e. more steps in the entropy bonus lead to higher entropy.
> [...] more steps in the entropy bonus leads to higher entropy.
This is interesting. Theoretically the agents should optimize the same objective (given the same hparams), at least asymptotically. I would expect them to converge to roughly the same points (in terms of entropy etc). In other words, I would expect them to differ only in how fast they converge.
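For reference, the shared objective that both regularizers should asymptotically optimize is the standard maximum-entropy return (written here in common SAC notation, with regularization weight β; this is a generic statement of the objective, not a formula taken from this PR):

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
    \bigl( r(s_t, a_t) + \beta\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \bigr)\right]
```

The n-step and single-step variants differ only in how many of the γ^k-discounted entropy terms are included in each bootstrapped target, which is consistent with the expectation that they converge to the same point at different speeds.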
- a `record_extra_info` flag to the `NStep` tracer that records the intermediate states in the new `extra_info` field of `TransitionBatch`
- an `NStepEntropyRegularizer` in `SoftPG`
This PR contains an initial working implementation of the mechanism and sums up the discounted entropy bonuses of the states `s_t`, `s_{t+1}`, ..., `s_{t+n-1}` for the soft policy gradient regularization.
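The summed bonus described above can be sketched in a few lines (a minimal illustration, not the PR's actual code; `beta` stands for the regularization weight):

```python
def n_step_entropy_bonus(entropies, gamma, beta):
    """Discounted sum of entropy bonuses over the intermediate states.

    entropies[k] = H(pi(.|s_{t+k})) for k = 0 .. n-1; the return value is
    beta * sum_{k=0}^{n-1} gamma^k * H(pi(.|s_{t+k})).
    """
    return beta * sum(gamma ** k * h for k, h in enumerate(entropies))
```

Since the bonus now sums up to n discounted entropy terms instead of one, its magnitude grows with n, which may be one motivation for also trying the scaled weight 0.2/5 in the experiments above.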