Closed frederikschubert closed 3 years ago
This still needs some tests and I will try to do a rudimentary performance comparison.
I just ran a quick test on Pendulum-v1 with TD3 (turquoise) and SAC (red), both with 5-step TD. Additionally, I ran SAC with `NStepEntropyRegularizer` (blue, red) and `EntropyRegularizer` (turquoise, pink), both with 5-step TD and regularization weights of 0.2 (the default in many SAC implementations) and 0.2/5 = 0.04.
At least after this single sample run per variant, the n-step entropy regularization seems to perform better, and the entropies also behave as expected, i.e. more steps in the entropy bonus lead to higher entropy.
> [...] more steps in the entropy bonus leads to higher entropy.
This is interesting. Theoretically the agents should optimize the same objective (given the same hparams), at least asymptotically. I would expect them to converge to roughly the same points (in terms of entropy etc). In other words, I would expect them to differ only in how fast they converge.
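For reference, the shared objective that both regularizers should asymptotically optimize is the standard maximum-entropy return (written here in common SAC notation, with regularization weight β; this is a generic statement of the objective, not a formula taken from this PR):

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
    \bigl( r(s_t, a_t) + \beta\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \bigr)\right]
```

The n-step and single-step variants differ only in how many of the γ^k-discounted entropy terms are included in each bootstrapped target, which is consistent with the expectation that they converge to the same point at different speeds.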
- a `record_extra_info` flag to the `NStep` tracer that records the intermediate states in the new `extra_info` field of `TransitionBatch`
- an `NStepEntropyRegularizer` in `SoftPG`
This PR contains an initial working implementation of the mechanism and sums up the discounted entropy bonuses of the states `s_t`, `s_{t+1}`, ..., `s_{t+n-1}` for the soft policy gradient regularization.
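The summed bonus described above can be sketched in a few lines (a minimal illustration, not the PR's actual code; `beta` stands for the regularization weight):

```python
def n_step_entropy_bonus(entropies, gamma, beta):
    """Discounted sum of entropy bonuses over the intermediate states.

    entropies[k] = H(pi(.|s_{t+k})) for k = 0 .. n-1; the return value is
    beta * sum_{k=0}^{n-1} gamma^k * H(pi(.|s_{t+k})).
    """
    return beta * sum(gamma ** k * h for k, h in enumerate(entropies))
```

Since the bonus now sums up to n discounted entropy terms instead of one, its magnitude grows with n, which may be one motivation for also trying the scaled weight 0.2/5 in the experiments above.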