Michael-Beukman closed this 2 months ago
Thanks for finding this issue. I'm testing a branch with a fix for (1) and changing (2) to be consistent with the power-based temperature setting. I'll run a sweep to see the impact on performance and share the update.
In `src/minimax/util/rl/plr.py`, in the `_get_replay_dist` function, I think there may be two problems.

1. In `1/jnp.arange(self.buffer_size)`, there is a division by zero.
2. In `score_dist = scores/self.temp`, the score distribution is divided by the temperature, instead of being taken to the power of (1/temp) as in the original prioritised level replay paper.
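For reference, here is a minimal sketch of the two points, not the actual code in `plr.py`. The names `buffer_size`, `inv_ranks_bad`, `inv_ranks_ok` and `power_replay_dist` are made up for illustration, and I'm assuming ranks are meant to start at 1 and scores are non-negative.

```python
import jax.numpy as jnp

buffer_size = 4  # hypothetical value, for illustration only

# (1) jnp.arange(n) starts at 0, so the first entry is 1/0.
# JAX does not raise here; it silently yields inf.
inv_ranks_bad = 1 / jnp.arange(buffer_size)

# Starting the ranks at 1 keeps every weight finite.
inv_ranks_ok = 1 / jnp.arange(1, buffer_size + 1)


def power_replay_dist(scores, temp):
    """Hypothetical helper: power-based temperature, then normalise.

    Each score is raised to the power (1 / temp) instead of being
    divided by temp. Assumes scores are non-negative and not all zero.
    """
    weights = scores ** (1.0 / temp)
    return weights / weights.sum()


scores = jnp.array([1.0, 2.0, 1.0])
dist = power_replay_dist(scores, temp=0.5)  # weights are scores squared
```

Note that plain division by `temp` followed by a sum-normalisation would cancel the temperature entirely, whereas the power form makes the distribution sharper as `temp` goes down. Whether that applies here depends on what happens to `score_dist` afterwards, which I haven't checked.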