Yes, in the current implementation, `PLRRunner` will sample domain-randomized levels after the first episode, whenever a rollout dimension completes. A simple fix would be to directly use `reset_state` in a separate runner, though new hyperparameters would have to be determined for the updated behavior.
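To make the idea concrete, here is a minimal sketch of what reusing the initial level state on auto-reset could look like. The `env_step` callable and the state pytrees are hypothetical stand-ins, not minimax's actual API:

```python
import jax
import jax.numpy as jnp

def step_with_fixed_reset(rng, env_state, action, reset_state, env_step):
    """Hypothetical auto-reset wrapper: when the episode ends, reset back to
    `reset_state` (the level this rollout dimension started on) instead of
    letting the environment sample a fresh DR level."""
    next_state, obs, reward, done = env_step(rng, env_state, action)

    # On `done`, swap every leaf of the env state back to the stored initial
    # level state, so training stays on the PLR-sampled level.
    next_state = jax.tree_util.tree_map(
        lambda reset_leaf, step_leaf: jnp.where(done, reset_leaf, step_leaf),
        reset_state,
        next_state,
    )
    return next_state, obs, reward, done
```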
However, one issue with this simple fix is that resetting to the first level per rollout dimension could mean training on significantly fewer distinct levels, depending on the average episode length. The previous PyTorch implementation resamples from the PLR buffer itself when an episode resets within a rollout. Implementing similar behavior here would be slightly more involved, as the PLR buffer is stateful and its state updates with every sample (due to the staleness scores). I sketched an approach a few months back, but haven't had time to implement it.
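To illustrate why, here is a rough toy sketch (an assumption-laden stand-in, not the actual minimax buffer) of a stateful level-replay buffer whose staleness changes on every draw; resampling mid-rollout would mean carrying this state through the rollout scan:

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp

class BufferState(NamedTuple):
    scores: jnp.ndarray     # learning-potential score per level slot
    staleness: jnp.ndarray  # steps since each slot was last sampled

def sample_level(rng, state, temperature=1.0, staleness_coef=0.5):
    """Toy PLR sample: the draw itself mutates the staleness values, so the
    updated BufferState must be threaded through any rollout that resamples."""
    score_dist = jax.nn.softmax(state.scores / temperature)
    stale_dist = state.staleness / jnp.maximum(state.staleness.sum(), 1)
    probs = (1 - staleness_coef) * score_dist + staleness_coef * stale_dist

    idx = jax.random.choice(rng, state.scores.shape[0], p=probs)

    # Every slot gets one step staler, except the one just sampled.
    slot_ids = jnp.arange(state.staleness.shape[0])
    new_staleness = jnp.where(slot_ids == idx, 0, state.staleness + 1)
    return idx, state._replace(staleness=new_staleness)
```

The awkward part is that the rollout `lax.scan` would then have to carry the `BufferState` (and its per-sample updates) alongside the env state, which is the bookkeeping overhead referred to above.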
Another possibility is to try out new heuristics for tracking staleness that require simpler bookkeeping in JAX. I think this approach may be the most practical and effective.
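As one example of what I mean (purely an assumption on my part, not something in the codebase): track only the global step at which each slot was last sampled, and recover staleness as a difference at sampling time. The per-sample update then reduces to a single scatter write:

```python
import jax.numpy as jnp

def staleness(last_sampled_step, current_step):
    """Staleness recovered on demand as `current_step - last_sampled_step`,
    instead of incrementing a staleness vector on every draw."""
    return current_step - last_sampled_step

def mark_sampled(last_sampled_step, idx, current_step):
    # The only bookkeeping a sample needs under this heuristic.
    return last_sampled_step.at[idx].set(current_step)
```

Since `mark_sampled` is a pure function of `(idx, current_step)`, it composes cleanly with `scan`/`vmap` without carrying extra state beyond the timestamp array.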
For now, I'll look into simply resetting to the first level per rollout batch. I'll run some sweeps to evaluate this change and share the results.
I was wondering about the `PLRRunner`'s rollouts. It seems to me that its `get_transition` function (which is the same as the `DRRunner`'s) does not use the `reset_state` argument of `env.step`. I think that means that when the environment is done, the auto-reset code triggers, which generates a new level randomly (i.e., a DR level). If that is the case, then the levels after the first episode are DR ones, which may cause problems for the MaxMC score calculation. It also means that the agent sometimes trains on randomly generated levels.