araffin opened this issue 4 years ago
@PartiallyTyped I thought about that one, and we just need to change the sampling, not the storage, no? (as a first approximation)
What I mean: at sampling time, we could re-create the trajectory (until a done is found or the buffer ends) by simply going through the indices.
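A minimal sketch of that idea, assuming flat `rewards`/`dones` arrays and a circular buffer of size `buffer_size` (all names hypothetical):

```python
def n_step_return(rewards, dones, start, n, gamma, buffer_size):
    # Walk forward from `start` until a done is found or n steps are taken.
    # Caveat: a full implementation must also stop at the buffer's current
    # write position, so the walk never crosses into another episode.
    ret = 0.0
    idx = start
    for k in range(n):
        ret += (gamma ** k) * rewards[idx]
        if dones[idx]:
            return ret, idx, True   # terminal: no bootstrapping
        idx = (idx + 1) % buffer_size
    # Not terminal: bootstrap from the observation at `idx`, weighted by gamma ** n
    return ret, idx, False
```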
This approach sounds better than what I initially came up with, seems to have fewer moving parts and will be easier to reason about. I will get on it once v1.0 is released.
How would you like this to be implemented? As a wrapper around the buffer, as a class derived from the buffer, or as its own object that adheres to the buffer API?
A class that derives from the replay buffer class seems the natural option, I would say.
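For reference, a rough skeleton of what that could look like against SB3's `ReplayBuffer` (a sketch only; the actual method signatures may differ across versions, and the n-step walk would reuse the logic sketched above):

```python
from stable_baselines3.common.buffers import ReplayBuffer

class NStepReplayBuffer(ReplayBuffer):
    """Sketch: storage is unchanged, n-step returns are built at sampling time."""

    def __init__(self, *args, n_steps: int = 3, gamma: float = 0.99, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_steps = n_steps
        self.gamma = gamma

    def _get_samples(self, batch_inds, env=None):
        # For each index in `batch_inds`, walk forward through
        # self.rewards / self.dones (as in the snippet above) to build
        # the n-step reward and the matching next observation, then
        # return ReplayBufferSamples with those fields replaced.
        raise NotImplementedError
```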
As an update, I have an experimental version of SAC + Peng's Q(λ) in the contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/peng-q-lambda. I'm using an adapted version of the HER replay buffer (which stores transitions by episode) that can probably be updated easily to an n-step buffer (in fact, lambda=1 is the n-step version). I also had to hack SAC a bit to get access to the actor and the target q-values.
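For context, the Peng's Q(λ) target over one stored episode can be computed with a simple backward recursion; a sketch under assumed shapes/names, where `next_values` would come from the target critic (for SAC, the expected Q under the policy including the entropy term):

```python
import numpy as np

def peng_q_lambda_targets(rewards, next_values, dones, gamma, lam):
    """Backward recursion over one episode:
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})
    lam = 0 gives the usual one-step TD target; lam = 1 collapses to the
    Monte Carlo / n-step return mentioned above."""
    num_steps = len(rewards)
    targets = np.zeros(num_steps)
    g = next_values[-1]  # bootstrap value after the last stored step
    for t in reversed(range(num_steps)):
        if dones[t]:
            targets[t] = rewards[t]  # terminal: no bootstrapping
        else:
            targets[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g)
        g = targets[t]
    return targets
```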
Original repo by @robintyh1: https://github.com/robintyh1/icml2021-pengqlambda
Originally posted by @PartiallyTyped in https://github.com/hill-a/stable-baselines/issues/821:

> N-step returns allow for much better stability and improve performance when training DQN, DDPG, etc., so it will be quite useful to have this feature.
>
> A simple implementation of this would be a wrapper around ReplayBuffer, so it would work with both prioritized and uniform sampling. The wrapper keeps a queue of observed experiences, computes the returns, and adds the experience to the buffer.
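A minimal sketch of that wrapper idea, assuming the wrapped buffer exposes `add(obs, next_obs, action, reward, done)` (signature hypothetical; SB3's varies across versions):

```python
from collections import deque

class NStepWrapper:
    """Sketch of the quoted idea: keep a queue of recent experiences,
    fold them into one n-step transition, and push that into the wrapped
    buffer (works the same for uniform or prioritized sampling)."""

    def __init__(self, buffer, n_steps: int = 3, gamma: float = 0.99):
        self.buffer = buffer
        self.n_steps = n_steps
        self.gamma = gamma
        self.queue = deque()

    def add(self, obs, next_obs, action, reward, done):
        self.queue.append((obs, next_obs, action, reward, done))
        if done:
            # End of episode: flush everything, the tails get shorter returns.
            while self.queue:
                self._emit()
        elif len(self.queue) == self.n_steps:
            self._emit()

    def _emit(self):
        obs, _, action, _, _ = self.queue[0]
        ret = sum((self.gamma ** k) * r for k, (_, _, _, r, _) in enumerate(self.queue))
        _, next_obs, _, _, done = self.queue[-1]
        # Note: the learner should bootstrap with gamma ** len(self.queue),
        # so a full implementation would store that exponent as well.
        self.buffer.add(obs, next_obs, action, ret, done)
        self.queue.popleft()
```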
Roadmap: v1.1+ (see #1)