araffin opened this issue 4 years ago
@PartiallyTyped I thought about that one, and we just need to change the sampling, not the storage, no? (as a first approximation)
What I mean: at sampling time, we could re-create the trajectory (until a done is found or the buffer ends) by simply going through the indices.
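A minimal sketch of that idea, assuming flat `rewards`/`dones` arrays and a circular buffer of size `buffer_size` (all names hypothetical):

```python
def n_step_return(rewards, dones, start, n, gamma, buffer_size):
    # Walk forward from `start` until a done is found or n steps are taken.
    # Caveat: a full implementation must also stop at the buffer's current
    # write position, so the walk never crosses into another episode.
    ret = 0.0
    idx = start
    for k in range(n):
        ret += (gamma ** k) * rewards[idx]
        if dones[idx]:
            return ret, idx, True   # terminal: no bootstrapping
        idx = (idx + 1) % buffer_size
    # Not terminal: bootstrap from the observation at `idx`, weighted by gamma ** n
    return ret, idx, False
```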
This approach sounds better than what I initially came up with, seems to have fewer moving parts and will be easier to reason about. I will get on it once v1.0 is released.
How would you like this to be implemented? As a wrapper around the buffer, as a class derived from the buffer, or as its own object that adheres to the buffer API?
A class that derives from the replay buffer class seems the natural option, I would say.
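For reference, a rough skeleton of what that could look like against SB3's `ReplayBuffer` (a sketch only; the actual method signatures may differ across versions, and the n-step walk would reuse the logic sketched above):

```python
from stable_baselines3.common.buffers import ReplayBuffer

class NStepReplayBuffer(ReplayBuffer):
    """Sketch: storage is unchanged, n-step returns are built at sampling time."""

    def __init__(self, *args, n_steps: int = 3, gamma: float = 0.99, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_steps = n_steps
        self.gamma = gamma

    def _get_samples(self, batch_inds, env=None):
        # For each index in `batch_inds`, walk forward through
        # self.rewards / self.dones (as in the snippet above) to build
        # the n-step reward and the matching next observation, then
        # return ReplayBufferSamples with those fields replaced.
        raise NotImplementedError
```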
As an update, I have an experimental version of SAC + Peng's Q(λ) in the contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/peng-q-lambda. I'm using an adapted version of the HER replay buffer (which stores transitions by episode) that can probably be updated easily to an n-step buffer (in fact, lambda=1 is the n-step version). I also had to hack SAC a bit to get access to the actor and the target q-values.
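For context, the Peng's Q(λ) target over one stored episode can be computed with a simple backward recursion; a sketch under assumed shapes/names, where `next_values` would come from the target critic (for SAC, the expected Q under the policy including the entropy term):

```python
import numpy as np

def peng_q_lambda_targets(rewards, next_values, dones, gamma, lam):
    """Backward recursion over one episode:
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})
    lam = 0 gives the usual one-step TD target; lam = 1 collapses to the
    Monte Carlo / n-step return mentioned above."""
    num_steps = len(rewards)
    targets = np.zeros(num_steps)
    g = next_values[-1]  # bootstrap value after the last stored step
    for t in reversed(range(num_steps)):
        if dones[t]:
            targets[t] = rewards[t]  # terminal: no bootstrapping
        else:
            targets[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g)
        g = targets[t]
    return targets
```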
Original repo by @robintyh1: https://github.com/robintyh1/icml2021-pengqlambda
Originally posted by @PartiallyTyped in https://github.com/hill-a/stable-baselines/issues/821:

> N-step returns allow for much better stability and improve performance when training DQN, DDPG, etc., so it will be quite useful to have this feature.
>
> A simple implementation of this would be a wrapper around ReplayBuffer, so it would work with both prioritized and uniform sampling. The wrapper keeps a queue of observed experiences, computes the returns, and adds the experience to the buffer.
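A minimal sketch of that wrapper idea, assuming the wrapped buffer exposes `add(obs, next_obs, action, reward, done)` (signature hypothetical; SB3's varies across versions):

```python
from collections import deque

class NStepWrapper:
    """Sketch of the quoted idea: keep a queue of recent experiences,
    fold them into one n-step transition, and push that into the wrapped
    buffer (works the same for uniform or prioritized sampling)."""

    def __init__(self, buffer, n_steps: int = 3, gamma: float = 0.99):
        self.buffer = buffer
        self.n_steps = n_steps
        self.gamma = gamma
        self.queue = deque()

    def add(self, obs, next_obs, action, reward, done):
        self.queue.append((obs, next_obs, action, reward, done))
        if done:
            # End of episode: flush everything, the tails get shorter returns.
            while self.queue:
                self._emit()
        elif len(self.queue) == self.n_steps:
            self._emit()

    def _emit(self):
        obs, _, action, _, _ = self.queue[0]
        ret = sum((self.gamma ** k) * r for k, (_, _, _, r, _) in enumerate(self.queue))
        _, next_obs, _, _, done = self.queue[-1]
        # Note: the learner should bootstrap with gamma ** len(self.queue),
        # so a full implementation would store that exponent as well.
        self.buffer.add(obs, next_obs, action, ret, done)
        self.queue.popleft()
```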
Roadmap: v1.1+ (see #1)