Closed DeepRowLie closed 5 months ago
Hello, SAC with RNN and PPO with RNN will be quite different because PPO is on-policy (so the data collected is discarded after one update).
Why should we use the outdated `lstm_states` to reconstruct LSTM states instead of just initializing them, when the sequence doesn't start from the beginning? When applying an RNN to SAC's replay buffer, how should the LSTM states be reconstructed?
This is mostly to have a better initialization for the LSTM states than constant or random values. An alternative that is especially relevant for off-policy algorithms is to use warmup steps (see the R2D2 paper) to initialize the LSTM states before doing any gradient update. It requires, however, storing more data, and assumes that the episode is long enough to perform those steps.
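To make the two options concrete, here is a minimal numpy sketch. It uses a plain tanh recurrent cell in place of a real LSTM for brevity, and all names (`rnn_step`, `stored_states`, `warmup`) are made up for illustration; this is not SB3-contrib code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrent cell standing in for an LSTM (hypothetical, for illustration).
W_h = rng.normal(size=(4, 4)) * 0.1
W_x = rng.normal(size=(4, 2)) * 0.1

def rnn_step(h, x):
    return np.tanh(W_h @ h + W_x @ x)

# One "episode" of observations, plus the hidden state stored alongside
# each transition at collection time (as RecurrentPPO does in its buffer).
episode_obs = rng.normal(size=(10, 2))
stored_states = []
h = np.zeros(4)
for obs in episode_obs:
    stored_states.append(h.copy())  # state *before* processing obs
    h = rnn_step(h, obs)

# Suppose training samples a sequence starting mid-episode, at step t = 6.
t = 6

# Option A: reuse the (possibly stale) stored state -- cheap, no extra data,
# but it was computed with the network weights at collection time.
h_stored = stored_states[t]

# Option B (R2D2-style warmup / burn-in): replay a few preceding steps
# through the *current* network to rebuild a fresher state. This needs the
# warmup observations to be stored and the episode to be long enough.
warmup = 3
h_burnin = stored_states[t - warmup]
for obs in episode_obs[t - warmup : t]:
    h_burnin = rnn_step(h_burnin, obs)

# While the weights are unchanged the two options agree exactly; after
# gradient updates the stored state goes stale and burn-in diverges from it.
assert np.allclose(h_burnin, stored_states[t])
```

The trade-off is exactly the one described above: the stored state is a better starting point than zeros or noise, while burn-in is fresher but costs extra storage and compute.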
See https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/201 for some pointers for SAC.
Thanks!
❓ Question
Hi all! I hope to integrate an RNN (LSTM/GRU) into off-policy algorithms (SAC and TD3), without multiprocessing like A3C. So I checked the SB3-contrib code for RecurrentPPO and the RecurrentPPO documentation you recommended. In SB3-contrib, RecurrentPPO puts `lstm_states` into the rollout buffer when collecting transitions. During training, sequences that do not start from the beginning of an episode use the out-of-date `lstm_states` to reconstruct the LSTM states. I'm confused about this. https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/9f333ffc34280fd16f438ff303f7b3f7792b0068/sb3_contrib/ppo_recurrent/ppo_recurrent.py#L351C1-L356C18 https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/9f333ffc34280fd16f438ff303f7b3f7792b0068/sb3_contrib/common/recurrent/policies.py#L198C9-L207C36
Here are my questions: