Stable-Baselines-Team / stable-baselines3-contrib

Contrib package for Stable-Baselines3 - Experimental reinforcement learning (RL) code
https://sb3-contrib.readthedocs.io
MIT License

[Feature Request] Implement Recurrent SAC #201

Open masterdezign opened 1 year ago

masterdezign commented 1 year ago

🚀 Feature

Hi!

I would like to implement a recurrent soft actor-critic. Is it a sensible contribution?

Motivation

I actually need this algorithm in my projects.

Pitch

The SB3 ecosystem would benefit from yet another algorithm. As a new contributor, though, I might need a little guidance.

Alternatives

An alternative would be another off-policy algorithm using LSTM.

Additional context

No response

araffin commented 1 year ago

Hello, this would definitely be a good addition to SB3 contrib.

Make sure to read the contributing guide carefully. You might want to have a look at the R2D2 paper (https://paperswithcode.com/method/r2d2) and at https://github.com/zhihanyang2022/off-policy-continuous-control.

For benchmarking, the best would be to use the "NoVel" envs that are available in the RL Zoo (see https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity-SB3-Contrib---VmlldzoxOTI4NjE4).
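
For reference, these NoVel variants just hide the velocity components of the observation, making the task partially observable. A minimal sketch of the idea (assuming gymnasium; the wrapper name and masked index here are illustrative, not the Zoo's actual implementation):

import numpy as np
import gymnasium as gym


class MaskVelocityWrapper(gym.ObservationWrapper):
    # Zero out the velocity entries of the observation so the agent
    # must rely on memory to infer them.
    def __init__(self, env, velocity_indices):
        super().__init__(env)
        self.velocity_indices = np.array(velocity_indices)

    def observation(self, obs):
        obs = np.array(obs, dtype=np.float32)
        obs[self.velocity_indices] = 0.0
        return obs


# Pendulum-v1 observations are [cos(theta), sin(theta), theta_dot];
# masking index 2 hides the angular velocity.
env = MaskVelocityWrapper(gym.make("Pendulum-v1"), velocity_indices=[2])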

masterdezign commented 1 year ago

Thanks for the references. I will check them out and come back.

masterdezign commented 12 months ago

Just a quick update: I plan to do this by the end of 2023, when I have some free time. Currently I have three higher-priority projects.

masterdezign commented 9 months ago

Status update:

  1. I've checked the resources that you provided, thanks a lot. I find the code to be nicely written and quite easy to understand.
  2. I managed to solve PendulumNoVel-v1 from rl_zoo3==2.1.0 with RSAC.
  3. However, I have trouble solving MountainCarContinuousNoVel-v0 and LunarLanderContinuousNoVel-v2 using the code above with different configurations.
  4. Therefore, I may need to resort to modifying the algorithm (e.g., using the same LSTM state for the actor and critics, using overlapping segments, etc.).
  5. EDIT: I've checked your benchmarks and realized that LunarLander may require more timesteps (it takes up to 5M for PPO LSTM).

masterdezign commented 9 months ago

Comparison

I've got these results on LunarLanderContinuousNoVel-v2 (rl_zoo3==2.1.0) using RSAC with a shared LSTM state (rsac_s) and plain RSAC. In both cases, the configuration was the same:

# ====================================================================================
# gin macros
# ====================================================================================

capacity = 1000
batch_size = 10
segment_len = 50

num_epochs = 500
num_steps_per_epoch = 10000
update_after = 10000
num_test_episodes_per_epoch = 10

# ====================================================================================
# applying the parameters
# ====================================================================================

import basics.replay_buffer_recurrent
import basics.run_fns

basics.replay_buffer_recurrent.RecurrentReplayBuffer.capacity = %capacity
basics.replay_buffer_recurrent.RecurrentReplayBuffer.batch_size = %batch_size
basics.replay_buffer_recurrent.RecurrentReplayBuffer.segment_len = %segment_len

basics.run_fns.train.num_epochs = %num_epochs
basics.run_fns.train.num_steps_per_epoch = %num_steps_per_epoch
basics.run_fns.train.num_test_episodes_per_epoch = %num_test_episodes_per_epoch
basics.run_fns.train.update_after = %update_after

Each run took about 20 hours to compute. Perhaps this rsac_s architecture can now be implemented in sb3-contrib.

[Attached figure: rsac_s-241]

araffin commented 8 months ago

Hello, thanks for reporting the updated results =). Do you have a diagram to share for RSAC vs RSAC_s maybe? (that would make things easier to discuss)

Did you also manage to solve the mountain car problem?

masterdezign commented 8 months ago

Did you also manage to solve the mountain car problem?

I believe so. Let me render the env to verify, since the rewards are not the same for MountainCarContinuousNoVel-v0 (continuous action space) and MountainCar-v0 (discrete action space).

masterdezign commented 8 months ago

Loosely speaking, here they are:


           RSAC                        RSAC_S

     ┌─────┐    ┌─────┐               ┌─────┐
     │ RNN │    │ RNN │             ┌─┤ RNN │..
     └──┬──┘    └──┬──┘             │ └─────┘ .
        │          │                │         .
        │          │                │         .
    ┌───┴───┐  ┌───┴────┐       ┌───┴───┐  ┌────────┐
    │ Actor │  │ Critic │       │ Actor │  │ Critic │
    └───────┘  └────────┘       └───────┘  └────────┘

As you can see, RSAC_S shares the RNN state between the actor and the critic, but only the actor can update the RNN state, whereas in RSAC the actor and critics each have their own RNN states.
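
In PyTorch terms, the RSAC_S layout could look roughly like this (a minimal sketch of how I read the diagram; the class and method names are mine, not from any actual implementation):

import torch
import torch.nn as nn


class SharedLSTMActorCritic(nn.Module):
    # One LSTM feeds both heads; only the actor's pass advances the
    # recurrent state, the critic just reads the same features.
    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.actor = nn.Linear(hidden_dim, 2 * action_dim)   # mean and log_std
        self.critic = nn.Linear(hidden_dim + action_dim, 1)  # Q-value head

    def forward(self, obs_seq, actions, hidden):
        # Single LSTM pass; new_hidden is the state carried between steps.
        features, new_hidden = self.lstm(obs_seq, hidden)
        mean_and_log_std = self.actor(features)
        # The critic consumes the same recurrent features but never
        # updates the hidden state itself.
        q_value = self.critic(torch.cat([features, actions], dim=-1))
        return mean_and_log_std, q_value, new_hidden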

araffin commented 8 months ago

As you can see, RSAC_S shares the RNN state between the actor and the critic, but only the actor can update the RNN state, whereas in RSAC the actor and critics each have their own RNN states.

Thanks, this is similar to what is implemented for RecurrentPPO: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/588c6bdaeaa118a075162eddcd77c753d880bee2/sb3_contrib/common/recurrent/policies.py#L238-L247

masterdezign commented 8 months ago

Update: I just rendered MountainCarContinuousNoVel-v0 and it is not solved yet. I don't quite understand why the total reward differs between the original MountainCar-v0 env and this one. Therefore, I need to check MountainCarContinuousNoVel-v0 (and MountainCarContinuous-v0) in detail.

araffin commented 8 months ago

I can help you with that: the continuous version has a deceptive reward and needs quite a lot of exploration noise.

EDIT: working hyperparameters: https://github.com/DLR-RM/rl-baselines3-zoo/blob/8cecab429726d7e6aaebd261d26ed8fc23b7d948/hyperparams/sac.yml#L2 or https://github.com/DLR-RM/rl-baselines3-zoo/blob/8cecab429726d7e6aaebd261d26ed8fc23b7d948/hyperparams/td3.yml#L5-L6

(note: the gSDE exploration is important there; otherwise, a high OU noise would work too)
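
For a quick check with SB3 itself, the switch is just use_sde=True; a minimal example (the fully tuned hyperparameters live in the Zoo's sac.yml linked above, this only shows where the flag goes):

from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "MountainCarContinuous-v0",
    use_sde=True,        # state-dependent exploration (gSDE)
    sde_sample_freq=-1,  # resample the exploration noise only at rollout start
    verbose=1,
)
model.learn(total_timesteps=50_000)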

masterdezign commented 8 months ago

Thanks, I'll check those hyperparameters.

masterdezign commented 8 months ago

Indeed, setting use_sde=True seems to help solve the MountainCarContinuous-v0 environment. I am curious which gSDE ingredient exactly helps.

Edit: I also tried nearby hyperparameters, and indeed the gSDE contribution seems to be non-negligible.

araffin commented 8 months ago

I am curious which gSDE ingredient exactly helps.

The consistent exploration. To solve this task, you need to build up momentum; a bang-bang-like strategy is one way (this is discussed in a bit more detail in the first version of the paper: https://arxiv.org/pdf/2005.05719v1.pdf).

Edit: I also tried nearby hyperparameters, and indeed the gSDE contribution seems to be non-negligible.

I did a full hyperparameter search, and with gSDE many configurations work (more than half of those tested): https://github.com/DLR-RM/rl-baselines3-zoo/blob/sde/logs/report_sde_MountainCarContinuous-v0_500-trials-50000-tpe-median_1581693633.csv

masterdezign commented 8 months ago

I am currently checking the two strategies for RNN state initialization proposed in the R2D2 paper (stored state and burn-in).
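
In case it helps the discussion, here is my reading of the two strategies as a small PyTorch sketch (function name and shapes are illustrative):

import torch
import torch.nn as nn


def burn_in_forward(lstm, stored_hidden, burn_in_obs, train_obs):
    # Stored state: start from the hidden state saved when the transition
    # was collected. Burn-in: unroll a short prefix without gradients to
    # refresh that state before the training segment.
    with torch.no_grad():
        _, warmed_hidden = lstm(burn_in_obs, stored_hidden)
    # Gradients only flow through the training portion of the sequence.
    features, final_hidden = lstm(train_obs, warmed_hidden)
    return features, final_hidden


# Illustrative usage: batch of 10, obs dim 3, hidden size 32.
lstm = nn.LSTM(input_size=3, hidden_size=32, batch_first=True)
stored = (torch.zeros(1, 10, 32), torch.zeros(1, 10, 32))
burn_in = torch.randn(10, 5, 3)    # 5 burn-in steps
segment = torch.randn(10, 20, 3)   # 20 training steps
features, _ = burn_in_forward(lstm, stored, burn_in, segment)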

masterdezign commented 8 months ago

So far I've got this: a recurrent replay buffer with overlapping chunks that supports the SB3 interface. I also wrote a specification (test) to reduce future surprises.

https://gist.github.com/masterdezign/47b3c6172dd1624bb9a7ef23cbc79c8c

The current limitation is n_envs = 1; this can be resolved in the future.
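
For context, the overlapping-chunk idea boils down to slicing each episode into fixed-length segments that share a few steps (a standalone sketch, not the gist's actual API):

import numpy as np


def overlapping_segments(episode_obs, segment_len, overlap):
    # Slice one episode into fixed-length chunks where consecutive
    # chunks share `overlap` steps.
    stride = segment_len - overlap
    segments = []
    for start in range(0, len(episode_obs) - segment_len + 1, stride):
        segments.append(episode_obs[start:start + segment_len])
    return segments


# Example: a 120-step episode, 50-step segments overlapping by 25 steps.
episode = np.arange(120).reshape(-1, 1).astype(np.float32)
chunks = overlapping_segments(episode, segment_len=50, overlap=25)
print(len(chunks), chunks[0].shape)  # 3 chunks of shape (50, 1)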

masterdezign commented 5 months ago

Hi! I didn't obtain good results, and then I had to put the project on hold. I plan to resume working on it starting tomorrow.