abalakrishna123 / recovery-rl

Implementation of Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones.
https://sites.google.com/berkeley.edu/recovery-rl/
MIT License

Recovery policy for RL tasks with discrete action space #1

Closed Lplenka closed 2 years ago

Lplenka commented 2 years ago

Hello @abalakrishna123 @bthananjeyan

Thanks for sharing this repo. I read your paper Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones. The idea of using separate recovery and task policies is fascinating.

As far as I understand from the paper, your experiments use environments with continuous action spaces. What are your thoughts on applying this strategy to environments with a discrete action space? Do you think the same approach could be implemented there?

abalakrishna123 commented 2 years ago

Yep, implementing Recovery RL for discrete action spaces should work fine. Since you can use any off-policy RL algorithm for the recovery policy, you can replace the default SAC recovery policy with a version adapted for discrete actions, as shown here: https://github.com/ku2482/sac-discrete.pytorch. You could also use a DQN agent for the recovery policy. The model-based recovery policy can be adapted to discrete actions in a similarly straightforward way: learn a dynamics model over the discrete actions and then use similar shooting-based planning techniques to plan actions.
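Roughly, a DQN-style recovery policy for a discrete action space could look something like the sketch below (this is illustrative only, not code from this repo; `QRiskNetwork`, `select_action`, `eps_risk`, and `gamma_risk` are made-up names, and the Bellman target follows the paper's safety-critic backup `c + (1 - c) * gamma_risk * Q_risk(s', a')`):

```python
# Minimal sketch of a discrete-action safety critic / recovery policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QRiskNetwork(nn.Module):
    """Safety critic: outputs Q_risk(s, a) for every discrete action at once."""
    def __init__(self, obs_dim, num_actions, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, obs):
        return self.net(obs)  # shape: (batch, num_actions)

def select_action(q_risk, obs, task_action, eps_risk):
    """Check the task policy's proposed action against the safety critic;
    fall back to the recovery action (argmin of Q_risk) if it looks too risky."""
    with torch.no_grad():
        q_values = q_risk(obs.unsqueeze(0)).squeeze(0)
    if q_values[task_action] > eps_risk:
        return int(q_values.argmin())  # recovery action: least risky
    return task_action

def q_risk_loss(q_risk, q_risk_target, batch, gamma_risk):
    """Safety-critic backup on (s, a, constraint, s') transitions, where
    `constraint` is 1 if the transition violated a constraint and 0 otherwise."""
    obs, action, constraint, next_obs = batch
    q_sa = q_risk(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap with the least risky next action (the recovery policy's choice).
        next_q = q_risk_target(next_obs).min(dim=1).values
        target = constraint + (1.0 - constraint) * gamma_risk * next_q
    return F.mse_loss(q_sa, target)
```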

Lplenka commented 2 years ago

Thank you for your reply @abalakrishna123. I will try it in an environment with a discrete action space. Sorry to bother you, but I have two more questions:

1) Would the Recovery RL technique also work with on-policy algorithms like PPO?

2) Does the model-based recovery policy learn the environment dynamics gradually over the episodes? Is that understanding correct, or does it require prior information about the environment and agent dynamics?

Thanks in advance. These questions would help me in my research.

abalakrishna123 commented 2 years ago
1. You can use any RL technique for the forward (task) policy, including PPO. However, the recovery policy does need an off-policy RL algorithm, because it is trained on all actions executed in the environment, whether they come from the task policy or the recovery policy.

2. The model-based recovery policy learns an initial estimate of the dynamics from the offline data of unsafe transitions used to initialize the recovery policy, and this estimate is then updated online with all experience collected by the agent. Thus, no explicit prior information about the environment is required to use a model-based recovery policy (a rough sketch of the discrete-action case is below).
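To make point 2 concrete for discrete actions, a random-shooting recovery planner over a learned dynamics model might look roughly like this (an illustrative sketch, not code from this repo; `dynamics_model` is assumed to be a learned network pretrained on the offline unsafe transitions and updated online, and `q_risk` is a discrete-action safety critic like the one sketched above):

```python
# Minimal sketch of shooting-based recovery planning over discrete actions.
import torch

def plan_recovery_action(dynamics_model, q_risk, obs, num_actions,
                         horizon=5, num_samples=100):
    """Sample discrete action sequences, roll them out through the learned
    dynamics model, and return the first action of the rollout whose final
    state has the lowest estimated risk."""
    # Candidate action sequences: (num_samples, horizon)
    candidates = torch.randint(num_actions, (num_samples, horizon))
    states = obs.unsqueeze(0).repeat(num_samples, 1)
    for t in range(horizon):
        # dynamics_model is assumed to predict the next state from (state, action).
        states = dynamics_model(states, candidates[:, t])
    with torch.no_grad():
        # Score each rollout by the lowest achievable risk in its final state.
        final_risk = q_risk(states).min(dim=1).values
    best = int(final_risk.argmin())
    return int(candidates[best, 0])
```

Scoring rollouts by the terminal risk estimate is just one choice here; accumulating the per-step risk along the rollout is another reasonable option.
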
Lplenka commented 2 years ago

Thanks for the explanation. I will try to implement this.