DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Can DDPG make use of differentiable reward function? #1520

Closed · hhroberthdaniel closed this 1 year ago

hhroberthdaniel commented 1 year ago

❓ Question

For a problem where the reward is differentiable, Policy Gradients can make use of this to further optimize the model. Will SB3 allow this differentiation up to the reward function, particularly for the DDPG algorithm?


araffin commented 1 year ago

> where the reward is differentiable, Policy Gradients can make use of this to further optimize the model.

If the reward is differentiable, maybe RL is not a good fit as you have a supervised learning signal (much stronger and less noisy than the classic reward signal).
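To make that point concrete, here is a rough sketch (not SB3 code; `reward_fn`, `policy` and the dimensions are placeholders) of training a policy by direct gradient ascent on a differentiable reward:

```python
# Rough sketch, not SB3 code: if the reward is differentiable w.r.t. the
# action, the policy can be trained by direct gradient ascent on the reward,
# i.e. a supervised-style objective instead of a noisy RL return estimate.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # placeholder dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_fn(obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    # Placeholder differentiable reward, replace with the real one
    return -(action - obs[..., :act_dim]).pow(2).sum(dim=-1)

for _ in range(1_000):
    obs = torch.randn(256, obs_dim)        # batch of observations
    action = policy(obs)                   # actions stay in the autograd graph
    loss = -reward_fn(obs, action).mean()  # maximize reward = minimize -reward
    optimizer.zero_grad()
    loss.backward()                        # gradient flows through reward_fn
    optimizer.step()
```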

I guess you might be referring to https://github.com/erwincoumans/tiny-differentiable-simulator or https://github.com/google/brax

For analytical policy gradient, there is an implementation here: https://github.com/google/brax/tree/main/brax/training/agents/apg
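Roughly, the idea behind analytical policy gradient is to unroll a short horizon and backprop the summed reward through the trajectory. A minimal sketch of that idea (not the Brax implementation; it assumes a differentiable `step_fn`, i.e. differentiable dynamics):

```python
# Sketch of the analytical-policy-gradient idea (not the Brax implementation):
# unroll a short horizon through a *differentiable* step function and backprop
# the summed reward through states, actions and policy parameters.
import torch

def apg_loss(policy, step_fn, reward_fn, obs0: torch.Tensor, horizon: int = 20):
    obs, total = obs0, torch.zeros(obs0.shape[0])
    for _ in range(horizon):
        action = policy(obs)
        next_obs = step_fn(obs, action)           # requires differentiable dynamics
        total = total + reward_fn(next_obs, action)
        obs = next_obs
    return -total.mean()                          # minimize negative return
```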

> Will SB3 allow this differentiation up to the reward function, particularly for the DDPG algorithm?

SB3 won't allow you to do that out of the box; you would need to fork SB3 and replace the part that interacts with the environment.
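To give an idea of where such a change would go: in SB3 the DDPG/TD3 actor loss is computed inside the algorithm's `train()` method, roughly as `-critic.q1_forward(obs, actor(obs)).mean()`. A fork could add a term there that pushes the actions directly up the reward gradient. A minimal sketch of such a combined objective (`reward_fn` and `reward_coef` are hypothetical additions, not SB3 API):

```python
import torch

def combined_actor_loss(actor, q1_forward, reward_fn, observations, reward_coef=1.0):
    # Hypothetical actor objective for a forked DDPG: the usual deterministic
    # policy-gradient term -Q(s, pi(s)) plus a direct differentiable-reward term.
    actions = actor(observations)                       # keep actions in the graph
    q_term = q1_forward(observations, actions).mean()   # standard DDPG actor objective
    r_term = reward_fn(observations, actions).mean()    # extra supervised-like signal
    return -(q_term + reward_coef * r_term)
```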

hhroberthdaniel commented 1 year ago

Thanks for the quick response. I cannot use supervised learning because I don't have the labels. Also, the environment is difficult to solve in "one action", so I need something that can "build" the solution.

I am not referring to the simulator you mentioned. The environment is not differentiable, only the reward function is.

If you have other suggestions, I would very much appreciate them.

araffin commented 1 year ago

The analytical policy gradient link above is probably what you are looking for. I also remember that some hard-attention papers (for vision tasks) made use of such gradients too.

araffin commented 1 year ago

As an alternative solution, you might have a look at the Dreamer algorithm (model-based).