Closed: hhroberthdaniel closed this issue 1 year ago
> where the reward is differentiable, Policy Gradients can make use of this to further optimize the model.
If the reward is differentiable, RL may not be the best fit: you effectively have a supervised learning signal (much stronger and less noisy than the classic scalar reward signal).
I guess you might be referring to https://github.com/erwincoumans/tiny-differentiable-simulator or https://github.com/google/brax
For analytical policy gradient, there is an implementation here: https://github.com/google/brax/tree/main/brax/training/agents/apg
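Roughly, analytical policy gradient treats the return as a deterministic, differentiable function of the policy parameters and backpropagates through the whole rollout. Here is a toy PyTorch sketch of that idea (the linear dynamics and quadratic reward are made up purely for illustration; the Brax implementation linked above is JAX-based and far more involved):

```python
import torch

torch.manual_seed(0)

# Toy differentiable dynamics, assumed for this sketch only: s' = s + a
def dynamics(state, action):
    return state + action

# Differentiable reward: keep the state near the origin, penalize large actions
def reward(state, action):
    return -(state ** 2).sum() - 0.01 * (action ** 2).sum()

policy = torch.nn.Linear(2, 2)  # deterministic policy a = W s + b
opt = torch.optim.Adam(policy.parameters(), lr=0.05)

def rollout_return(horizon=5):
    state = torch.ones(2)
    ret = torch.zeros(())
    for _ in range(horizon):
        action = policy(state)
        ret = ret + reward(state, action)
        state = dynamics(state, action)  # gradients flow through the dynamics
    return ret

initial_return = rollout_return().item()
for _ in range(300):
    opt.zero_grad()
    loss = -rollout_return()  # gradient ascent on the analytic return
    loss.backward()
    opt.step()
final_return = rollout_return().item()
```

Note that this requires the dynamics to be differentiable as well, which is exactly what simulators like Brax or tiny-differentiable-simulator provide.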
> Will SB3 allow this differentiation up to the reward function, particularly for the DDPG algorithm?
SB3 won't allow you to do that out of the box, you would need to fork SB3 and replace the part that interacts with the environment.
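For concreteness, the change in a fork would amount to something like the following DDPG-style actor update, where the learned critic Q(s, a) is swapped for the known differentiable reward r(s, a). This is a minimal standalone sketch, not SB3 code; the reward function and all dimensions are hypothetical:

```python
import torch

torch.manual_seed(0)

# Hypothetical differentiable reward: the "good" action depends on the state
def reward_fn(states, actions):
    return -((actions - 2.0 * states) ** 2).sum(dim=-1)

actor = torch.nn.Linear(3, 3)   # deterministic policy mu(s), as in DDPG
opt = torch.optim.Adam(actor.parameters(), lr=1e-2)

states = torch.randn(64, 3)     # stand-in for a replay-buffer batch

initial_loss = -reward_fn(states, actor(states)).mean().item()
for _ in range(500):
    opt.zero_grad()
    actions = actor(states)
    # Backprop through r(s, a) w.r.t. the action, exactly where DDPG
    # would normally backprop through the critic's Q(s, a)
    actor_loss = -reward_fn(states, actions).mean()
    actor_loss.backward()
    opt.step()
final_loss = -reward_fn(states, actor(states)).mean().item()
```

Note that this particular sketch is myopic: it optimizes only the immediate reward. For multi-step credit assignment (the "building a solution" case) you would still combine the reward gradient with a learned critic or an unrolled model.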
Thanks for the quick response. I cannot use supervised learning because I don't have labels. Also, the environment is difficult to solve in "one action", so I need something that can "build" the solution.
I am not referring to the simulators you mentioned. The environment is not differentiable, only the reward function is.
If you have other suggestions, I would very much appreciate them.
The analytical policy gradient link is what you are looking for, I guess. I also remember that some hard-attention papers (for vision tasks) used reward gradients too.
As an alternative solution, you might have a look at the Dreamer algorithm (model-based).
❓ Question
For a problem where the reward is differentiable, Policy Gradients can make use of this to further optimize the model. Will SB3 allow this differentiation up to the reward function, particularly for the DDPG algorithm?