Closed — frederikschubert closed this issue 3 years ago
Maybe we can use the `ClippedDoubleQLearning` (instead of introducing a "Soft" version that is basically the same) and add a parameter that controls whether the action is sampled from the target policy or its mode is taken?
Wow you're fast!
> Maybe we can use the `ClippedDoubleQLearning` (instead of introducing a "Soft" version that is basically the same) and add a parameter that controls whether the action is sampled from the target policy or its mode is taken?
Yes, that would probably be cleaner.
I personally prefer not to have too many arguments exposed to the user. Perhaps it's worth the duplication. I don't feel strongly one way or the other. Do what you think is best for new users.
Yes, I also went with the duplication because it separates the two methods more cleanly, and there probably won't be many complicated refactorings that can't easily be applied to the duplicated code. I will leave a comment as a pointer, though.
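For illustration, the sample-vs-mode parameter discussed above could look roughly like this. This is a hypothetical sketch in plain NumPy, not the library's actual API; `clipped_double_q_target`, `pi_targ`, `q1_targ`, and `q2_targ` are all made-up names:

```python
import numpy as np

def clipped_double_q_target(r, done, s_next, q1_targ, q2_targ, pi_targ,
                            gamma=0.99, sample_action=True, rng=None):
    """Sketch of a clipped double Q-learning target with a switch for
    how the next action is chosen.

    sample_action=True  -> sample from the target policy (soft/SAC-style)
    sample_action=False -> take the policy's mode (TD3-style, without
                           target-policy smoothing noise)
    """
    rng = rng or np.random.default_rng(0)
    mu, sigma = pi_targ(s_next)  # assume a Gaussian policy head
    if sample_action:
        a_next = mu + sigma * rng.standard_normal(mu.shape)
    else:
        a_next = mu  # mode of the Gaussian
    # clipped double Q-learning: bootstrap from the minimum of both targets
    q_next = np.minimum(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next
```

Exposing this as one boolean keeps the two updaters in a single class, at the cost of one extra argument for the user, which is exactly the trade-off weighed above.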
Since SAC is very similar to TD3, we can re-use most of its components. The differences are:
The current implementation does not support multi-step TD learning.
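The main structural difference in the update itself is the entropy bonus in SAC's bootstrapped value: the next action is sampled from the current policy and its log-probability is subtracted, scaled by the temperature. A minimal one-step sketch (consistent with the no-multi-step note above; `soft_q_target` and its arguments are made-up names, not the library's API):

```python
def soft_q_target(r, done, q_next_min, logp_next, alpha=0.2, gamma=0.99):
    """One-step SAC-style soft target.

    Like the TD3 target r + gamma * min(Q1', Q2'), but with an entropy
    bonus -alpha * log pi(a'|s') added inside the bootstrap, where a' is
    sampled from the current policy rather than being the target
    policy's (smoothed) mode.
    """
    return r + gamma * (1.0 - done) * (q_next_min - alpha * logp_next)
```

Setting `alpha=0.0` recovers the TD3-style target (modulo how `a'` is chosen), which is why so much of the TD3 machinery carries over.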