coax-dev / coax

Modular framework for Reinforcement Learning in python
https://coax.readthedocs.io
MIT License
168 stars 17 forks

Implementation of SAC #6

Closed: frederikschubert closed this issue 3 years ago

frederikschubert commented 3 years ago

Since SAC is really similar to TD3, we can reuse most of its components. The differences are:

The current implementation does not support multi-step TD learning.
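To make the relationship concrete, here is a minimal sketch (not coax's actual API; function names and defaults are illustrative) of the one-step bootstrapped targets the two algorithms use. Both clip with the minimum over two target Q-values; SAC additionally subtracts the entropy term `alpha * log pi(a'|s')` for the sampled next action. Only the one-step form is shown, matching the note above that multi-step TD learning is not supported here.

```python
def td3_target(r, done, q1_next, q2_next, gamma=0.99):
    """TD3: clipped double-Q target; next action is the (deterministic) policy output."""
    return r + gamma * (1.0 - done) * min(q1_next, q2_next)


def sac_target(r, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """SAC: same clipped target, plus an entropy bonus on the *sampled* next action.

    logp_next is log pi(a'|s') for the sampled action a'; alpha is the
    (assumed fixed) temperature coefficient.
    """
    soft_q = min(q1_next, q2_next) - alpha * logp_next
    return r + gamma * (1.0 - done) * soft_q
```

Since `logp_next` is typically negative for a stochastic policy, the soft target sits slightly above the plain clipped target, which is exactly the entropy bonus SAC adds on top of the TD3 machinery.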

frederikschubert commented 3 years ago

Maybe we can use ClippedDoubleQLearning (instead of introducing a "Soft" version that is basically the same) and add a parameter that controls whether the action should be sampled from the target policy or whether its mode should be taken?
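A rough sketch of what that single parameter could look like. Everything here is hypothetical, not coax's actual API: `next_action`, `sample_from_policy`, and the toy `GaussianPolicy` with `mode`/`sample` methods are made-up names used only to illustrate the proposal.

```python
import random


def next_action(target_policy, s_next, sample_from_policy):
    """Pick the next action for the clipped double-Q bootstrap.

    sample_from_policy=True  -> SAC-style: sample from the target policy.
    sample_from_policy=False -> TD3-style: take the mode (deterministic).
    """
    if sample_from_policy:
        return target_policy.sample(s_next)
    return target_policy.mode(s_next)


class GaussianPolicy:
    """Toy stand-in for a target policy with a fixed mean and noise scale."""

    def __init__(self, mean, scale):
        self.mean, self.scale = mean, scale

    def mode(self, s):
        return self.mean  # deterministic action

    def sample(self, s):
        return random.gauss(self.mean, self.scale)  # stochastic action
```

With this shape, one updater class covers both cases, at the cost of one extra constructor argument; the alternative discussed below is to duplicate the class and keep each algorithm's code path separate.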

KristianHolsheimer commented 3 years ago

Wow you're fast!

> Maybe we can use ClippedDoubleQLearning (instead of introducing a "Soft" version that is basically the same) and add a parameter that controls whether the action should be sampled from the target policy or whether its mode should be taken?

Yes, that would probably be cleaner.

frederikschubert commented 3 years ago

> I personally prefer not to have too many arguments exposed to the user. Perhaps it's worth the duplication. I don't feel strongly one way or the other. Do what you think is best for new users.

Yes, I also went with the duplication because it better separates both methods conceptually and there probably won't be many complicated refactorings that can't be applied easily to the duplicated code. I will leave a comment though as a pointer.