Closed — frederikschubert closed this issue 3 years ago
Maybe we can use the `ClippedDoubleQLearning` (instead of introducing a "Soft" version that is basically the same) and add a parameter that controls whether the action is sampled from the target policy or its mode is taken?
Wow you're fast!
> Maybe we can use the `ClippedDoubleQLearning` (instead of introducing a "Soft" version that is basically the same) and add a parameter that controls whether the action is sampled from the target policy or its mode is taken?
Yes, that would probably be cleaner.
I personally prefer not to have too many arguments exposed to the user. Perhaps it's worth the duplication. I don't feel strongly one way or the other. Do what you think is best for new users.
Yes, I also went with the duplication because it separates the two methods more cleanly, and there probably won't be many complicated refactorings that can't easily be applied to the duplicated code. I will leave a comment as a pointer, though.
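For illustration, the sample-vs-mode parameter discussed above could look roughly like this. This is a hypothetical sketch in plain NumPy, not the library's actual API; `clipped_double_q_target`, `pi_targ`, `q1_targ`, and `q2_targ` are all made-up names:

```python
import numpy as np

def clipped_double_q_target(r, done, s_next, q1_targ, q2_targ, pi_targ,
                            gamma=0.99, sample_action=True, rng=None):
    """Sketch of a clipped double Q-learning target with a switch for
    how the next action is chosen.

    sample_action=True  -> sample from the target policy (soft/SAC-style)
    sample_action=False -> take the policy's mode (TD3-style, without
                           target-policy smoothing noise)
    """
    rng = rng or np.random.default_rng(0)
    mu, sigma = pi_targ(s_next)  # assume a Gaussian policy head
    if sample_action:
        a_next = mu + sigma * rng.standard_normal(mu.shape)
    else:
        a_next = mu  # mode of the Gaussian
    # clipped double Q-learning: bootstrap from the minimum of both targets
    q_next = np.minimum(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next
```

Exposing this as one boolean keeps the two updaters in a single class, at the cost of one extra argument for the user, which is exactly the trade-off weighed above.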
Since SAC is very similar to TD3, we can re-use most of its components. The differences are:
The current implementation does not support multi-step TD learning.
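The main structural difference in the update itself is the entropy bonus in SAC's bootstrapped value: the next action is sampled from the current policy and its log-probability is subtracted, scaled by the temperature. A minimal one-step sketch (consistent with the no-multi-step note above; `soft_q_target` and its arguments are made-up names, not the library's API):

```python
def soft_q_target(r, done, q_next_min, logp_next, alpha=0.2, gamma=0.99):
    """One-step SAC-style soft target.

    Like the TD3 target r + gamma * min(Q1', Q2'), but with an entropy
    bonus -alpha * log pi(a'|s') added inside the bootstrap, where a' is
    sampled from the current policy rather than being the target
    policy's (smoothed) mode.
    """
    return r + gamma * (1.0 - done) * (q_next_min - alpha * logp_next)
```

Setting `alpha=0.0` recovers the TD3-style target (modulo how `a'` is chosen), which is why so much of the TD3 machinery carries over.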