[Closed] MieszkoFerens closed this issue 4 years ago
Although the SARSA algorithm is implemented in the current version of ChainerRL, it is not mentioned on the GitHub page.
Right, thanks for pointing it out. I think it might be covered by "DQN (including DoubleDQN etc.)", but I admit it is confusing.
Looking at the code, I don't see why this would be considered off-policy.
It is off-policy in the sense that it learns the Q-function of the current behavior policy, defined by the current approximate Q-function and an explorer, from data collected by past behavior policies.
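To make the distinction concrete, here is a minimal numerical sketch (illustrative only, not ChainerRL's actual code; all names are made up). The canonical on-policy SARSA target uses the next action that was actually taken and stored with the transition, while the replay-based variant described above re-samples the next action from the *current* epsilon-greedy behavior policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
Q = rng.normal(size=(n_states, n_actions))  # current approximate Q-function
gamma, epsilon = 0.99, 0.1

def behavior_action(s):
    """Epsilon-greedy action under the current Q (the current behavior policy)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

# A transition stored in a replay buffer by some *past* behavior policy:
# (state, action, reward, next state, next action taken at the time).
s, a, r, s_next, a_next_stored = 0, 1, 1.0, 2, 0

# Canonical on-policy SARSA: target uses the stored next action, i.e. the
# action chosen by whatever policy generated the data.
target_on_policy = r + gamma * Q[s_next, a_next_stored]

# Replay-based variant: target uses an action drawn from the *current*
# behavior policy, so it estimates the Q-function of the current policy
# even though the transition came from an older one.
target_current_policy = r + gamma * Q[s_next, behavior_action(s_next)]
```

When the data comes fresh from the current policy the two targets coincide in distribution; with a replay buffer they generally differ, which is why the variant counts as off-policy.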
Thanks for your answer.
So, does this mean that the "SARSA" implementation available in ChainerRL differs from the canonical SARSA algorithm given, for instance, in the RL book by Sutton and Barto, where SARSA is defined as an on-policy method?
Correct. It can be considered a sample-based approximation of Expected SARSA.
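The "sample-based approximation of Expected SARSA" claim can be checked numerically. The sketch below (illustrative names and values, not library code) computes the Expected SARSA target as a full expectation over an epsilon-greedy policy's action distribution, then shows that averaging many single-sample SARSA targets, with the next action re-drawn from that same policy, converges to it:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 3
q_next = np.array([1.0, 0.5, -0.2])  # Q(s', .) for some next state s'
gamma, epsilon, r = 0.99, 0.1, 0.0

# Epsilon-greedy action probabilities under the current Q at s'.
probs = np.full(n_actions, epsilon / n_actions)
probs[np.argmax(q_next)] += 1.0 - epsilon

# Expected SARSA target: full expectation over the behavior policy.
expected_target = r + gamma * float(probs @ q_next)

# Sample-based variant: draw a single next action a' ~ policy and use
# Q(s', a') in the target; average many such targets to see the mean.
samples = rng.choice(n_actions, size=100_000, p=probs)
sampled_target = r + gamma * float(q_next[samples].mean())
```

Each single-sample target is an unbiased estimate of the Expected SARSA target, so the average agrees with it up to Monte Carlo error.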
Additionally, in the API, the brief description of this algorithm seems to indicate that it is on-policy SARSA, not off-policy, as stated there: "This agent learns the Q-function of a behavior policy defined via the given explorer, instead of learning the Q-function of the optimal policy."
Is there a reason why this SARSA is off-policy, or is it a mistake?