Open mattsqerror opened 8 years ago
It might be more efficient and easier in some ways to store memory in separate state, action, reward vectors.
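For concreteness, here is a minimal sketch of the two layouts being discussed; the names are illustrative and not the actual `x/memory.py` API:

```python
import numpy as np

# Layout 1: one memory, a single list of (state, action, reward, next_state)
# tuples -- roughly what the repo does now.
memory_tuples = []
memory_tuples.append((np.zeros(4), 1, 0.5, np.ones(4)))

# Layout 2: separate parallel vectors; index i refers to the same transition
# in every list, so "syncing" just means appending to all of them together.
states, actions, rewards, next_states = [], [], [], []
states.append(np.zeros(4))
actions.append(1)
rewards.append(0.5)
next_states.append(np.ones(4))

# The separate-vector layout makes vectorized batch sampling cheap:
idx = np.random.randint(0, len(states), size=1)
batch_states = np.asarray(states)[idx]    # shape (1, 4)
batch_rewards = np.asarray(rewards)[idx]  # shape (1,)
```

The trade-off is exactly the sync question raised below: every append and eviction has to touch all the vectors at once, whereas a tuple list keeps each transition atomic.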
What do you mean by that? There is one memory that is a list of state-action-reward information. Why would it be better to split that up, and how would we keep the pieces in sync?
"Targets" should be computed by the models themselves, perhaps calling q-learning / sarsa classes to help compute.
Aren't they already? https://github.com/EderSantana/X/blob/master/x/memory.py#L120-L121 Also, we can always use the memory callback for a functional modification on the fly: https://github.com/EderSantana/X/blob/master/x/memory.py#L118
To implement SARSA with experience replay:
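The comment above is cut off in the thread, but the core idea can be sketched: for SARSA with experience replay, the memory must also store the next action a' actually taken at s', and the target uses Q(s', a') rather than Q-learning's max over actions. A hedged sketch with hypothetical names:

```python
import numpy as np

def sarsa_targets(gamma, rewards, next_q_values, next_actions, terminals):
    # SARSA target: r + gamma * Q(s', a'), where a' is the action that was
    # actually taken at s' (stored in the replay memory), zeroed at terminals.
    q_next = next_q_values[np.arange(len(next_actions)), next_actions]
    return rewards + gamma * q_next * (1.0 - terminals)

# Two replayed transitions (s, a, r, s', a'):
rewards = np.array([1.0, 0.5])
next_q = np.array([[0.2, 0.8], [0.4, 0.1]])  # model predictions at s'
next_actions = np.array([1, 0])              # a' stored alongside s'
terminals = np.array([0.0, 0.0])
targets = sarsa_targets(0.9, rewards, next_q, next_actions, terminals)
# targets = [1.0 + 0.9*0.8, 0.5 + 0.9*0.4] = [1.72, 0.86]
```

Note that replaying SARSA transitions makes the update off-policy with respect to the current policy, since the stored a' came from an older behavior policy; that caveat applies to any SARSA-with-replay scheme.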