As can be seen from the paths shown in the .ipynb file, with SARSA the agent learns a path to the goal that takes a detour and avoids the traps as much as possible.

Recalling the update formula of the SARSA algorithm, the agent acquires this safe strategy because the actions it actually takes, including exploratory ε-greedy moves, feed into the value update.

On the other hand, Q-learning updates the value with the best next action, so it learns the shortest path even though that path carries a high risk of falling into a trap during exploration.

In conclusion, the value function converges more slowly with SARSA than with Q-learning because of the randomness introduced by the ε-greedy method, but SARSA is worth considering in situations where the cost of trial and error is high.
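For reference, the two update rules differ only in the action used to evaluate the next state. SARSA bootstraps from the action the agent actually takes next, while Q-learning bootstraps from the greedy action (standard notation: $\alpha$ is the learning rate, $\gamma$ the discount factor):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad \text{(SARSA)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad \text{(Q-learning)}$$

Because $a_{t+1}$ in the SARSA target is sampled ε-greedily and can be an exploratory step into a trap, states next to traps receive lower values and the learned path detours around them; the $\max$ in the Q-learning target ignores that exploration risk.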
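If it helps to see the difference in code, here is a minimal tabular sketch (not the notebook's implementation; the helper names are hypothetical, and `Q` is assumed to be a NumPy array indexed by `(state, action)`):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: uses the action the agent actually takes in s_next,
    # so exploratory steps into traps drag down the values of risky states.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: uses the best action in s_next regardless of what
    # the agent actually does, so it values the shortest (riskier) path.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```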