Currently the reward for the bandits is not stable: at the very beginning of optimization the rewards given for actions are much larger than at the end of optimization. The reward must therefore be calculated using a sliding window and other tricks, as in this article.
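
As a rough illustration of the sliding-window idea, here is a minimal sketch of a UCB-style bandit that only looks at the most recent pulls, so the large rewards from the start of optimization stop influencing the estimates once they fall out of the window. It is not taken from the article mentioned above; the class name `SlidingWindowUCB` and the parameters `window` and `exploration` are hypothetical and chosen only for this example.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """UCB-style bandit whose statistics cover only the last `window` pulls.

    Hypothetical sketch: names and default values are assumptions.
    """

    def __init__(self, n_arms, window=50, exploration=1.0):
        self.n_arms = n_arms
        self.exploration = exploration
        # (arm, reward) pairs; the oldest pair is dropped once the window is full.
        self.history = deque(maxlen=window)

    def _window_stats(self):
        # Recompute per-arm counts and reward sums over the current window.
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        return counts, sums

    def mean(self, arm):
        # Windowed mean reward of an arm (0.0 if it is not in the window).
        counts, sums = self._window_stats()
        return sums[arm] / counts[arm] if counts[arm] else 0.0

    def select_arm(self):
        counts, sums = self._window_stats()
        # First try any arm that has no observation inside the window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        total = len(self.history)
        scores = [
            sums[a] / counts[a]
            + self.exploration * math.sqrt(math.log(total) / counts[a])
            for a in range(self.n_arms)
        ]
        return max(range(self.n_arms), key=scores.__getitem__)

    def update(self, arm, reward):
        self.history.append((arm, reward))
```

A typical loop would call `select_arm()`, apply the chosen action, and pass the observed reward to `update()`. The window size trades off reactivity against noise: a short window forgets early, inflated rewards quickly but makes the estimates noisier.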