It could be useful to bound rewards within the [-1;1] range.
Now you can find values such as -100 or 50. This was done to highly motivate the agent to do or not to do actions.
These values are used in the MSE loss function, which could lead to gradient explosions and break the learning process.
It could be useful to bound rewards within the [-1;1] range. Now you can find values such as -100 or 50. This was done to highly motivate the agent to do or not to do actions. These values are used in the MSE loss function, which could lead to gradient explosions and break the learning process.