Tuxliri / RL_rocket

Repository for the development of my master's thesis on the control of launch vehicle descent and landing through reinforcement learning actors.

Reward shaping #5

Open Tuxliri opened 2 years ago

Tuxliri commented 2 years ago

Observation: if the only negative reward arrives at the last timestep, then under discounting the longer the episode stays in non-negative-reward territory before that terminal penalty, the larger the discounted return. An agent that learns to prolong the episode is therefore still correctly maximizing the reward. See https://www.reddit.com/r/reinforcementlearning/comments/k27lnv/do_strictly_negative_rewards_work_with_discounting/
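
As a quick illustration (not code from this repository, with an assumed discount factor and penalty), the discounted return of an episode whose only nonzero reward is a negative terminal one grows toward zero as the episode gets longer, so prolonging the episode genuinely increases the quantity the agent optimizes:

```python
# Sketch: discounted return G = gamma**(T-1) * R_T of an episode with zero per-step
# reward and a single negative terminal reward R_T at timestep T.
# Because R_T < 0, G increases toward 0 as T grows, so delaying termination pays off.

GAMMA = 0.99              # assumed discount factor
TERMINAL_REWARD = -100.0  # assumed terminal penalty (e.g. for crashing)


def discounted_return(episode_length: int,
                      gamma: float = GAMMA,
                      terminal_reward: float = TERMINAL_REWARD) -> float:
    """Discounted return when the only nonzero reward is at the final step."""
    return gamma ** (episode_length - 1) * terminal_reward


if __name__ == "__main__":
    for T in (10, 50, 100, 500):
        print(f"T = {T:4d} -> G = {discounted_return(T):9.3f}")
    # Roughly -91.4, -61.1, -37.0, -0.66: longer episodes give a better return.
```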

Tuxliri commented 2 years ago

https://www.reddit.com/r/reinforcementlearning/comments/k27lnv/do_strictly_negative_rewards_work_with_discounting/gdt5v0m/

The first problem you have is that the agent has no incentive to minimize each episode's length. Assuming the agent is able to get a terminal reward of 0, the discounted reward will be the same no matter how long each episode is. You should change 0 to some positive number if you want each episode to be as short as possible.

As to your other concern, depending on how difficult it is to get a terminal reward of 0, this may be challenging. The agent could get stuck at a local maximum by trying to just maximize the length of each episode, thereby ending up at a completely different peak. Of course, this depends on the specifications of the environment, as each one is different. If the agent is able to get a reward of 0 through exploration at least a few times, this should not be a problem. It would be best to increase exploration at the beginning to avoid your agent getting stuck at a local maximum.

It might be worth trying an algorithm like Q-learning as opposed to a policy-based one if you are trying to solve the environment. While policy-based methods like PPO are considered more stable, due to the way exploration is handled in them they are more likely to get stuck in a local maximum. If you want to know more about ways to increase exploration, try reading about curiosity-driven learning.
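
One way to implement that suggestion is sketched below. This is a hypothetical illustration, not the repository's actual environment: it assumes a Gym-style environment whose step() returns a 4-tuple and a hypothetical info["landed"] flag signalling a successful touchdown. It replaces the neutral terminal reward with a positive bonus on success and adds a small per-step penalty so shorter episodes are preferred:

```python
import gym


class LandingRewardWrapper(gym.Wrapper):
    """Reshape the terminal reward of a Gym-style landing environment (sketch)."""

    def __init__(self, env, success_bonus=100.0, step_penalty=0.1):
        super().__init__(env)
        self.success_bonus = success_bonus  # assumed positive terminal reward
        self.step_penalty = step_penalty    # assumed small per-step cost

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Penalize every timestep so the agent cannot profit from stalling.
        reward -= self.step_penalty
        # Give a positive bonus on a successful landing instead of a terminal 0.
        # 'landed' is a hypothetical flag; the real environment may expose this
        # information differently (or not at all).
        if done and info.get("landed", False):
            reward += self.success_bonus
        return obs, reward, done, info
```

The bonus and penalty magnitudes here are placeholders and would need to be tuned against whatever reward terms the environment already uses.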