Tuxliri / RL_rocket

Repository for the development of my master's thesis on the control of launch vehicle descent and landing through reinforcement learning actors.

Reward shaping #5

Open Tuxliri opened 2 years ago

Tuxliri commented 2 years ago

Observation: if the only negative reward arrives at the last timestep, then under discounting the longer the episode stays in non-negative-reward territory before that terminal penalty, the larger the discounted return. An agent that learns to prolong the episode is therefore still correctly maximizing the reward. See https://www.reddit.com/r/reinforcementlearning/comments/k27lnv/do_strictly_negative_rewards_work_with_discounting/
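
As a quick illustration (not code from this repository, with an assumed discount factor and penalty), the discounted return of an episode whose only nonzero reward is a negative terminal one grows toward zero as the episode gets longer, so prolonging the episode genuinely increases the quantity the agent optimizes:

```python
# Sketch: discounted return G = gamma**(T-1) * R_T of an episode with zero per-step
# reward and a single negative terminal reward R_T at timestep T.
# Because R_T < 0, G increases toward 0 as T grows, so delaying termination pays off.

GAMMA = 0.99              # assumed discount factor
TERMINAL_REWARD = -100.0  # assumed terminal penalty (e.g. for crashing)


def discounted_return(episode_length: int,
                      gamma: float = GAMMA,
                      terminal_reward: float = TERMINAL_REWARD) -> float:
    """Discounted return when the only nonzero reward is at the final step."""
    return gamma ** (episode_length - 1) * terminal_reward


if __name__ == "__main__":
    for T in (10, 50, 100, 500):
        print(f"T = {T:4d} -> G = {discounted_return(T):9.3f}")
    # Roughly -91.4, -61.1, -37.0, -0.66: longer episodes give a better return.
```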

Tuxliri commented 2 years ago

https://www.reddit.com/r/reinforcementlearning/comments/k27lnv/do_strictly_negative_rewards_work_with_discounting/gdt5v0m/

The first problem you have is that the agent has no incentive to minimize each episode's length. Assuming the agent is able to get a terminal reward of 0, the discounted reward will be the same no matter how long each episode is. You should change 0 to some positive number if you want each episode to be as short as possible.

As to your other concern, depending on how difficult it is to get a terminal reward of 0, this may be challenging. The agent could get stuck at a local maximum by trying to just maximize the length of each episode, thereby ending up at a completely different peak. Of course, this depends on the specifications of the environment, as each one is different. If the agent is able to get a reward of 0 through exploration at least a few times, this should not be a problem. It would be best to increase exploration at the beginning to avoid your agent getting stuck at a local maximum.

It might be worth trying an algorithm like Q-learning as opposed to a policy-based one if you are trying to solve the environment. While policy-based methods like PPO are considered more stable, due to the way exploration is handled in them they are more likely to get stuck in a local maximum. If you want to know more about ways to increase exploration, try reading about curiosity-driven learning.
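
One way to implement that suggestion is sketched below. This is a hypothetical illustration, not the repository's actual environment: it assumes a Gym-style environment whose step() returns a 4-tuple and a hypothetical info["landed"] flag signalling a successful touchdown. It replaces the neutral terminal reward with a positive bonus on success and adds a small per-step penalty so shorter episodes are preferred:

```python
import gym


class LandingRewardWrapper(gym.Wrapper):
    """Reshape the terminal reward of a Gym-style landing environment (sketch)."""

    def __init__(self, env, success_bonus=100.0, step_penalty=0.1):
        super().__init__(env)
        self.success_bonus = success_bonus  # assumed positive terminal reward
        self.step_penalty = step_penalty    # assumed small per-step cost

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Penalize every timestep so the agent cannot profit from stalling.
        reward -= self.step_penalty
        # Give a positive bonus on a successful landing instead of a terminal 0.
        # 'landed' is a hypothetical flag; the real environment may expose this
        # information differently (or not at all).
        if done and info.get("landed", False):
            reward += self.success_bonus
        return obs, reward, done, info
```

The bonus and penalty magnitudes here are placeholders and would need to be tuned against whatever reward terms the environment already uses.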