EvolutionGym / evogym

A large-scale benchmark for co-optimizing the design and control of soft robots, as seen in NeurIPS 2021.
https://evolutiongym.github.io/
MIT License

The reward function of 'Jumper-v0' might be incorrect #24

Closed · drdh closed this issue 1 year ago

drdh commented 1 year ago
# collect pre step information
pos_1 = self.object_pos_at_time(self.get_time(), "robot")

# step
done = super().step({'robot': action})

# collect post step information
pos_2 = self.object_pos_at_time(self.get_time(), "robot")

# compute reward
com_1 = np.mean(pos_1, 1)
com_2 = np.mean(pos_2, 1)
reward = (com_2[1] - com_1[1])*10 - abs(com_2[0] - com_1[0])*5

https://github.com/EvolutionGym/evogym/blob/9a1a5e7b26702184821e6e64587220ead2ab0e21/evogym/envs/jump.py#LL36-L54C71

While the robot is rising the reward is positive, but while it is falling the reward is negative, so over a full jump-and-land cycle the cumulative reward comes out to roughly 0 and nothing is learned from the landing phase. Under this reward, the optimal policy is to time the jump so that the robot reaches its highest point exactly at the last step of the episode.
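Below is a minimal sketch of this telescoping argument (the center-of-mass heights are made up, and the horizontal-drift penalty is ignored):

import numpy as np

# Made-up center-of-mass heights over an episode: the robot jumps from
# y=1 up to y=3 and lands back at y=1.
heights = np.array([1.0, 2.0, 3.0, 2.0, 1.0])

# Per-step vertical reward as in Jumper-v0: 10 * (height after step - height before step).
step_rewards = 10 * np.diff(heights)

# The sum telescopes to 10 * (final height - initial height), so a full
# jump-and-land trajectory earns zero cumulative reward.
print(step_rewards.sum())                       # 0.0

# A trajectory that peaks exactly on the last step keeps the full reward.
heights_peak_at_end = np.array([1.0, 2.0, 3.0])
print(10 * np.diff(heights_peak_at_end).sum())  # 20.0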

drdh commented 1 year ago

Perhaps one explanation for why it still learns is that the robot has to jump higher and higher each time to keep collecting positive reward.

jagdeepsb commented 1 year ago

Thanks for the comment. It seems that this reward was sufficient to learn a good jumping behavior, but it is perhaps not ideal/optimal given what you have pointed out. If you find that something else works better, feel free to let us know.

drdh commented 1 year ago

Thanks. It is indeed sufficient to learn a good policy, but the policy might not be stable during training. Perhaps more investigation is needed.
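One hypothetical direction, sketched below purely as an illustration (this is not from the evogym codebase, and the function name and weights are made up), would be to reward only height gained above the running maximum, so the return no longer cancels out when the robot lands:

def shaped_jump_reward(prev_max_height, com_1, com_2, w_up=10.0, w_drift=5.0):
    """Hypothetical reward sketch: pay only for new height gained above the
    best height reached so far, keeping the horizontal-drift penalty.

    com_1, com_2: robot center of mass before/after the step, as (x, y).
    Returns (reward, updated running-max height).
    """
    new_max = max(prev_max_height, com_2[1])
    reward = w_up * (new_max - prev_max_height) - w_drift * abs(com_2[0] - com_1[0])
    return reward, new_max

With shaping like this, the cumulative reward over an episode would equal w_up times the peak height reached (minus drift penalties), so landing would not undo earlier progress.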