drdh closed this issue 1 year ago
One possible explanation is that the robot jumps higher and higher each time.
Thanks for the comment. It seems that this reward was sufficient to learn a good jumping behavior, but it is perhaps not optimal given what you have pointed out. If you find that something else works better, feel free to let us know.
Thanks. It is indeed sufficient to learn a good policy, but the policy might not be stable during training. More investigation may be needed.
https://github.com/EvolutionGym/evogym/blob/9a1a5e7b26702184821e6e64587220ead2ab0e21/evogym/envs/jump.py#LL36-L54C71
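When the robot jumps up, the reward is positive, but when it falls, the reward is negative. The per-step rewards therefore cancel, and the cumulative reward is 0 once the robot has landed back at its starting height. A complete jump earns nothing, so nothing will be learned from it. Under this reward, the optimal policy is to jump just before the end of an episode and reach the highest point exactly at the final step.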
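The cancellation can be checked with a small sketch. It assumes the per-step reward is simply proportional to the change in height between consecutive steps; the function name `step_rewards`, the `scale` parameter, and the height trajectories are hypothetical and are not taken from `jump.py`, which may include additional terms.

```python
def step_rewards(heights, scale=1.0):
    """Per-step reward, assumed proportional to the change in height.

    `heights` is a hypothetical list of center-of-mass heights, one per step.
    """
    return [scale * (b - a) for a, b in zip(heights, heights[1:])]


# A complete jump: the robot rises, then lands back at its starting height.
full_jump = [0.0, 0.5, 1.0, 0.5, 0.0]

# A late jump: the robot waits, then peaks exactly at the final step.
late_jump = [0.0, 0.0, 0.0, 0.5, 1.0]

# The sum telescopes to (final height - initial height).
total_full = sum(step_rewards(full_jump))
total_late = sum(step_rewards(late_jump))

print(total_full)  # 0.0
print(total_late)  # 1.0
```

Because the sum telescopes to the final height minus the initial height, only the position at the last step matters under this assumption, which is consistent with the late-jump policy described above.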