Closed siddarth-c closed 2 months ago
Have you validated that the new version works as expected? Also, please compare training performance across the two versions.
Thanks
Sorry for the wait. I've validated the updated reward function. Episodes with the new reward are, on average, 2.5 times shorter, with goals reached in about 50 timesteps compared to the previous 120, meaning the agent heads to the goal directly rather than wandering around it.
Here is a repo with the code and a few plots illustrating the behaviour. AntMaze did not learn within 1e6 timesteps and I can't afford to run longer, but the difference in behaviour is quite evident in PointMaze.
Thanks!
Your charts are wrong: for example, episodic_return with the "new reward" reaches positive values, which is impossible since it is a sum of non-positive values. There is also no indication of how many runs were tested.
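The reviewer's point can be checked in a few lines: under the proposed `-distance` reward every per-step reward is non-positive, so the episodic return, being their sum, can never be positive. A minimal sketch with illustrative distances:

```python
# Under the -distance reward each step yields reward = -distance <= 0,
# so the episodic return (sum of non-positive terms) must be <= 0.
# Illustrative per-step distances to the goal over one short episode:
distances = [1.2, 0.8, 0.3, 0.0]
rewards = [-d for d in distances]   # every reward is <= 0
episodic_return = sum(rewards)      # -2.3
assert episodic_return <= 0         # a positive return is impossible
```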
Description
Updated the dense reward of the Maze environments from `exp(-distance)` to `-distance`
Fixes #175
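A minimal sketch of the two dense reward formulations this PR swaps (function names are illustrative, not the exact Gymnasium-Robotics code):

```python
import numpy as np

def dense_reward_old(achieved_goal, desired_goal):
    # Previous formulation: exp(-distance), bounded in (0, 1].
    # Far from the goal the gradient flattens toward zero.
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return np.exp(-distance)

def dense_reward_new(achieved_goal, desired_goal):
    # Updated formulation: -distance, always <= 0, so the penalty
    # grows linearly with how far the agent is from the goal.
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return -distance
```

Both rewards peak at the goal (1.0 vs 0.0); the difference is that `-distance` keeps a constant-magnitude gradient far from the goal, while `exp(-distance)` gives almost no signal there.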
Type of change
Please delete options that are not relevant.
Screenshots
Please attach before and after screenshots of the change if applicable.
Checklist:
- I have run the `pre-commit` checks with `pre-commit run --all-files` (see `CONTRIBUTING.md` instructions to set it up)