Closed: hukaidong closed this issue 5 years ago

Recently, while implementing an experiment scene, I noticed that the actions produced by an ML-Agents-trained network are clamped to the range [-1, 1], even though I have no explicit clamp in my code. I could not find this behavior documented anywhere. Confusingly, many examples in this repository explicitly clamp their actions before applying physics, which can give developers the misleading impression that the action vector may contain values greater than 1.

After a short investigation, I found that the clamping behavior comes from a recent merge: https://github.com/Unity-Technologies/ml-agents/pull/649. Could this clamping be documented somewhere in the project? Making it optional would also be welcome, since many scenarios have no natural action bounds, and imposing arbitrary bounds could hurt the learning result.
Hi @hukaidong
This is indeed documented here: https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Design-Agents.md#continuous-action-space
By default, the output from our provided PPO algorithm pre-clamps the values of vectorAction to the [-1, 1] range. It is best practice to clip these values manually as well if you plan to use a third-party algorithm with your environment.
This clipping greatly improves stability when training with PPO and other policy gradient algorithms, which is why it is enabled by default.
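For illustration, a defensive clamp in an agent's action handler might look like the following. This is a minimal sketch, not code from the repository: the class and field names are made up, and the `AgentAction` signature and `MLAgents` namespace assume an older ML-Agents release, so adjust them to your version.

```csharp
using UnityEngine;
using MLAgents; // namespace of the Agent base class in older releases; adjust to your version

// Hypothetical agent sketch: class and field names are illustrative.
public class ClampedAgent : Agent
{
    public Rigidbody body;
    public float maxForce = 10f;

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // The built-in PPO trainer already emits values in [-1, 1], but a
        // third-party algorithm may not, so clamp before applying physics.
        float x = Mathf.Clamp(vectorAction[0], -1f, 1f);
        float z = Mathf.Clamp(vectorAction[1], -1f, 1f);
        body.AddForce(new Vector3(x, 0f, z) * maxForce);
    }
}
```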
Ah, I see. That makes the code much clearer to me. Thanks for the answer!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.