Multi-objective Gymnasium already supports a reward vector.
Why do you need non-bool termination? Note that we already have info, which can include termination components.
Also, can you check your email inbox (***@ethz.ch)? Thanks.
Based on the code there, I see it extends the array type to a tensor of shape (num_envs, num_rewards). However, we want to use dictionaries because they are easier and clearer to handle in downstream learning frameworks. This, of course, depends on the application and preferences. That's why in Orbit, at least for observations, we let users decide whether they want to return dicts or a flat vector. We'd like to support the same for rewards and terminations.
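For illustration, a minimal sketch of the difference; the component names and sizes here are hypothetical:

```python
import numpy as np

# MO-Gymnasium-style vectorized reward: one stacked tensor of shape
# (num_envs, num_rewards); which column means what lives outside the data
mo_rewards = np.zeros((64, 3), dtype=np.float32)
progress = mo_rewards[:, 0]  # index-based access

# Dict-style reward: components addressed by name, so downstream
# learning code needs no extra bookkeeping about column order
dict_rewards = {
    "progress": np.zeros(64, dtype=np.float32),
    "action_penalty": np.zeros(64, dtype=np.float32),
}
scalar_reward = sum(dict_rewards.values())  # e.g. collapse to one reward per env
```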
Why do you need non-bool termination? Note that we already have info, which can include termination components.
For multi-agent RL: you want to give separate bool signals to tell which agent did something wrong in a decentralized learning setup. The principle here goes in the same direction as observations, i.e., having "groups" of obs/rewards/terminations.
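As a minimal sketch, with hypothetical agent names:

```python
# Per-agent bool terminations from one step of a hypothetical two-agent env
terminations = {"agent_0": False, "agent_1": True}

# A single bool loses the per-agent signal
any_done = any(terminations.values())

# Decentralized learners can instead react per agent
for agent, done in terminations.items():
    if done:
        print(f"{agent} terminated; close only its trajectory buffer")
```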
Ideally, we would like to maintain a single inheritance from gym.VecEnv / gym.Env to avoid confusion. But from what I see in the MOVecEnv definition, subclasses are free to define the spaces for all the signals based on their requirements. If that's the recommended path, then it's alright; we can handle the dict/box spaces internally in our framework's RL class.
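A minimal sketch of that pattern, keeping single inheritance from gym.Env while the subclass declares the extra spaces; the attribute name reward_space mirrors MO-Gymnasium's convention, and the dict keys are illustrative:

```python
import gymnasium as gym
import numpy as np


class MultiSignalEnv(gym.Env):
    """Single gym.Env inheritance; the subclass declares the extra spaces."""

    observation_space = gym.spaces.Dict(
        {"policy": gym.spaces.Box(-np.inf, np.inf, shape=(8,))}
    )
    action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,))
    # Extra, framework-specific space declarations (illustrative):
    reward_space = gym.spaces.Dict(
        {
            "progress": gym.spaces.Box(-np.inf, np.inf, shape=()),
            "action_penalty": gym.spaces.Box(-np.inf, 0.0, shape=()),
        }
    )
```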
No, I wasn't aware of PettingZoo. But based on the documentation, it seems very close to the proposal above (a dict of rewards and terminations).
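For reference, PettingZoo's parallel API already returns such per-agent dicts; roughly, from its documented usage:

```python
from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.parallel_env()
observations, infos = env.reset(seed=42)

while env.agents:
    # one action per live agent
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    # rewards, terminations, and truncations are all dicts keyed by agent name
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()
```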
I'll go with 2 for the framework. I'd be happy to engage in further discussions to keep "one" base Env definition for all these applications.
Gymnasium is intended only for single-agent RL environments, allowing other projects to specialise in multi-objective / multi-reward and multi-agent settings. Therefore, we will almost certainly not be adding these features to this project.
Maybe I am missing the consequences of changing the default type specs of the reward_range to a more general type. Nonetheless, if Gymnasium is focused only on single-agent RL, then this discussion seems moot. Thanks for the help.
Thanks for understanding, @Mayankm96. It sounds like an interesting project; good luck with it!
Proposal
To support more algorithmic work (multi-agent RL, multi-critic learning), it would be great to extend the supported types of rewards and terminations beyond a box range and bool, respectively.
Motivation
No response
Pitch
Gymnasium natively supports dict spaces for observations. The same could be followed for the other MDP signals.
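For reference, the existing observation support:

```python
import gymnasium as gym

# Dict observation spaces are already first-class in Gymnasium
obs_space = gym.spaces.Dict(
    {
        "proprio": gym.spaces.Box(-1.0, 1.0, shape=(12,)),
        "goal": gym.spaces.Discrete(4),
    }
)
sample = obs_space.sample()  # a dict with "proprio" and "goal" entries
```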
Alternatives
No response
Additional context
Since Python is flexible with its type hinting, nothing stops users from returning a dict of rewards and terminations and handling it in their learning library. However, it would be nice to support this more explicitly.
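A minimal sketch of what already works today, since the reward/termination type hints (SupportsFloat, bool) are not enforced at runtime; the env and its keys are hypothetical:

```python
import gymnasium as gym


class DictSignalEnv(gym.Env):
    """Hypothetical env returning dict rewards/terminations from step()."""

    observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))
    action_space = gym.spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        rewards = {"task": 1.0, "energy": -0.1}  # dict instead of a float
        terminations = {"agent_0": False, "agent_1": False}  # dict instead of a bool
        return obs, rewards, terminations, False, {}
```

Note that such an env will break standard wrappers that assume a scalar reward (e.g. episode-statistics recording), which is exactly why explicit support would help.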
Checklist