Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
https://gymnasium.farama.org
MIT License

[Proposal] Introduce a unified way to track agent quality #489

Closed. richard-hajek closed this issue 1 year ago.

richard-hajek commented 1 year ago

Proposal

I would very much appreciate a unified way to find out how well the agent did, on a predefined scale shared by all environments. For example, a method that returns 1 for a "perfect agent" and 0 for a "noop agent" or the worst possible agent.

Motivation

I am a student at CTU and I'm trying to learn reinforcement learning. This is my third attempt; both previous attempts were unsuccessful. Besides personal reasons, something that is regularly counter-intuitive to me and took me some time to get over is that there is no "solved :fireworks: :fireworks: :partying_face:" fanfare for the agent.

Success looks like the agent reaching an evaluation score of -20 instead of -100, which is extremely underwhelming and not unified across environments. (What is this environment's best score? Is it -20? Is it -10? Or perhaps +100? There is no way to find out programmatically; I have to check the docs.)

For example, this is the training progress output from Stable Baselines:

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 70.4     |
|    ep_rew_mean      | -70.4    |
|    exploration rate | 0.05     |
| time/               |          |
|    episodes         | 4884     |
|    fps              | 704      |
|    time_elapsed     | 283      |
|    total timesteps  | 199689   |
| train/              |          |
|    learning_rate    | 0.01     |
|    loss             | 0.0816   |
|    n_updates        | 49672    |
----------------------------------

Is -70 good? Is -70 bad? I have to actively think about it, which puts extra mental burden on the coder. A single number showing "0.99" would be just amazing.

Pitch

Introduce an API on gymnasium.core.Env to facilitate the evaluation of agents, with the following signature:

class Env(...):

    ...

    def rate(self) -> float:
        """
        Rate the last episode on a scale of 0 to 1, where 0 is the worst run possible
        and 1 is a perfect agent that took all the right steps.

        :return: A float representing the rating of the last run.
        :raises NotImplementedError: If the method is not implemented by the environment author.
        :raises RuntimeError: If the episode has not finished yet.
        """
        raise NotImplementedError

When rate() returns > 0.95, the agent can be considered "good enough" and the environment "reasonably solved"; both definitions are up to the environment author.
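To illustrate, an evaluation loop could then end like this (a sketch only: rate() is the proposed method and does not exist in Gymnasium today, and the random policy stands in for a trained agent):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # stand-in for a trained policy
    obs, reward, terminated, truncated, info = env.step(action)

if env.rate() > 0.95:  # proposed method, does not exist in Gymnasium today
    print("solved! :fireworks: :fireworks: :partying_face:")
```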

Alternatively, the default implementation of rate() could just be last_total_return / best_possible_return, with some math magic to account for negative rewards.
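A minimal sketch of that default, continuing the class above; worst_possible_return, best_possible_return, _last_total_return, and _episode_finished are all hypothetical attributes, and the min-max shift is the "math magic" that keeps negative returns inside [0, 1]:

```python
def rate(self) -> float:
    """Min-max normalize the last episode's total return into [0, 1].

    Shifting by the worst possible return handles negative rewards,
    which a plain last_total_return / best_possible_return would not.
    """
    if not self._episode_finished:  # hypothetical bookkeeping flag
        raise RuntimeError("The episode has not finished yet")
    # Both bounds are hypothetical attributes the environment author would set.
    worst, best = self.worst_possible_return, self.best_possible_return
    return (self._last_total_return - worst) / (best - worst)
```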

Alternatives

Any alternative has the issue that newcomers don't expect this to even be an issue. As a newcomer, I expect to be able to see, trivially, whether the last agent "did well", not to code an entire patchwork of approaches to approximate this. With that being said:

Ad-hoc computation (see the sketch after this list):
:heavy_plus_sign: Would be flexible
:heavy_minus_sign: Would be a huge function, with a definition of rate() for each environment
:heavy_minus_sign: Wouldn't be integrated into other frameworks (such as SB3)
:heavy_minus_sign: Would break if the environment gets updated with a different reward system

RewardWrapper:
:heavy_plus_sign: Universal for all environments
:heavy_minus_sign: Cannot be accurate; at best it can tell how well the agent performed compared to other agents

Gymnasium-Robotics/GoalEnv:
:heavy_minus_sign: Enforces different observations
:heavy_minus_sign: Is not installed by default
:heavy_minus_sign: Doesn't actually help when the environment I am learning on doesn't extend GoalEnv
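To make the ad-hoc option concrete, here is a sketch of the patchwork it implies; the score ranges below are rough illustrative guesses, not documented optima, and every entry would have to be hand-maintained:

```python
# Hand-maintained per-environment score ranges (illustrative values only).
SCORE_RANGES = {
    "CartPole-v1": (0.0, 500.0),
    "MountainCar-v0": (-200.0, -90.0),  # rough guess at a good agent's return
}

def rate_adhoc(env_id: str, episode_return: float) -> float:
    # Breaks silently if the environment's reward scheme is ever changed.
    worst, best = SCORE_RANGES[env_id]
    return (episode_return - worst) / (best - worst)
```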

Additional context

I seem to remember that environments had some kind of reward_range property, which would alleviate some of these issues, but I can't find it; perhaps I hallucinated it, or it was removed?

Additionally, my ultimate goal is to rate agents in a human-friendly way; I do not care about this specific implementation.


Kallinteris-Andreas commented 1 year ago

1) There already exists a reward_threshold for some environments.
2) Not all environments have a best_possible_return (and in many cases it is not known).
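For reference, you can read it from the registry through the environment spec; a quick sketch (the printed value is whatever was registered for that environment, and it is None where no threshold was registered):

```python
import gymnasium as gym

# reward_threshold lives on the EnvSpec in the registry; it is None
# for environments whose authors did not register one.
spec = gym.spec("CartPole-v1")
print(spec.reward_threshold)  # e.g. 475.0 for CartPole-v1
```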

richard-hajek commented 1 year ago

Oh, you're right, I didn't think to check the registry for this information, thanks! You may close the issue then.

Kallinteris-Andreas commented 1 year ago

@richard-hajek you can close your own issues: https://docs.github.com/en/issues/tracking-your-work-with-issues/closing-an-issue

richard-hajek commented 1 year ago

I am perfectly aware, but sometimes maintainers want issues to stay open for some time to allow for more discussion.