[Proposal] Make ignoring test warnings more flexible

dm-ackerman commented 9 months ago

Proposal

The test suite throws warnings for env behaviours that are generally frowned upon, but are expected and correct behaviour for some envs. I want to give envs the ability to include information on which warnings to suppress for that env. The test suite will know from the env what is an intentional part of the design so it won't complain about it.

This will make the api test more useful to env developers. It will also document that a non-standard behaviour is intentional.

This would be a completely optional feature. All envs would work as is without change.

Motivation

Many included envs raise warnings during pytest despite correct behaviour. Unless these are suppressed, the test will always complain about those envs. Currently that suppression is centrally tracked via hard-coded lists in the test suite.

This is an issue for a couple reasons:

Any change to an env that changes the version or warnings requires also changing the test suite lists - doable, but not ideal
This method doesn't work for custom envs because it requires editing the pettingzoo repo for each env. This has been complained about on discord.
The same code can give different behaviour. If a new user copies and pastes an existing env (or code from one), they may see their code suddenly gives warnings on api_test even though the original does not. That shouldn't happen. It can be confusing/frustrating to a new user.

Pitch

There are 3 parts to this:

In the test suite, add custom warnings for items we want to be ignorable:

class PZWarning(Warning):
    msg = "Generic Petting Zoo warning"

    def __str__(self) -> str:
        return self.warning

class ObservationNotNumPyWarning(PZWarning):
    warning = "Observation is not a NumPy array"

These will be used in place of the current generic user warnings. The warning given by the test code for this class:

PettingZoo/pettingzoo/test/api_test.py:139: pz_warnings.ObservationNotNumPyWarning: Observation is not a NumPy array

In an env, add an optional function that is called by the test suite to set filters to ignore warnings that the env is expected to throw.

import warnings
from pettingzoo.test import pz_warnings

class SomeEnv:
    ...
    def set_pz_test_filters(self) -> None:
        """
        This function is only used by the test suite and may be omitted.
        The purpose is to suppress certain warnings that are expected for
        this environment. In this case, the observations in the environment
        may be all zero. This raises a warning because in some environments
        that is not normal and a sign of a problem. But it is normal for this
        environment, so we set a filter to ignore it.
        Warnings should only be ignored if the behaviour is an intentional
        part of the environment design. A comment explaining why the warning
        is ignored should be included.
        """
        # initial observation for this env is all zeros, so this warning is spurious
        warnings.simplefilter("ignore", pz_warnings.ObservationAllZerosWarning)

In the test code, call env.set_pz_test_filters(), if it exists, to set filters. This will add the filters from the env and reset them after this test. The erroneous warnings will be ignored.

def api_test(env, num_cycles=1000, verbose_progress=False):
    ...
    with warnings.catch_warnings():
        try:
            env.set_pz_test_filters()
        except AttributeError:  # no filter function defined, that's ok
            pass

    ... rest of current test code ...

Alternatives

Leave the centrally located suppression lists for repo envs, and leave users to suppress warnings on their own in their custom envs. pros: easier cons: repo envs and custom envs behave differently. This puts an added barrier to a new env developer. (see also motivation section for other cons)
Put warning filters at the top of the env files so they are always active. pros: prevents adding a function to the env code. avoids the need for item 3 in pitch section. cons: always applies filters even during non-tests. This is a global operation which will trigger a bug where the wrong filters are used if more than one env is loaded in the same session (plausible/likely during development; i think it also happens during a couple tests).
Instead of the function (item 2 of pitch), put the names of the warnings to ignore in the metadata (see example below). All warnings in this list will be ignored by the test suite. pros: cleaner in the sense that it prevents extra executable code. prevents additional imports in envs cons: it is less flexible and might be limiting if there is a custom filter a developer wants to add (maybe this is ok?)

In this approach, an env author would only need to add the following to the metadata to suppress a warning - no imports and no extra functions needed:

    metadata = {
        ... ,
        "warning_filters": [
            "ObservationAllZerosWarning",  # <--- ignore this warning
        ],
    }

Instead of defining the warnings (item 1 of pitch), use regexs. pros: don't need to define new warnings cons: if the warning text changes, these break. Also, a too general regex can incorrectly catch other warnings, especially when the messages are similar.

Additional context

I can make these changes, but there was some pushback to the idea on discord so i want to reach a consensus on the best approach to addressing the problems described in the motivation section.

Checklist

[X] I have checked that there is no similar issue in the repo

elliottower commented 9 months ago

I'd have to read this through in more details to properly understand it but my first thought is that I know Gymnasium handles certain errors in their envs in testing just by making a list of errors to catch and doing it in pretty simple way (link), my preference would be to make our testing line up closer with how they do it, rather than developing a completely new system like this. That approach means there is nothing extra in the environment files and the warnings and such all is contained in the test runner files (test_envs.py), which are something that can be adapted to new repositories.

If a given user wants to hard code things into their env in order to ignore certain warnings they can, but I think by default the idea is that we want the libraries to err on the side of warning too much, rather than warning too little--they are just warnings which can be ignored if they aren't relevant to you.

Having these sorts of things hard coded in the API test or seed test is not the best design (I believe there is some of this in PettingZoo's repo, don't see anything in Gymnasium which is good)

There's a lot of quality of life things that Gymnasium does that I feel like would be more useful to have for new users, as compared to things like this which IMO is just an extra level of abstraction to learn and potentially get confused by. For example this file also has some warning catching as well, and has helpful links to the Gymnasium environment creation tutorial https://github.com/Farama-Foundation/Gymnasium/blob/81151aba1ee45df2327a01dfebcdede140aac0fe/tests/utils/test_env_checker.py#L251

I've talked with Jet about it as well but IMO the biggest issue for new users is that there are two environment creation tutorials (one in the tutorials section and one in the general documentation section: https://pettingzoo.farama.org/content/environment_creation/ and https://pettingzoo.farama.org/tutorials/custom_environment/, I tried to rename them and be a bit more clear, but still could use work). If these were cleaned up and turned into more full examples (or Gymnasium even has a full repo which is a 'skeleton repo' https://github.com/Farama-Foundation/gymnasium-env-template) and linked in the error codes and stuff like that, I think that would be super useful as well, just as another example.

dm-ackerman commented 9 months ago

I'll respond to the rest later, but a quick clarification.

Users of pettingzoo envs will see no changes. For an env writers, the only change is an optional function of the form:

import warnings
from pettingzoo.test import pz_warnings

    def set_pz_test_filters(self) -> None:
        # initial observation for this env is all zeros, so this warning is spurious
        warnings.simplefilter("ignore", pz_warnings.ObservationAllZerosWarning)

All the rest is internal changes to the test suite.

dm-ackerman commented 9 months ago

re: gymnasium handling. Looks like they ignore all warnings of a specific type. That goes against erring on the side of warning too much. I'd think a warning should be issued unless the env author has explicitly said it's by design. This approach adds a standard way for authors to do so.

I disagree with the test code tracking which warnings to ignore for each env (the current approach) for the reasons given in the motivation above. As a concrete example, say I want to change knights_archers_zombies so zombies act differently: I copy the knights_archers_zombies, rename it 'knights_archers_decepticons' and change nothing else. I run the api_test on the original - no warnings, great!. When I run it on my identical (but renamed) env it complains ???

Original, no issues:

>>> from pettingzoo.test import api_test
>>> from pettingzoo.butterfly import knights_archers_zombies_v10
>>> kaz_env = knights_archers_zombies_v10.env()
>>> api_test(kaz_env, num_cycles=1000, verbose_progress=False)
Starting API test
Passed API test

Renamed, but otherwise identical, code gives problems:

>>> from pettingzoo.test import api_test
>>> import knights_archers_decepticons_v0
>>> kad_env = knights_archers_decepticons_v0.env()
>>> api_test(kad_env, num_cycles=1000, verbose_progress=False)
Starting API test
/opt/home/code/PettingZoo/pettingzoo/test/api_test.py:184: UserWarning: The observation contains negative numbers
and is in the shape of a graphical observation. This might be a bad thing.

Should I get different results from identical code? That's not a good experience and will baffle a new env author. How does an env author fix this? By either editting the PZ repo test code or adding filters that aren't in the original env. If I'm a new env author, both of those are confusing and not obvious.

I agree that there are other QOL items that can be improved. And I agree it'd be nice to avoid putting test items in the env file. However... the api test is a useful and important part of the project. The documentation recommends using it. There have been several recent discord posts about struggles with the test system (including one complaining about this exact issue). Adding a completely optional feature that requires less than 10 lines of code to use and gives a documented and consistent way to improve to the user's test experience doesn't seem unreasonable.

EDIT: I modified the 'alternatives' section in the original proposal to add an example of how this could be placed in the metadata. It requires no imports in env files, requires less code to use, and is possibly less intrusive to an env.

Farama-Foundation / PettingZoo