Farama-Foundation / Gymnasium

An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
https://gymnasium.farama.org
MIT License

[Bug Report] Getting "Environment [some ID] doesn't exist" when using custom async vector env. #222

Closed sven1977 closed 10 months ago

sven1977 commented 1 year ago

Describe the bug

When running the script below (a custom gymnasium.Env registered with an ID, then async-vectorized), I get the error gymnasium.error.NameNotFound: Environment my_env doesn't exist.

The full stacktrace is:

Process Worker<AsyncVectorEnv>-0:
Traceback (most recent call last):
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/async_vector_env.py", line 618, in _worker_shared_memory
    env = env_fn()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/utils/misc.py", line 29, in __call__
    return self.fn()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/__init__.py", line 51, in _make_env
    env = gym.envs.registration.make(
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/envs/registration.py", line 569, in make
    _check_version_exists(ns, name, version)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/envs/registration.py", line 219, in _check_version_exists
    _check_name_exists(ns, name)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/envs/registration.py", line 197, in _check_name_exists
    raise error.NameNotFound(
gymnasium.error.NameNotFound: Environment my_env doesn't exist. 
Process Worker<AsyncVectorEnv>-1:
Traceback (most recent call last):
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/async_vector_env.py", line 618, in _worker_shared_memory
    env = env_fn()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/utils/misc.py", line 29, in __call__
    return self.fn()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/__init__.py", line 51, in _make_env
    env = gym.envs.registration.make(
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/envs/registration.py", line 569, in make
    _check_version_exists(ns, name, version)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/envs/registration.py", line 219, in _check_version_exists
    _check_name_exists(ns, name)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/envs/registration.py", line 197, in _check_name_exists
    raise error.NameNotFound(
gymnasium.error.NameNotFound: Environment my_env doesn't exist. 
Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1477, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/sven/Library/Application Support/JetBrains/PyCharmCE2020.3/scratches/scratch_215.py", line 25, in <module>
    env = gym.vector.make("my_env", num_envs=2, asynchronous=True)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/__init__.py", line 73, in make
    return AsyncVectorEnv(env_fns) if asynchronous else SyncVectorEnv(env_fns)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/async_vector_env.py", line 168, in __init__
    self._check_spaces()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/async_vector_env.py", line 502, in _check_spaces
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/site-packages/gymnasium/vector/async_vector_env.py", line 502, in <listcomp>
    results, successes = zip(*[pipe.recv() for pipe in self.parent_pipes])
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/Users/sven/opt/anaconda3/envs/ray/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Code example

import gymnasium as gym
import numpy as np

class MyEnv(gym.Env):
    def __init__(self):
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(0, 100, (1,), dtype=np.float32)
        self.i = 0

    def reset(self, *, seed=None, options=None):
        self.i = 0
        return self._get_obs(), {}

    def step(self, action):
        self.i += 1
        return self._get_obs(), 1.0, False, self.i >= 5, {}

    def _get_obs(self):
        return np.array([self.i], dtype=np.float32)

if __name__ == "__main__":
    gym.register("my_env", MyEnv)
    env = gym.vector.make("my_env", num_envs=2, asynchronous=True)

System info

macOS (laptop), Python 3.8.13, gymnasium 0.26.3, gym 0.26.2 (not needed, but installed for Atari)

Additional context

No response

pseudo-rnd-thoughts commented 1 year ago

I copied and pasted your code and don't get the error. Could your code be getting confused between gym and gymnasium?

RedTachyon commented 1 year ago

Ooh, this one is spicy, I can actually reproduce it locally, and I realized that I lowkey had the same issue some months back, but didn't think about its wider implications.

Note: I'm not an expert on Python multiprocessing, so details might be off, but I'm pretty sure this is the general idea of what's happening. You define your environment in the main Python process, and the gym.registry instance gets updated. When you create the async vector env, the process gets spawned/forked (more on this later), which essentially creates a new interpreter and reruns some of the code. It seems that this doesn't include the update to the registry, so in the child process, the new environment doesn't get registered. So even though the main process sees everything, each child process only sees the built-in envs, so they crash.

To potentially make things even spicier (and why @pseudo-rnd-thoughts couldn't replicate it): this might depend on the operating system. I'm also getting the error on macOS, but a quick test on Colab seems to pass without a problem. This might be related to the start methods in multiprocessing. It seems that macOS and Windows use spawn by default, while Linux uses fork. I don't know what happens with forkserver. It's too late for me to dig into it right now, but the start method can be switched via an argument to AsyncVectorEnv (sadly unavailable through the gym.vector.make API), so we can use this to check what works on different systems.

As for the solution, the vector API is undergoing a complete rewrite at the moment, so we'll definitely have to think about what to do. When I came across this issue in the past, I used a super ugly workaround of doing the imports/registration inside the subprocesses. Maybe it would be viable to restrict async envs to a specific start method that behaves well? We'll have to think about it. At the very least the new API should allow directly choosing the start method.
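For reference, the "register inside the subprocess" workaround can be sketched with the standard library alone. This is only an illustration of the idea, not Gymnasium code: REGISTRY is a hypothetical stand-in for gym.registry, register_custom_envs stands in for the gym.register(...) calls, and worker and found_in_child are made-up names. The child reports the lookup result through its exit code.

```python
import multiprocessing as mp
import sys

REGISTRY = {}  # hypothetical stand-in for gym.registry


def register_custom_envs():
    # Stand-in for the gym.register(...) calls; safe to call repeatedly.
    REGISTRY.setdefault("my_env", "spec")


def worker(env_id):
    # Register inside the child first, so the lookup no longer depends on
    # what the child happened to inherit from the parent.
    register_custom_envs()
    sys.exit(0 if env_id in REGISTRY else 1)


def found_in_child(env_id, start_method="spawn"):
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=worker, args=(env_id,))
    proc.start()
    proc.join()
    return proc.exitcode == 0  # exit code 0 means the child found the ID


if __name__ == "__main__":
    # The parent never registers anything; only the child does.
    print(found_in_child("my_env"))
```

Because the registration runs in the child itself, the result should not depend on the start method. In Gymnasium terms this would presumably correspond to an env factory that registers the environment before creating it, which is why it feels like a workaround rather than a fix.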

The temporary workaround could be monkey-patching the gym.vector.make function to manually select the right start method, or accessing AsyncVectorEnv directly.

tl;dr I blame the GIL (quite possibly incorrectly, but shh)

RedTachyon commented 1 year ago

I can't work on the implementation right now, but I wanted to write down my thoughts on how this can be solved.

After some reading, turns out that global variables are properly inherited for fork, but not for spawn (see this random article: https://superfastpython.com/multiprocessing-inherit-global-variables-in-python/)
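To make the fork/spawn difference concrete, here is a minimal sketch that uses only the standard library. REGISTRY is a hypothetical stand-in for gym.registry, and lookup and visible_in_child are illustrative names, not Gymnasium APIs. The child reports through its exit code whether it can see an ID that the parent added at runtime.

```python
import multiprocessing as mp
import sys

# Hypothetical stand-in for gym.registry: a plain module-level dict.
REGISTRY = {"builtin_env": "spec"}


def lookup(env_id):
    # Runs in the child; the exit code says whether the ID is visible there.
    sys.exit(0 if env_id in REGISTRY else 1)


def visible_in_child(start_method, env_id):
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=lookup, args=(env_id,))
    proc.start()
    proc.join()
    return proc.exitcode == 0


if __name__ == "__main__":
    # Runtime registration: only the parent process executes this line.
    REGISTRY["my_env"] = "spec"
    available = mp.get_all_start_methods()
    methods = [m for m in ("fork", "spawn") if m in available]
    # Collect the results first, then print once.
    results = {m: visible_in_child(m, "my_env") for m in methods}
    print(results)
```

Run as a script, I would expect this to report True for fork (the child inherits the parent's memory, including the modified dict) and False for spawn (the child re-imports the module, and the guarded registration never runs there).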

In principle we could restrict it to using fork (and maybe forkserver?), but that feels like a bit of a cop-out - and I know that at least in some cases, it does actually make a difference which one you choose (not just for performance, but also whether your code will even run properly).

The best "robust" option is probably to pass the entire env registry as an argument to the async environment worker, and then use those specifications instead of directly using gym.make(env_id). The question here would be the performance impact which I cannot estimate right now, but e.g. Atari likes to register like a thousand different envs, and all of that needs to be sent between the processes. Fortunately, each env spec should be relatively light-weight, and it's a one-time cost. Then again, it's a bunch of extra memory usage for each process, so we need to profile it.
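A rough standard-library sketch of the "pass the registry to the worker" idea, under the same assumptions as before: REGISTRY is a hypothetical stand-in for gym.registry, and worker and found_with_snapshot are made-up names. The parent ships a copy of its registry as a process argument, and the child looks the ID up in that copy instead of in its own module globals.

```python
import multiprocessing as mp
import sys

REGISTRY = {"builtin_env": "spec"}  # hypothetical stand-in for gym.registry


def worker(env_id, registry_snapshot):
    # Use the snapshot sent by the parent rather than the child's own
    # (possibly freshly re-imported) copy of REGISTRY.
    sys.exit(0 if env_id in registry_snapshot else 1)


def found_with_snapshot(env_id, start_method="spawn"):
    ctx = mp.get_context(start_method)
    # Under spawn, the dict copy is pickled and sent to the child once.
    proc = ctx.Process(target=worker, args=(env_id, dict(REGISTRY)))
    proc.start()
    proc.join()
    return proc.exitcode == 0  # exit code 0 means the child found the ID


if __name__ == "__main__":
    REGISTRY["my_env"] = "spec"  # runtime registration in the parent only
    print(found_with_snapshot("my_env"))
```

The cost mentioned above shows up in the args tuple: the whole snapshot is serialized once per worker at start-up, so a registry with a thousand entries means a thousand specs pickled per process.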

If it does turn out to be a problem, we can also consider using shared memory to distribute a single copy of the registry between different workers. This should hopefully be straight-forward, as long as multiprocessing doesn't do anything weird.

pseudo-rnd-thoughts commented 1 year ago

Could we look to include this in the v0.28 experimental vector implementation?

I think the idea of a shared memory object would be the best way of doing this.

pseudo-rnd-thoughts commented 1 year ago

@sven1977 or @RedTachyon I'm looking to include a fix for this in the next release, but I'm still unable to replicate the issue on my MacBook. I expected the following code to raise the error, but it doesn't:

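import gymnasium as gym
from gymnasium.envs.classic_control import CartPoleEnv


def test_async_with_dynamically_registered_env():
    # Register an env at runtime, i.e. after the module has been imported.
    gym.register("TestEnv-v0", CartPoleEnv)

    gym.make_vec("TestEnv-v0", vectorization_mode="async")

    del gym.registry["TestEnv-v0"]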

gonultasbu commented 11 months ago

I can replicate the issue on Windows 11 with the following error.

gymnasium.error.NameNotFound: Environment `my_env` doesn't exist

EDIT: Passing asynchronous=False solves the issue, as implied by @RedTachyon.

pseudo-rnd-thoughts commented 11 months ago

Hi, we haven't been able to replicate the issue in our CI, which we need in order to fix it. Are you able to produce a small script that replicates it?

gonultasbu commented 11 months ago

The code example provided in the first post of the issue does replicate the issue for my case.

pseudo-rnd-thoughts commented 11 months ago

Strangely, my laptop now raises the error with the original code (which it did not when I first commented). I have made a PR that currently just adds a test, to see whether the CI raises the error as expected, so that we can then experiment with fixes.

RedTachyon commented 10 months ago

Closing in favor of PR #810