Question regarding SubprocVectorEnv failure

liuzuxin commented 1 year ago

Hi, when I try to use the evaluation script on a headless machine (cloud server) with A10G GPU, I occasionally come across the following error:

Process Process-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 222, in _worker
    env = env_fn_wrapper.data()
  File "peft/evaluate.py", line 35, in <lambda>
    [lambda: OffScreenRenderEnv(**env_args) for _ in range(env_num)])
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/env_wrapper.py", line 161, in __init__
    super().__init__(**kwargs)
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/env_wrapper.py", line 56, in __init__
    self.env = TASK_MAPPING[self.problem_name](
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/problems/libero_tabletop_manipulation.py", line 40, in __init__
    super().__init__(bddl_file_name, *args, **kwargs)
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/bddl_base_domain.py", line 135, in __init__
    super().__init__(
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/manipulation/manipulation_env.py", line 162, in __init__
    super().__init__(
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/robot_env.py", line 214, in __init__
    super().__init__(
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/base.py", line 143, in __init__
    self._reset_internal()
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/bddl_base_domain.py", line 735, in _reset_internal
    super()._reset_internal()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/robot_env.py", line 510, in _reset_internal
    super()._reset_internal()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/environments/base.py", line 296, in _reset_internal
    render_context = MjRenderContextOffscreen(self.sim, device_id=self.render_gpu_device_id)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/utils/binding_utils.py", line 210, in __init__
    super().__init__(sim, offscreen=True, device_id=device_id, max_width=max_width, max_height=max_height)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/utils/binding_utils.py", line 78, in __init__
    self.gl_ctx = GLContext(max_width=max_width, max_height=max_height, device_id=self.device_id)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 136, in __init__
    self._context = EGL.eglCreateContext(EGL_DISPLAY, config, EGL.EGL_NO_CONTEXT, None)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/OpenGL/platform/baseplatform.py", line 415, in __call__
    return self( *args, **named )
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/OpenGL/error.py", line 230, in glCheckError
    raise self._errorClass(
OpenGL.raw.EGL._errors.EGLError: EGLError(
        err = EGL_BAD_ALLOC,
        baseOperation = eglCreateContext,
        cArguments = (
                <OpenGL._opaque.EGLDisplay_pointer object at 0x7eff68f41640>,
                <OpenGL._opaque.EGLConfig_pointer object at 0x7eff68f41540>,
                <OpenGL._opaque.EGLContext_pointer object at 0x7eff8a264b40>,
                None,
        ),
        result = <OpenGL._opaque.EGLContext_pointer object at 0x7eff68f41a40>
)
Exception ignored in: <function EGLGLContext.__del__ at 0x7eff8a1461f0>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 155, in __del__
    self.free()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/renderers/context/egl_context.py", line 146, in free
    if self._context:
AttributeError: 'EGLGLContext' object has no attribute '_context'
Exception ignored in: <function MjRenderContext.__del__ at 0x7eff8a1463a0>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/site-packages/robosuite/utils/binding_utils.py", line 198, in __del__
    self.con.free()
AttributeError: 'MjRenderContextOffscreen' object has no attribute 'con'
Traceback (most recent call last):
  File "lifelong/evaluate.py", line 239, in main
    env.reset()
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 702, in reset
    ret_list = [self.workers[i].recv() for i in id]
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 702, in <listcomp>
    ret_list = [self.workers[i].recv() for i in id]
  File "/home/ubuntu/1_repo/LIBERO/libero/libero/envs/venv.py", line 428, in recv
    result = self.parent_remote.recv()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/ubuntu/anaconda3/envs/libero/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

In the past I came across this error when CUDA memory was insufficient; however, I now hit it even with enough memory, and I have no idea how to solve it. The evaluation script works with DummyVectorEnv, but that is too slow. Have you encountered similar issues? Any hints would be appreciated. Thanks in advance.

HeegerGao commented 1 year ago

Hi @liuzuxin, thanks for asking! My guess is that this problem is related to the CUDA version or something similar. Could you please provide more information about your machine (platform, NVIDIA driver version, CUDA version)? I didn't encounter this problem on Ubuntu 20.04 with an A100, NVIDIA driver 515.105.01, and CUDA 11.7.
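
For anyone else reporting the same error, the details requested above can be gathered with a short script. This is a minimal sketch, not part of LIBERO: the helper name `collect_machine_info` is made up for illustration, and it assumes `nvidia-smi` is on the PATH (it falls back to `None` otherwise). It does not report the CUDA toolkit version.

```python
import platform
import shutil
import subprocess


def collect_machine_info() -> dict:
    """Return platform, Python version, and GPU name/driver strings (hypothetical helper)."""
    info = {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "gpus": None,  # stays None when nvidia-smi is missing or fails
    }
    if shutil.which("nvidia-smi") is None:
        return info
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode == 0:
        # One "name, driver_version" line per GPU.
        info["gpus"] = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return info


if __name__ == "__main__":
    for key, value in collect_machine_info().items():
        print(f"{key}: {value}")
```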

liuzuxin commented 1 year ago

Thanks for your reply. Sure, mine is Ubuntu 20.04 with NVIDIA driver 525.125.06 and CUDA 12.0. I tried downgrading the driver to 470.199.02 and CUDA to 11.4, and SubprocVectorEnv works. The strangest thing is that I had been using the evaluation script successfully with the NVIDIA 525 driver over the past week, but it suddenly broke without any package upgrades. In other words, after running the evaluation script with SubprocVectorEnv successfully, I ran the same command again and it failed. So I am curious what the root cause of this problem is.

Cranial-XIX commented 1 year ago

Hi zuxin,

Thanks for asking. We have also noticed this issue and are investigating it. In the meantime, a quick workaround is to save the model to disk and then start multiple evaluation scripts, each with a single environment. This will definitely increase the GPU memory requirement, but it can make the evaluation faster.
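
The workaround above could be scripted roughly as follows. This is only a sketch under stated assumptions: the script path and the `--task_id` flag are hypothetical placeholders, not LIBERO's actual command-line interface, so substitute whatever arguments your evaluation script really accepts.

```python
import subprocess
import sys
from typing import List, Sequence


def build_commands(
    script: str, task_ids: Sequence[int], extra_args: Sequence[str] = ()
) -> List[List[str]]:
    """Build one command per task; "--task_id" is a hypothetical flag."""
    return [[sys.executable, script, "--task_id", str(t), *extra_args] for t in task_ids]


def run_all(commands: Sequence[Sequence[str]]) -> List[int]:
    """Start every command as its own OS process, then wait for all of them."""
    procs = [subprocess.Popen(list(cmd)) for cmd in commands]
    return [p.wait() for p in procs]  # exit codes, in launch order
```

Each process loads its own copy of the saved model, which is why GPU memory use grows with the number of parallel evaluations.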

MMittenbuehler commented 11 months ago

Hi, is there a solution that does not involve downgrading the NVIDIA driver and CUDA version? I still encounter this problem with driver 525.60.13 and CUDA 12.0. Thanks!

lihenglin commented 10 months ago

I resolved the problem by adding these two lines to venv.py.

if multiprocessing.get_start_method(allow_none=True) != "spawn":
    multiprocessing.set_start_method("spawn", force=True)
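
For context, here is the same fix wrapped in a function, as a minimal sketch. The function name `ensure_spawn_start_method` is made up for illustration and is not part of LIBERO. On Linux, Python 3.8 defaults to the "fork" start method, where each worker inherits the parent process's state. With "spawn" each worker starts in a fresh interpreter. Whether inherited GPU or EGL state is the actual root cause of the error is not confirmed in this thread.

```python
import multiprocessing


def ensure_spawn_start_method() -> str:
    """Make "spawn" the default start method and return the active method.

    Hypothetical helper. Call it before any worker process is created,
    because the setting only affects processes started afterwards.
    """
    if multiprocessing.get_start_method(allow_none=True) != "spawn":
        # force=True overrides a start method that was already chosen.
        multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()
```

With "spawn", everything passed to a worker must be picklable, and the entry-point script needs an `if __name__ == "__main__":` guard because child processes re-import the main module.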

JamesSand commented 8 months ago

> I resolved the problem by adding these two lines to venv.py.
>
> if multiprocessing.get_start_method(allow_none=True) != "spawn":
>     multiprocessing.set_start_method("spawn", force=True)

I encountered the same issue, and this solution works for me. Thank you very much!!!

74284853 commented 4 months ago

> > I resolved the problem by adding these two lines to venv.py.
> >
> > if multiprocessing.get_start_method(allow_none=True) != "spawn":
> >     multiprocessing.set_start_method("spawn", force=True)
>
> I encountered the same issue, and this solution works for me. Thank you very much!!!

May I ask which line of venv.py I should add it to? @lihenglin @JamesSand

Lifelong-Robot-Learning / LIBERO
