Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0
300 stars 29 forks

Dreamer v1 and v2 produce error from release 2.0 and later #79

Closed: HiddeLekanne closed this issue 1 year ago

HiddeLekanne commented 1 year ago

The same command (python sheeprl.py dreamer_v1 --env_id dmc_walker_run --checkpoint_every 5000) works on V0.1, but from V0.2 onwards it produces this error:

Killed

Process Worker<AsyncVectorEnv>-3:
Traceback (most recent call last):
  File "/home/hidde/sheeprl/venv/lib/python3.10/site-packages/gymnasium/vector/async_vector_env.py", line 626, in _worker_shared_memory
    command, data = pipe.recv()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hidde/sheeprl/venv/lib/python3.10/site-packages/gymnasium/vector/async_vector_env.py", line 685, in _worker_shared_memory
    pipe.send((None, False))
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

The same EOFError/BrokenPipeError traceback then repeats for each of the 4 workers.

belerico commented 1 year ago

Hi @HiddeLekanne, thank you for reporting this! It seems to be something related to the gymnasium async vector env: could you share which OS you're on and the versions of the packages in your environment? Does this also happen with other algorithms, like dreamer-v3, or with environments other than the dmc ones?

belerico commented 1 year ago

Could it be related to #77?

HiddeLekanne commented 1 year ago

Hey @belerico,

Dreamer V3 works on V0.2. The OS is Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-33-generic x86_64).

More importantly, it works on V0.1 with both EGL and osmesa, so I doubt it's related to #77.

Here is my pip freeze of V0.2.2: absl-py==1.4.0 aiohttp==3.8.5 aiosignal==1.3.1 ale-py==0.8.1 annotated-types==0.5.0 anyio==3.7.1 arrow==1.2.3 async-timeout==4.0.3 attrs==23.1.0 autoflake==2.1.1 AutoROM==0.4.2 AutoROM.accept-rom-license==0.6.1 backoff==2.2.1 beautifulsoup4==4.12.2 black==23.3.0 blessed==1.20.0 box2d-py==2.3.5 cachetools==5.3.1 certifi==2023.7.22 cfgv==3.4.0 charset-normalizer==3.2.0 click==8.1.6 cloudpickle==2.2.1 cmake==3.27.2 contourpy==1.1.0 coverage==7.3.0 croniter==1.4.1 cycler==0.11.0 dateutils==0.6.12 decorator==5.1.1 deepdiff==6.3.1 distlib==0.3.7 dm-control==1.0.14 dm-env==1.6 dm-tree==0.1.8 exceptiongroup==1.1.3 Farama-Notifications==0.0.4 fastapi==0.101.1 filelock==3.12.2 fonttools==4.42.0 frozenlist==1.4.0 fsspec==2023.6.0 glfw==2.6.2 google-auth==2.22.0 google-auth-oauthlib==1.0.0 grpcio==1.57.0 gymnasium==0.29.0 h11==0.14.0 identify==2.5.26 idna==3.4 imageio==2.31.1 imageio-ffmpeg==0.4.8 importlib-resources==6.0.1 iniconfig==2.0.0 inquirer==3.1.3 isort==5.12.0 itsdangerous==2.1.2 Jinja2==3.1.2 kiwisolver==1.4.4 labmaze==1.0.6 lightning==2.0.7 lightning-cloud==0.5.37 lightning-utilities==0.8.0 lit==16.0.6 lxml==4.9.3 lz4==4.3.2 Markdown==3.4.4 markdown-it-py==3.0.0 MarkupSafe==2.1.3 matplotlib==3.7.2 mdurl==0.1.2 moviepy @ git+https://github.com/Zulko/moviepy.git@bc8d1a831d2d1f61abfdf1779e8df95d523947a5 mpmath==1.3.0 mujoco==2.3.7 multidict==6.0.4 mypy==1.2.0 mypy-extensions==1.0.0 networkx==3.1 nodeenv==1.8.0 numpy==1.25.2 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 oauthlib==3.2.2 opencv-python==4.8.0.76 ordered-set==4.1.0 packaging==23.1 pathspec==0.11.2 Pillow==10.0.0 platformdirs==3.10.0 pluggy==1.2.0 pre-commit==2.20.0 proglog==0.1.10 protobuf==4.24.0 psutil==5.9.5 pyasn1==0.5.0 pyasn1-modules==0.3.0 pydantic==2.1.1 pydantic_core==2.4.0 pyflakes==3.1.0 pygame==2.5.1 Pygments==2.16.1 PyJWT==2.8.0 PyOpenGL==3.1.7 pyparsing==3.0.9 pytest==7.3.1 pytest-cov==4.1.0 pytest-cover==3.0.0 pytest-coverage==0.0 pytest-timeout==2.1.0 python-dateutil==2.8.2 python-dotenv==1.0.0 python-editor==1.0.4 python-multipart==0.0.6 pytorch-lightning==2.0.7 pytz==2023.3 PyYAML==6.0.1 readchar==4.0.5 requests==2.31.0 requests-oauthlib==1.3.1 rich==13.5.2 rsa==4.9 ruff==0.0.284 scipy==1.11.1 sheeprl @ file:///home/hidde/sheeprl Shimmy==0.2.1 six==1.16.0 sniffio==1.3.0 soupsieve==2.4.1 starlette==0.27.0 starsessions==1.3.0 swig==4.1.1 sympy==1.12 tensorboard==2.14.0 tensorboard-data-server==0.7.1 tensordict==0.1.2 toml==0.10.2 tomli==2.0.1 torch==2.0.1 torchmetrics==1.0.3 tqdm==4.66.1 traitlets==5.9.0 triton==2.0.0 typing_extensions==4.7.1 urllib3==1.26.16 uvicorn==0.23.2 virtualenv==20.24.3 wcwidth==0.2.6 websocket-client==1.6.1 websockets==11.0.3 Werkzeug==2.3.7 yarl==1.9.2

HiddeLekanne commented 1 year ago

Now I realize that I may have accidentally installed a duplicate version of sheeprl... Going to check for that.

belerico commented 1 year ago

If that doesn't work, could you try to downgrade both mujoco to 2.3.3 and dm_control to 1.0.11 and see what happens?

HiddeLekanne commented 1 year ago

There was a duplicate sheeprl V0.1 installed, but after uninstalling it and deleting the build folder and the venv, it still gives me the same error.

HiddeLekanne commented 1 year ago

After python -m pip install mujoco==2.3.3 dm_control==1.0.11 it still gives me the same error.

belerico commented 1 year ago

What happens if you try to pass the --sync_env=True through the CLI?
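For context, --sync_env=True presumably switches from gymnasium's AsyncVectorEnv, where each environment runs in its own worker process connected over a pipe, to SyncVectorEnv, which steps all environments in the main process (so there are no worker pipes to break). A minimal sketch of the difference, using a placeholder env id rather than the sheeprl dmc wrapper:

```python
import gymnasium as gym

def make_env():
    # Placeholder environment; the actual run builds dmc_walker_run via sheeprl's wrappers.
    return gym.make("CartPole-v1")

# Async: each env lives in a subprocess and communicates over a pipe.
# If the main process dies (e.g. it gets OOM-killed), the workers see
# EOFError / BrokenPipeError like in the traceback above.
async_envs = gym.vector.AsyncVectorEnv([make_env for _ in range(4)])

# Sync: all envs step sequentially in the main process, no subprocesses involved.
sync_envs = gym.vector.SyncVectorEnv([make_env for _ in range(4)])

obs, info = sync_envs.reset(seed=42)
async_envs.close()
sync_envs.close()
```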

HiddeLekanne commented 1 year ago

python sheeprl.py dreamer_v1 --env_id dmc_walker_run --checkpoint_every 5000 --sync_env=True

/home/hidde/sheeprl/venv/lib/python3.10/site-packages/torchmetrics/utilities/imports.py:24: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  _PYTHON_LOWER_3_8 = LooseVersion(_PYTHON_VERSION) < LooseVersion("3.8")
/home/hidde/sheeprl/venv/lib/python3.10/site-packages/torchmetrics/utilities/imports.py:24: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  _PYTHON_LOWER_3_8 = LooseVersion(_PYTHON_VERSION) < LooseVersion("3.8")
/home/hidde/sheeprl/sheeprl/cli.py:23: UserWarning: This script was launched without the Lightning CLI. Consider to launch the script with `lightning run model ...` to scale it with Fabric
  warnings.warn(
INFO: Global seed set to 42
INFO:lightning.fabric.utilities.seed:Global seed set to 42
WARNING: Missing logger folder: logs/dreamer_v1/2023-08-16_22-39-58/dmc_walker_run_default_42_1692218398
WARNING:lightning.fabric.loggers.tensorboard:Missing logger folder: logs/dreamer_v1/2023-08-16_22-39-58/dmc_walker_run_default_42_1692218398
/home/hidde/sheeprl/venv/lib/python3.10/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, "__version__") or LooseVersion(
/home/hidde/sheeprl/venv/lib/python3.10/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  ) < LooseVersion("1.15"):
/home/hidde/sheeprl/venv/lib/python3.10/site-packages/gymnasium/core.py:297: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.num_envs` for environment variables or `env.get_attr('num_envs')` that will search the reminding wrappers.
  logger.warn(
CNN keys: ['rgb']
MLP keys: []
Killed

HiddeLekanne commented 1 year ago

So just the standard warnings, with both versions of mujoco and dm_control, and then it's abruptly killed, without errors or warnings this time.

HiddeLekanne commented 1 year ago

I am running this through the PyTorch remote suite terminal, in case that has some weird interaction with worker creation.

Edit: that doesn't matter either; a plain Unix terminal produces the same error.

belerico commented 1 year ago

Unfortunately, I don't know. Since the problem with the async wrapper seems to be gone, have you tried following the howto/learn_in_dmc guide?

HiddeLekanne commented 1 year ago

@belerico Yes, of course. I have it working on V0.1, with complete runs of Dreamer V1; it is only after the update that it produces that crash.

HiddeLekanne commented 1 year ago

I did a performance check: it seems to be RAM, which fills up completely. Is it correct that 64GB is not enough anymore?
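For a rough sense of why 64GB can fill up: the replay buffer keeps raw image observations in RAM, so its footprint grows linearly with the buffer size. A back-of-the-envelope sketch; the buffer size, resolution, and env count below are illustrative assumptions, not the actual values of this run:

```python
# Rough replay-buffer footprint for image observations (all numbers are assumptions).
buffer_size = 5_000_000   # transitions kept in the buffer
num_envs = 4              # parallel environments
h, w, c = 64, 64, 3       # uint8 RGB observation resolution

bytes_per_obs = h * w * c                          # 12,288 bytes per frame
total_bytes = buffer_size * num_envs * bytes_per_obs
print(f"~{total_bytes / 1024**3:.0f} GiB just for observations")  # ~229 GiB with these numbers
```

Even with far smaller settings, once the buffer plus the model, CUDA context, and env workers are added up, an in-RAM buffer can exhaust 64GB.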

belerico commented 1 year ago

Yeah, you're right! Then I'll have to check it myself, but unfortunately I'm on vacation right now: I'll look into it as soon as I get home. If you find anything new, please keep us updated! Thank you

belerico commented 1 year ago

I did a performance check: it seems to be RAM, which fills up completely. Is it correct that 64GB is not enough anymore?

Can you also add --memmap_buffer=True?

HiddeLekanne commented 1 year ago

The memmap_buffer works, at least it starts. I will keep it running overnight to see if it is stable.

belerico commented 1 year ago

Good! The problem is related to the replay buffer, which has to store a lot of images, as many as the --buffer_size=N parameter specifies. With --memmap_buffer the buffer is memory-mapped to a file on disk, so it takes up disk space (up to a certain size) instead of RAM. You should find the mapped buffer in the checkpoint folder. Let us know if you have any news.
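For illustration, the memory-mapping mechanism (not sheeprl's actual buffer implementation) looks roughly like this with numpy: the array contents live in a file on disk and pages are loaded on access, so resident RAM stays bounded. The file name is just a placeholder:

```python
import numpy as np

# A disk-backed array: the data is stored in the file and paged in on demand,
# so the process's resident memory stays far below the array's nominal size.
obs_buffer = np.memmap("replay_obs.dat", dtype=np.uint8, mode="w+",
                       shape=(100_000, 64, 64, 3))

obs_buffer[0] = 255   # writes go through the OS page cache to the file
obs_buffer.flush()    # ensure the dirty pages are written out
```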

HiddeLekanne commented 1 year ago

Welp, of course the GPU memory usage also increased, or at least it seems more fragmented now for some reason. I had to set export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 for a graphics card with 12.5GB of VRAM...
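A small diagnostic sketch (assuming a CUDA-enabled PyTorch install) that can help tell fragmentation apart from genuinely higher usage: a large gap between reserved and allocated memory usually points to fragmentation in the caching allocator, which is what max_split_size_mb mitigates:

```python
import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**2   # MiB actually held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**2     # MiB held by the caching allocator
    print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")
    # Detailed per-pool breakdown, including inactive split blocks (a fragmentation signal):
    print(torch.cuda.memory_summary())
```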