Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

OSError: [Errno 24] Too many open files #298

Open zichunxx opened 4 weeks ago

zichunxx commented 4 weeks ago

Hi!

I tried to store episodes with EpisodeBuffer and memmap=True to relieve RAM pressure, but ran into this error:

File "/home/xzc/Documents/dreamerv3-torch/test/buffer.py", line 92, in test_max_buffer_szie
    rb.add(episode)
  File "/home/xzc/miniforge3/envs/dreamerv3/lib/python3.9/site-packages/sheeprl/data/buffers.py", line 968, in add
    self._save_episode(self._open_episodes[env])
  File "/home/xzc/miniforge3/envs/dreamerv3/lib/python3.9/site-packages/sheeprl/data/buffers.py", line 1024, in _save_episode
    episode_to_store[k] = MemmapArray(
  File "/home/xzc/miniforge3/envs/dreamerv3/lib/python3.9/site-packages/sheeprl/utils/memmap.py", line 67, in __init__
    self._array = np.memmap(
  File "/home/xzc/miniforge3/envs/dreamerv3/lib/python3.9/site-packages/numpy/core/memmap.py", line 267, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 24] Too many open files
Exception ignored in: <function MemmapArray.__del__ at 0x7fc7cc249700>
Traceback (most recent call last):
  File "/home/xzc/miniforge3/envs/dreamerv3/lib/python3.9/site-packages/sheeprl/utils/memmap.py", line 220, in __del__
    if self._array is not None and self._has_ownership and getrefcount(self._file) <= 2:
  File "/home/xzc/miniforge3/envs/dreamerv3/lib/python3.9/site-packages/sheeprl/utils/memmap.py", line 236, in __getattr__
    raise AttributeError(f"'MemmapArray' object has no attribute '{attr}'")
AttributeError: 'MemmapArray' object has no attribute '_array'
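For context, errno 24 means the process has hit its per-process limit on open file descriptors; each memory-mapped file typically keeps a descriptor open (and the MemmapArray also holds its file object, as the __del__ in the traceback suggests), so keeping many memmapped episodes alive can plausibly exhaust that limit. A minimal sketch, assuming Linux and the standard library resource module, for inspecting and tentatively raising the soft limit:

import resource

# Current per-process limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Raise the soft limit up to the hard limit (the hard limit itself
# can only be raised with elevated privileges).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))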

The traceback above comes from the following minimal code snippet, which reproduces the error:

import gymnasium as gym
import numpy as np
from gymnasium.experimental.wrappers import PixelObservationV0
from sheeprl.data.buffers import EpisodeBuffer

buf_size = 1000000
sl = 5
n_envs = 1
obs_keys = ("observation",)
rb = EpisodeBuffer(
    buf_size,
    sl,
    n_envs=n_envs,
    obs_keys=obs_keys,
    memmap=True,
    memmap_dir="",  # must be set to a valid directory on disk
)
env = PixelObservationV0(gym.make("Walker2d-v4", render_mode="rgb_array", width=100, height=100), pixels_only=True)
keys = ("observation", "reward", "terminated", "truncated")
episode = {k: [] for k in keys}
steps = 0
obs, info = env.reset()
image_shape = obs.shape
while True:
    if steps % 1000 == 0:
        print("current steps: {}".format(steps))
    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
    episode["observation"].append(observation)
    episode["reward"].append(reward)
    episode["terminated"].append(terminated)
    episode["truncated"].append(truncated)

    if terminated or truncated:
        # Stack the episode into (time, n_envs, ...) arrays and store it in the buffer.
        episode_length = len(episode["observation"])
        episode["observation"] = np.array(episode["observation"]).reshape(episode_length, 1, *image_shape)
        episode["reward"] = np.array(episode["reward"]).reshape(episode_length, 1, -1)
        episode["terminated"] = np.array(episode["terminated"]).reshape(episode_length, 1, -1)
        episode["truncated"] = np.array(episode["truncated"]).reshape(episode_length, 1, -1)
        rb.add(episode)
        episode = {k: [] for k in keys}
        env.reset()

    steps += 1

Note that memmap_dir must be set to a valid directory on disk before running.

Could you please tell me what causes this problem?

Many thanks for considering my request.

Update:

This problem seems to be triggered by saving too many episodes to disk (please correct me if I'm wrong).

I tried EpisodeBuffer because the image observations consume almost all of my RAM (64 GB) during training, especially with frame stacking. I want to complete training without upgrading the hardware, so I tried to relieve the RAM pressure with memmap=True but ran into the error above. Any advice on this?
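To help confirm that hypothesis, a rough way to watch the descriptor count grow (assuming Linux, as the paths in the traceback suggest) is to count the entries in /proc/self/fd after each rb.add(episode) in the reproduction script above:

import os

def open_fd_count() -> int:
    # On Linux, /proc/self/fd has one entry per open file descriptor.
    return len(os.listdir("/proc/self/fd"))

# e.g. right after rb.add(episode):
print("open file descriptors:", open_fd_count())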

Thanks in advance.

belerico commented 3 weeks ago

Hi @zichunxx, I will have a look in the next few days after some deadlines. Thank you

belerico commented 3 weeks ago

Have you tried with another buffer, like the standard ReplayBuffer or the SequentialReplayBuffer? Does it give you the same error?
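For reference, a rough sketch of that test, assuming ReplayBuffer accepts roughly the same constructor arguments as the EpisodeBuffer call in the snippet above (buffer size, n_envs, obs_keys, memmap, memmap_dir) and an add method taking (time, n_envs, ...) arrays; please check the actual signatures in sheeprl.data.buffers:

import numpy as np
from sheeprl.data.buffers import ReplayBuffer

# Assumed arguments, mirroring the EpisodeBuffer call above.
rb = ReplayBuffer(
    1_000_000,
    n_envs=1,
    obs_keys=("observation",),
    memmap=True,
    memmap_dir="/tmp/replay_buffer_memmap",  # hypothetical path
)

# One dummy step shaped (time, n_envs, ...), like the episode arrays above.
step = {
    "observation": np.zeros((1, 1, 100, 100, 3), dtype=np.uint8),
    "reward": np.zeros((1, 1, 1), dtype=np.float32),
    "terminated": np.zeros((1, 1, 1), dtype=np.float32),
    "truncated": np.zeros((1, 1, 1), dtype=np.float32),
}
for _ in range(10_000):
    rb.add(step)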

zichunxx commented 3 weeks ago

> Hi @zichunxx, I will have a look in the next few days after some deadlines. Thank you

No problem! I will keep trying to fix it on my side in the meantime.

> Have you tried with another buffer, like the standard ReplayBuffer or the SequentialReplayBuffer? Does it give you the same error?

I have tried with ReplayBuffer and there is no OSError. The above error seems to be triggered by too many .memmap files generated on disk.

belerico commented 2 weeks ago

Hi @zichunxx, I tried yesterday on my machine and reached more than 200k steps without errors: how many steps can you print before the error is raised? PS: I had to stop the experiment because I was running out of space on the hard disk.

zichunxx commented 2 weeks ago

Hi! The above error is triggered within 5000 steps with a buffer size of 4990. Besides, I found that this error only occurs when I run the above program in the system terminal with the conda env activated. If I run it in the VSCode terminal, the error does not appear within 5000 steps, which confuses me a lot.
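One plausible explanation for the terminal difference is that the two shells have different open-file limits (e.g. a lower ulimit -n in the plain system terminal than in the shell VSCode spawns). A quick check to run in both terminals with the same conda env:

import resource

# Compare this output between the system terminal and the VSCode terminal;
# a lower soft limit in one of them would explain why only that one hits errno 24.
print(resource.getrlimit(resource.RLIMIT_NOFILE))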