I am trying to train an off-road navigation task using stable-baselines3 with SubprocVecEnv. However, after training for a while, I always encounter an EOFError raised from remote.recv(). I have run check_env() on my custom RL environment, and it passes.
Could you help me look into this bug? It would be of great help to me as I have been troubled by this issue for a long time.
To Reproduce
from typing import Callable

import gymnasium as gym
import torch as th
from gymnasium.utils.env_checker import check_env
from sb3_plus import MultiOutputPPO, MultiOutputEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import SubprocVecEnv

# off_road_art (the custom environment), stage, and num_procs are defined
# elsewhere in the full training script.


def make_env(stage: int, rank: int = 0, seed: int = 0) -> Callable:
    """
    Utility function for multiprocessed env.
    :param stage: (int) the terrain stage
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for RNG
    :return: (Callable)
    """
    def _init() -> gym.Env:
        try:
            env = off_road_art()
            env.update_terrain_stage(stage)
            env.set_nice_vehicle_mesh()
            env = MultiOutputEnv(env)
            env.reset(seed=seed + rank)
            return env
        except Exception as e:
            print(f"Failed to initialize environment in subprocess {rank} with seed {seed}: {str(e)}")
            raise e
    set_random_seed(seed)
    return _init


if __name__ == '__main__':
    device = th.device("cuda" if th.cuda.is_available() else "cpu")
    env_single = off_road_art()
    try:
        check_env(env_single, skip_render_check=True)
        # Report success explicitly; `assert True, "..."` never shows its message.
        print("Environment is valid.")
    except Exception as e:
        assert False, f"Environment check failed: {str(e)}"
    # Vectorized environment
    env = make_vec_env(env_id=make_env(stage), n_envs=num_procs, vec_env_cls=SubprocVecEnv)
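One possible way to get more information out of a worker that dies without raising a Python exception is to enable the standard-library faulthandler inside each subprocess. This is only a sketch under assumptions: with_faulthandler is a hypothetical helper (not part of stable-baselines3 or sb3_plus), and it helps only if the worker dies from a fatal signal such as a segfault in native code, which has not been confirmed for this report.

```python
import faulthandler
import sys
from typing import Any, Callable


def with_faulthandler(env_fn: Callable[[], Any]) -> Callable[[], Any]:
    """Hypothetical helper: wrap an env factory so that the process which
    builds the env dumps a Python traceback to stderr on a fatal signal."""

    def _init() -> Any:
        # Executed in the worker process when the factory is called there.
        # On SIGSEGV, SIGABRT, SIGBUS, SIGFPE or SIGILL, Python writes the
        # traceback of every thread to stderr before the process terminates.
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return env_fn()

    return _init
```

Assuming the factory from the snippet above, it could be passed as make_vec_env(env_id=with_faulthandler(make_env(stage)), ...), so that any dump appears in the terminal next to the EOFError from the main process.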
Relevant log output / Error message
Traceback (most recent call last):
  File "/home/tong/Documents/gym_chrono/train/off_road_art_train.py", line 216, in <module>
    model.learn(total_timesteps, progress_bar=True, callback=TensorboardCallback())
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/sb3_plus/mimo/ppo.py", line 323, in learn
    super().learn(
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 300, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/sb3_plus/mimo/on_policy_algorithm.py", line 187, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 206, in step
    return self.step_wait()
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 129, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 129, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
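For context on what this traceback means: in Python's multiprocessing module, Connection.recv() raises EOFError when nothing is left to read and the other end of the pipe has been closed. In the traceback above, that would suggest the worker end of the pipe went away during step(), for example because the subprocess exited. The sketch below reproduces only that pipe behaviour in a single process; recv_after_peer_closed is a hypothetical name, and closing the connection by hand stands in for a worker dying.

```python
import multiprocessing as mp


def recv_after_peer_closed():
    """Return the buffered message and whether the next recv() hit EOFError."""
    parent_conn, child_conn = mp.Pipe()
    child_conn.send("obs")  # one message is sent before the peer goes away
    child_conn.close()      # stands in for the worker process terminating
    first = parent_conn.recv()  # the already-sent message is still delivered
    try:
        parent_conn.recv()      # nothing left to read and the peer is closed
    except EOFError:
        return first, True
    return first, False
```

Under that reading, the EOFError in the main process is a symptom rather than the cause, and the underlying failure has to be looked for in the worker's own output or in system logs.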
System Info
OS: Linux-5.15.0-113-generic-x86_64-with-glibc2.31 # 123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024
Python: 3.9.19
Stable-Baselines3: 2.3.2
PyTorch: 2.3.1+cu118
GPU Enabled: True
Numpy: 1.24.0
Cloudpickle: 3.0.0
Gymnasium: 0.29.1
Checklist
[X] My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
[X] I have checked that there is no similar issue in the repo
@xutong05 I had the same issue. After updating the microcode of my Intel CPU, I could train without this error occurring. Maybe it will fix your problem too. Link to Intel statement.