[Bug]: EOFError after running for a while

xutong05 commented 3 months ago

🐛 Bug

Hello,

I am trying to train an off-road navigation task with stable-baselines3, using SubprocVecEnv. However, after training for a while, I always encounter an EOFError raised from remote.recv(). I have run check_env() on my custom RL environment and it passes.

Could you help me look into this bug? It would be of great help to me as I have been troubled by this issue for a long time.

To Reproduce

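from typing import Callable

import gymnasium as gym
import torch as th
from gymnasium.utils.env_checker import check_env
from sb3_plus import MultiOutputPPO, MultiOutputEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import SubprocVecEnv

# off_road_art is the custom environment class; its import is not shown here.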

def make_env(stage: int, rank: int = 0, seed: int = 0) -> Callable:
    """
    Utility function for multiprocessed env.

    :param stage: (int) the terrain stage
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for RNG
    :return: (Callable)
    """
    def _init() -> gym.Env:
        try:
            env = off_road_art()
            env.update_terrain_stage(stage)
            env.set_nice_vehicle_mesh()
            env = MultiOutputEnv(env)
            env.reset(seed=seed + rank)
            return env
        except Exception as e:
            print(f"Failed to initialize environment in subprocess {rank} with seed {seed}: {str(e)}")
            raise e

    set_random_seed(seed)
    return _init

if __name__ == '__main__':
    device = th.device("cuda" if th.cuda.is_available() else "cpu")
    env_single = off_road_art()

    try:
        check_env(env_single, skip_render_check=True)
        print("Environment is valid.")
    except Exception as e:
        raise RuntimeError(f"Environment check failed: {e}") from e

    # Vectorized environment (stage and num_procs are set elsewhere in the script)
    env = make_vec_env(env_id=make_env(stage), n_envs=num_procs, vec_env_cls=SubprocVecEnv)
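
Note that the try/except in `_init` above only covers environment construction, while the failure reported here happens later, during `step()`. One way to capture a step-time Python exception from inside a subprocess is to log it to a per-process file before re-raising. The sketch below is a dependency-free illustration: `StepCrashLogger` and `FlakyEnv` are hypothetical names, not part of stable-baselines3 or gymnasium, and in a real setup the wrapper would probably need to subclass `gymnasium.Wrapper` to be accepted by the vectorized env. It would not catch a crash in native code or a process killed by the OS.

```python
import os
import tempfile
import traceback


class StepCrashLogger:
    """Hypothetical helper (not an SB3 or gymnasium class).

    Wraps any object exposing step(action) and appends the traceback of a
    failing step to a per-process log file before re-raising.
    """

    def __init__(self, env, log_dir):
        self.env = env
        self.log_path = os.path.join(log_dir, f"env_crash_{os.getpid()}.log")

    def step(self, action):
        try:
            return self.env.step(action)
        except Exception:
            with open(self.log_path, "a") as f:
                f.write(f"step() failed for action={action!r}\n")
                f.write(traceback.format_exc())
            raise  # keep the original failure visible to the caller

    def __getattr__(self, name):
        # Forward everything else (reset, close, spaces, ...) to the wrapped env.
        if name == "env":  # avoid infinite recursion if env was never set
            raise AttributeError(name)
        return getattr(self.env, name)


class FlakyEnv:
    """Toy stand-in for the real environment: fails on the third step."""

    def __init__(self):
        self.n = 0

    def step(self, action):
        self.n += 1
        if self.n == 3:
            raise ValueError("simulated physics failure")
        return self.n


def demo():
    """Run the toy env until it fails and return the logged text."""
    env = StepCrashLogger(FlakyEnv(), tempfile.mkdtemp())
    try:
        for _ in range(5):
            env.step(0)
    except ValueError:
        pass
    with open(env.log_path) as f:
        return f.read()


if __name__ == "__main__":
    print(demo())
```

If the log file stays empty while the EOFError still occurs, that would suggest the worker is dying outside of Python exception handling.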

Relevant log output / Error message

Traceback (most recent call last):
  File "/home/tong/Documents/gym_chrono/train/off_road_art_train.py", line 216, in <module>
    model.learn(total_timesteps, progress_bar=True, callback=TensorboardCallback())
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/sb3_plus/mimo/ppo.py", line 323, in learn
    super().learn(
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 300, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/sb3_plus/mimo/on_policy_algorithm.py", line 187, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/vec_env/base_vec_env.py", line 206, in step
    return self.step_wait()
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 129, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 129, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/tong/anaconda3/envs/chrono/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
EOFError
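
The traceback ends in `multiprocessing.connection`, where `recv()` raises EOFError when nothing is left to receive and the other end of the pipe has been closed. In practice this usually means the worker process on the other side has exited. The following standard-library sketch reproduces that pattern outside of stable-baselines3. The worker loop, the command strings, and the function names are hypothetical and much simpler than the real SubprocVecEnv worker, and the "fork" start method is assumed for brevity (POSIX only).

```python
import multiprocessing as mp
import os


def _worker(remote, crash_at):
    """Toy worker loop: answers "step" commands, then dies abruptly."""
    steps = 0
    while True:
        cmd = remote.recv()
        if cmd == "close":
            break
        steps += 1
        if steps == crash_at:
            os._exit(1)  # hard exit without a Python traceback
        remote.send(steps)


def demo_worker_crash(crash_at=3, n_steps=5):
    """Return (completed_steps, worker_exitcode, saw_eof_error)."""
    ctx = mp.get_context("fork")  # assumption: POSIX system
    parent_end, child_end = ctx.Pipe()
    proc = ctx.Process(target=_worker, args=(child_end, crash_at), daemon=True)
    proc.start()
    # The parent drops its copy of the child end, so that a dead worker
    # results in EOF on parent_end instead of a recv() that blocks forever.
    child_end.close()
    completed = 0
    try:
        for _ in range(n_steps):
            parent_end.send("step")
            parent_end.recv()  # raises EOFError once the worker is gone
            completed += 1
        parent_end.send("close")
        proc.join()
        return completed, proc.exitcode, False
    except EOFError:
        proc.join()
        return completed, proc.exitcode, True


if __name__ == "__main__":
    print(demo_worker_crash())
```

The EOFError itself carries no information about why the worker stopped; the worker's exit code is often the only clue available to the parent. For `multiprocessing.Process`, a negative exit code -N means the child was terminated by signal N.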

System Info

OS: Linux-5.15.0-113-generic-x86_64-with-glibc2.31 # 123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024
Python: 3.9.19
Stable-Baselines3: 2.3.2
PyTorch: 2.3.1+cu118
GPU Enabled: True
Numpy: 1.24.0
Cloudpickle: 3.0.0
Gymnasium: 0.29.1

Checklist

[X] My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
[X] I have checked that there is no similar issue in the repo
[X] I have read the documentation
[X] I have provided a minimal and working example to reproduce the bug
[X] I've used the markdown code blocks for both code and stack traces.

araffin commented 2 months ago

If code there is, it is minimal and working

Closing because the minimum requirements for seeking help are not met.

This also looks like tech support, which we don't do.

NIvo172 commented 1 month ago

@xutong05 I had the same issue. After updating the microcode of my Intel CPU, I could train without this error occurring. Maybe it will fix your problem too. Link to Intel statement.

DLR-RM / stable-baselines3