cyanrain7 / TRPO-in-MARL

MIT License
186 stars 49 forks

muti_env_error #4

Closed wanghui589 closed 2 years ago

wanghui589 commented 2 years ago

When I run train_mujoco.sh, the following error is raised:

NotImplementedError
Traceback (most recent call last):
  File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 163, in <module>
    main(sys.argv[1:])
  File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 136, in main
    envs = make_train_env(all_args)
  File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 37, in make_train_env
    return ShareSubprocVecEnv([get_env_fn(i) for i in range(all_args.n_rollout_threads)])
  File "/home/spaci/RL/TRPO-in-MARL-master/scripts/../envs/env_wrappers.py", line 360, in __init__
    self.n_agents = self.remotes[0].recv()
  File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Can anyone help me? Thanks.

cyanrain7 commented 2 years ago

I've run into the same error. It may be caused by setting the number of rollout threads or training threads too high, by your PC not being able to afford the current number of processes, or by many zombie processes on the system. You can restart your HPC or PC and try again. Hope this suggestion helps!

wanghui589 commented 2 years ago

Thank you for your answer. I have modified the relevant parameters, but the error still occurs. Can you help me? The parameters are as follows:

parser.add_argument("--n_training_threads", type=int, default=1, help="Number of torch threads for training")
parser.add_argument("--n_rollout_threads", type=int, default=2, help="Number of parallel envs for training rollouts")

wanghui589 commented 2 years ago

I'm learning from your awesome work, but I'm having some trouble. Can you help me? When I use ShareSubprocVecEnv, the error above appears. When I switch to ShareDummyVecEnv, a different error appears: AttributeError: 'ShareDummyVecEnv' object has no attribute 'n_agents'.
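For anyone reading along: a ConnectionResetError raised from recv() in the parent is often only a symptom. It usually means the worker process on the other end of the pipe exited before it replied, so the real cause may be the exception printed by the worker (possibly the NotImplementedError shown before the traceback). The sketch below is not code from this repo. It is a minimal reproduction under stated assumptions: a Linux machine with the "fork" start method, a hypothetical worker named failing_worker, and a hypothetical command tuple.

```python
import multiprocessing as mp


def failing_worker(conn):
    """Hypothetical worker that crashes before replying to the parent."""
    raise NotImplementedError("simulated failure inside the worker")


def probe_worker():
    """Start the worker, ask it for a value, and report what the parent sees."""
    ctx = mp.get_context("fork")  # assumption: Linux, where fork is available
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=failing_worker, args=(child_conn,))
    proc.start()
    child_conn.close()  # the parent keeps only its own end of the pipe
    try:
        parent_conn.send(("get_num_agents", None))  # hypothetical command
        parent_conn.recv()  # the worker never answers
        seen = "reply"
    except (EOFError, ConnectionError) as exc:
        # Which exception appears depends on timing, but none of them
        # says why the worker died; the worker's own traceback does.
        seen = type(exc).__name__
    proc.join()
    return seen, proc.exitcode


if __name__ == "__main__":
    print(probe_worker())
```

The worker's own traceback goes to stderr and the parent only sees a broken connection plus a non-zero exit code, so it is worth scrolling up in the console output for the first error printed by a subprocess.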

cyanrain7 commented 2 years ago

You also need to modify the relevant parameters in train_mujoco.sh.
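To illustrate why this matters: with argparse, editing default= only has an effect when the flag is not passed on the command line. If a launch script passes the flag explicitly, that value wins. A minimal sketch, reusing the two arguments quoted in this thread; the value 32 is a made-up example and not necessarily what train_mujoco.sh contains.

```python
import argparse


def build_parser():
    """Parser with the two arguments quoted in this thread."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_training_threads", type=int, default=1,
                        help="Number of torch threads for training")
    parser.add_argument("--n_rollout_threads", type=int, default=2,
                        help="Number of parallel envs for training rollouts")
    return parser


# No flags given: the edited defaults are used.
defaults_only = build_parser().parse_args([])

# A flag given on the command line (hypothetical value) overrides the default.
from_script = build_parser().parse_args(["--n_rollout_threads", "32"])

print(defaults_only.n_rollout_threads, from_script.n_rollout_threads)
```

So if the shell script sets --n_rollout_threads or --n_training_threads itself, those are the values to lower.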