DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

Multiprocessing #184

Closed · learningxiaobai closed this issue 2 years ago

learningxiaobai commented 2 years ago

Is there any instruction for using multiprocessing in the RL Baselines3 Zoo, e.g. some `python train.py ...` option? I can't find it in train.py, so do I have to call the underlying libraries myself for training? Something like:

```python
from stable_baselines3.common.vec_env import SubprocVecEnv

env_id = "CartPole-v1"
num_cpu = 4  # Number of processes to use

# Create the vectorized environment (make_env(env_id, rank) is the usual helper
# returning a function that creates one env instance)
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])
```

Thanks

araffin commented 2 years ago

Hello, if you look at the yaml files, you will find an `n_envs` hyperparameter, which can also be changed on the fly:

`python train.py --algo ppo --env CartPole-v1 -params n_envs:4`

and the type of VecEnv is controlled via the `--vec-env` argument.
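
For reference, a minimal sketch (not the zoo's actual code; the hyperparameters and timestep count are illustrative) of what `n_envs: 4` combined with `--vec-env subproc` corresponds to in plain Stable Baselines3:

```python
# Rough equivalent of "--algo ppo --env CartPole-v1 -params n_envs:4 --vec-env subproc";
# assumes stable-baselines3 is installed, values below are illustrative only.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":  # guard required for subprocess-based envs on Windows/macOS
    # 4 copies of CartPole-v1, each running in its own process
    vec_env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=10_000)
```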

learningxiaobai commented 2 years ago

> Hello, if you look at the yaml files, you will find an `n_envs` hyperparameter, which can also be changed on the fly:
>
> `python train.py --algo ppo --env CartPole-v1 -params n_envs:4`
>
> and the type of VecEnv is controlled via the `--vec-env` argument.

Thanks a lot.

learningxiaobai commented 2 years ago

Hello, when I run `python train.py --algo tqc --env PandaReach-v1 -params n_envs:4` I get an error: `ValueError: Error: the model does not support multiple envs; it requires a single vectorized environment.` Does this model not support multiple environments, or is there some other problem? Thanks a lot.

Miffyli commented 2 years ago

@learningxiaobai

As the error states, the algorithm does not support multiple environments (off-policy methods like DQN, DDPG, TD3, SAC and TQC only work with one env). Close the issue if this answered the question.
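
For illustration, a minimal sketch of this constraint as it stood at the time (the environment id and timestep count are placeholders): off-policy algorithms expect a single environment, i.e. a vectorized env of size one.

```python
# Sketch of the single-env constraint for off-policy algorithms (DQN/DDPG/TD3/SAC/TQC)
# in the SB3 version discussed here; env id and timesteps are illustrative.
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

single_env = make_vec_env("Pendulum-v1", n_envs=1)  # n_envs > 1 raises the ValueError above
model = SAC("MlpPolicy", single_env, verbose=1)
model.learn(total_timesteps=5_000)
```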

araffin commented 2 years ago

> does not support multiple environments (off-policy methods like DQN, DDPG, TD3, SAC and TQC only work with one env).

Not yet: https://github.com/DLR-RM/stable-baselines3/issues/179

(HER replay buffer support will take additional time)
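
For context, a minimal sketch of the single-env HER setup that was supported at the time (assumes `panda-gym` is installed for `PandaReach-v1`; the buffer kwargs mirror the zoo's defaults but are only illustrative here):

```python
# Single-env TQC + HER replay buffer, the configuration supported at the time.
# Requires sb3-contrib and panda-gym; values below are illustrative.
import gym
import panda_gym  # noqa: F401  (registers PandaReach-v1)
from sb3_contrib import TQC
from stable_baselines3 import HerReplayBuffer

env = gym.make("PandaReach-v1")
model = TQC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
        online_sampling=True,
    ),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```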

learningxiaobai commented 2 years ago

> does not support multiple environments (off-policy methods like DQN, DDPG, TD3, SAC and TQC only work with one env).
>
> Not yet: DLR-RM/stable-baselines3#179
>
> (HER replay buffer support will take additional time)

Hello, even when I don't use the HER replay buffer it doesn't work as expected, e.g. with `python train.py --algo ddpg --env HalfCheetahBulletEnv-v0 -params n_envs:4` I still get the same error. What is the problem?

araffin commented 2 years ago

> I still get the same error. What is the problem?

Are you using the experimental branch from https://github.com/DLR-RM/stable-baselines3/pull/439? If not, that's expected: as @Miffyli said, off-policy algorithms on master do not support multi-env training (but this will change once the PR is merged).

araffin commented 2 years ago

I added experimental support for multi-env HER in https://github.com/DLR-RM/stable-baselines3/pull/654

learningxiaobai commented 2 years ago

> I added experimental support for multi-env HER in DLR-RM/stable-baselines3#654

Great work!

learningxiaobai commented 2 years ago

> I added experimental support for multi-env HER in DLR-RM/stable-baselines3#654

Hello, I am using the feat/multienv-her branch, but when I run `python train.py --algo tqc --env PandaReach-v1 -params n_envs:4` I still get the error: `ValueError: Error: the model does not support multiple envs; it requires a single vectorized environment.`

araffin commented 2 years ago

You need to use the matching branch of sb3-contrib as well: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/50

learningxiaobai commented 2 years ago

Hello, I get an error. Is there something wrong with HER? Thanks.

```
(rl-baselines3-zoo-master) C:\codes\rl\rl-baselines3-zoo-master>python train.py --algo tqc --env PandaPush-v1 -params n_envs:3
========== PandaPush-v1 ==========
Seed: 1088921540
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 2048),
             ('buffer_size', 1000000),
             ('env_wrapper', 'sb3_contrib.common.wrappers.TimeFeatureWrapper'),
             ('gamma', 0.95),
             ('learning_rate', 0.001),
             ('n_envs', 3),
             ('n_timesteps', 1000000.0),
             ('policy', 'MultiInputPolicy'),
             ('policy_kwargs', 'dict(net_arch=[512, 512, 512], n_critics=2)'),
             ('replay_buffer_class', 'HerReplayBuffer'),
             ('replay_buffer_kwargs', "dict( online_sampling=True, goal_selection_strategy='future', n_sampled_goal=4, )"),
             ('tau', 0.05)])
Using 3 environments
Creating test environment
pybullet build time: Nov  2 2021 15:42:29
argv[0]=
C:\ProgramData\Anaconda3\envs\rl-baselines3-zoo-master\lib\site-packages\gym\logger.py:34: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize("%s: %s" % ("WARN", msg % args), "yellow"))
argv[0]=
argv[0]=
argv[0]=
Using cpu device
Log path: logs/tqc/PandaPush-v1_2
Traceback (most recent call last):
  File "train.py", line 195, in <module>
    exp_manager.learn(model)
  File "C:\codes\rl\rl-baselines3-zoo-master\utils\exp_manager.py", line 202, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\rl-baselines3-zoo-master\lib\site-packages\sb3_contrib\tqc\tqc.py", line 299, in learn
    reset_num_timesteps=reset_num_timesteps,
  File "C:\ProgramData\Anaconda3\envs\rl-baselines3-zoo-master\lib\site-packages\stable_baselines3\common\off_policy_algorithm.py", line 375, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "C:\ProgramData\Anaconda3\envs\rl-baselines3-zoo-master\lib\site-packages\sb3_contrib\tqc\tqc.py", line 194, in train
    replay_data = self.replay_buffer.sample(batch_size, env=self._vec_normalize_env)
  File "C:\ProgramData\Anaconda3\envs\rl-baselines3-zoo-master\lib\site-packages\stable_baselines3\her\her_replay_buffer.py", line 652, in sample
    samples.append(self.buffers[i].sample(int(batch_sizes[i]), env))
  File "C:\ProgramData\Anaconda3\envs\rl-baselines3-zoo-master\lib\site-packages\stable_baselines3\her\her_replay_buffer.py", line 212, in sample
    return self._sample_transitions(batch_size, maybe_vec_env=env, online_sampling=True)  # pytype: disable=bad-return-type
  File "C:\ProgramData\Anaconda3\envs\rl-baselines3-zoo-master\lib\site-packages\stable_baselines3\her\her_replay_buffer.py", line 295, in _sample_transitions
    episode_indices = np.random.randint(0, self.n_episodes_stored, batch_size)
  File "mtrand.pyx", line 746, in numpy.random.mtrand.RandomState.randint
  File "_bounded_integers.pyx", line 1338, in numpy.random._bounded_integers._rand_int32
ValueError: high <= 0
```

qgallouedec commented 2 years ago

I don't think this is related to multiprocessing; I suggest you open a new issue. The model is trying to sample transitions before the first episode has been stored. What is the value of `learning_starts`? Did you change the environment in any way?
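
For example (the value here is only illustrative; it should be large enough that at least one full episode is stored per environment before training begins), `learning_starts` can be overridden on the fly the same way as `n_envs`: `python train.py --algo tqc --env PandaPush-v1 -params n_envs:3 learning_starts:1000`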

learningxiaobai commented 2 years ago

> I don't think this is related to multiprocessing; I suggest you open a new issue. The model is trying to sample transitions before the first episode has been stored. What is the value of `learning_starts`? Did you change the environment in any way?

Nothing changed, I just used the default settings.

learningxiaobai commented 2 years ago

Adding `learning_starts` for PandaPush works, thanks @qgallouedec