DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] VecEnv GPU optimizations #314

Closed · MetcalfeTom closed this issue 3 years ago

MetcalfeTom commented 3 years ago

Question

Are the vector envs in stable-baselines3 GPU-optimizable? I note that models can have their parameters loaded into GPU memory with the device attribute. However, during training, tensors passed between the policy and the env undergo conversions between GPU <-> CPU as well as PyTorch <-> NumPy.

Additional context

For example, in OnPolicyAlgorithm.collect_rollouts():

            with th.no_grad():
                # Convert to pytorch tensor
                obs_tensor = th.as_tensor(self._last_obs).to(self.device)
                actions, values, log_probs = self.policy.forward(obs_tensor)
            actions = actions.cpu().numpy()  # <--

            # Rescale and perform action
            clipped_actions = actions
            # Clip the actions to avoid out of bound error
            if isinstance(self.action_space, gym.spaces.Box):
                clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)

            new_obs, rewards, dones, infos = env.step(clipped_actions)

the actions tensor is moved off the GPU and converted to a NumPy array. It seems that if there were a VecEnv that supported tensors, this step could be forgone and the data could stay on the CUDA device, unless I am misinterpreting something.
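For illustration, here is a hypothetical sketch of what that part of the rollout could look like if the VecEnv consumed and returned CUDA tensors directly. It mirrors the snippet above; `tensor_env` and its tensor-based step() are made up for the sake of the example, not an existing SB3 API:

```python
with th.no_grad():
    # Hypothetical: self._last_obs is already a CUDA tensor returned by the env,
    # so no as_tensor()/.to(device) conversion is needed.
    actions, values, log_probs = self.policy.forward(self._last_obs)

# Clip on-device instead of moving to CPU for np.clip
clipped_actions = actions
if isinstance(self.action_space, gym.spaces.Box):
    low = th.as_tensor(self.action_space.low, device=actions.device)
    high = th.as_tensor(self.action_space.high, device=actions.device)
    clipped_actions = th.clamp(actions, min=low, max=high)  # tensor bounds need PyTorch >= 1.9

# Hypothetical tensor-aware VecEnv: step() takes and returns CUDA tensors
new_obs, rewards, dones, infos = tensor_env.step(clipped_actions)
```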


Miffyli commented 3 years ago

Hey. You are right, this would not be a very big change (the part you pointed out plus some minor-ish modifications to buffers); however, there are some caveats:

1) Supporting both torch and NumPy arrays would require modifications to environment wrappers and their treatment (a bunch of if-else-ing), so the code becomes harder to maintain and understand (a sketch of this branching follows below).
2) There is a danger of in-place modifications, or alternatively a surplus amount of copying.
3) The speed gained from all this is unknown, especially if we have to constantly create tensors on the CUDA side.
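As a minimal sketch of point 1 (my own illustration, not SB3 code), every boundary that today assumes NumPy would need branching along these lines:

```python
import numpy as np
import torch as th


def as_numpy(x):
    # The kind of type-branching that would spread through wrappers and buffers
    # if an env could hand back either a NumPy array or a torch tensor.
    if isinstance(x, th.Tensor):
        return x.detach().cpu().numpy()
    return np.asarray(x)


def as_tensor(x, device):
    if isinstance(x, th.Tensor):
        return x.to(device)
    return th.as_tensor(x, device=device)
```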

If done well, e.g. with shared tensors in CUDA memory between agent and environment, this could lead to significant speedups, but I do not think stable-baselines is the place for such optimizations. Do you know of any significant gain from using this method?

MetcalfeTom commented 3 years ago

Hey @Miffyli, thanks for the detailed response.

  1. Agreed, it could add more complexity. That said, it may be possible to handle a lot of this treatment implicitly with torch (for instance, the Tensor.__array__ method, which NumPy calls e.g. when indexing or converting, turns CPU tensors into NumPy arrays; see the sketch after this list)
  2. In-place modifications now raise RuntimeErrors when computing gradients in torch and can be debugged with anomaly detection
  3. Could you explain a little more about "constantly creating tensors in CUDA side"?
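A small illustration of the Tensor.__array__ point (my own sketch): NumPy can consume CPU tensors transparently and without copying, while CUDA tensors refuse the implicit conversion:

```python
import numpy as np
import torch as th

cpu_actions = th.rand(4, 2)
as_np = np.asarray(cpu_actions)   # goes through Tensor.__array__, no explicit .numpy() call
print(type(as_np))                # <class 'numpy.ndarray'>

as_np[0, 0] = 42.0                # the ndarray shares memory with the tensor
print(cpu_actions[0, 0])          # tensor(42.)

if th.cuda.is_available():
    try:
        np.asarray(cpu_actions.cuda())
    except TypeError as err:
        # CUDA tensors cannot be converted implicitly; .cpu() is required first
        print(err)
```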

I don't know of any significant gain yet, but I will continue to experiment a little more, measuring the performance increase and how few code edits are needed.

Miffyli commented 3 years ago

1) Indeed, thanks to torch's similarity to NumPy, many things would be simpler rather than harder, but there is plenty of stuff in e.g. the VecEnvs that might not play well (hard to tell without trying).
2) This is more about storing data in buffers: e.g. when the environment returns a torch tensor and we put it into a replay/rollout buffer, we might accidentally store just a view of it, so if the original tensor is later modified, the data in the buffer is modified as well (see the sketch below). These bugs are quite hard to catch if they occur, granted one can write tests to guard against them.
3) Related to the above: cloning tensors. Again, if done right we just copy data into already-allocated buffers, but SB3 was not designed with this kind of fine-grained optimization in mind, and there may be many spots that would lead to recreating buffers.
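A minimal illustration of the aliasing pitfall from point 2 (my own sketch, not SB3's actual buffer code):

```python
import torch as th

rollout_buffer = th.zeros(8, 3)   # pretend pre-allocated buffer (8 slots, obs dim 3)
obs = th.rand(3)                  # pretend this tensor came back from a tensor-based env

stored = obs                      # accidental "storage": only a reference, no copy
obs += 1.0                        # env later overwrites its output in-place...
print(stored)                     # ...and the stored data has silently changed too

# Safer variants: copy into already-allocated storage, or clone
rollout_buffer[0].copy_(obs)
snapshot = obs.clone()
obs += 1.0
print(rollout_buffer[0], snapshot)  # both unaffected by the later in-place update
```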

TL;DR: things might work well or they might not. Based on what I know right now, I'd say it would be a messy addition to support, but I could be wrong ^^. If you manage to run some benchmarks on how fast things could be, it would help with the decision, but note that the gain would have to be significant (the primary themes of SB are maintainability and clean code).

MetcalfeTom commented 3 years ago

Thanks for the pointers! I managed to get some benchmarks. I trained the PPO model on a custom VecEnv version of CartPole that vectorizes the step() and reset() methods, which eliminated the loop in VecEnv.step_wait(). Then, as you mentioned, there were some modifications to the rollout buffers to store tensors and compute advantages, as well as numerous conversions from NumPy to torch operations throughout the codebase to support it. I timed the execution of PPO.learn() with a high number of environments across a few different batch sizes (all runs were done on an NVIDIA TITAN RTX):

[Screenshot: PPO.learn() wall-clock timing results]

This is a nice result, but these particular hyperparameters may be uncommon (particularly for CartPole). With a smaller number of parallel environments the optimization is not as profound, but it is still quicker. The average reward of the PPO agent was roughly the same.

The majority of the work is in building the tensor version of the environment; like you mentioned, stable-baselines may not be the place for it. But it seems like it would speed up policy development.
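For anyone curious what "vectorizing step() and reset()" means in practice, here is a rough, heavily simplified sketch of the idea (toy dynamics, not the actual fork): all environments are advanced at once with batched tensor operations on the chosen device, so there is no per-env Python loop like in VecEnv.step_wait().

```python
import torch as th


class BatchedCartPoleSketch:
    """Toy batched env: one step() call advances all N environments with tensor ops."""

    def __init__(self, num_envs: int, device: str = "cpu"):  # pass "cuda" to keep everything on the GPU
        self.device = th.device(device)
        self.num_envs = num_envs
        self.dt, self.force_mag = 0.02, 10.0
        # state columns: x, x_dot, theta, theta_dot
        self.state = th.zeros(num_envs, 4, device=self.device)

    def reset(self) -> th.Tensor:
        self.state.uniform_(-0.05, 0.05)
        return self.state.clone()

    def step(self, actions: th.Tensor):
        # actions: (num_envs,) tensor of 0/1, already on self.device
        x, x_dot, theta, theta_dot = self.state.unbind(dim=1)
        force = (actions.float() * 2.0 - 1.0) * self.force_mag
        # Heavily simplified physics; a faithful port would translate gym's
        # CartPole equations into the same kind of batched tensor expressions.
        theta_dot = theta_dot + self.dt * (9.8 * th.sin(theta) + force * th.cos(theta))
        theta = theta + self.dt * theta_dot
        x_dot = x_dot + self.dt * force
        x = x + self.dt * x_dot
        self.state = th.stack([x, x_dot, theta, theta_dot], dim=1)

        dones = (x.abs() > 2.4) | (theta.abs() > 0.209)
        rewards = th.ones(self.num_envs, device=self.device)
        # Reset finished envs in place, again without a Python loop
        num_done = int(dones.sum())
        if num_done > 0:
            self.state[dones] = th.empty(num_done, 4, device=self.device).uniform_(-0.05, 0.05)
        return self.state.clone(), rewards, dones, [{} for _ in range(self.num_envs)]
```

Used e.g. as env = BatchedCartPoleSketch(4096, device="cuda"), a single step() call advances thousands of environments in parallel on the GPU.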

Miffyli commented 3 years ago

@MetcalfeTom

Thanks for the results! Indeed, they are a bit env-specific (and an odd set of hyperparameters), but they show that you could speed up such training a lot with a GPU. This could benefit training and make it more stable in general. I am not sure it would speed up policy development in general, but it could speed up experiments with libraries like megastep. The main stable-baselines would not be the place for such optimization, though; would you be able to share your code here (e.g. a fork of SB3) for anybody interested in the future?

@andyljones Pinging you as this might interest you.

andyljones commented 3 years ago

Thanks for the ping!

I'm a huge advocate for vectorized envs and they can certainly deliver huge speedups. Trying to crowbar it into stable-baselines though - I reckon it'd be misery, both for the implementer and the maintainer. You write a lot of things differently when you've got 10k envs to play with, and folding that kind of codepath into SB3 - well, to steal a phrase it'd be making an octopus by nailing extra legs onto a dog.

@MetcalfeTom, some other projects you might be interested in:

As an aside, it's worth noting that env acceleration works best with small networks. As your net gets bigger, it comes to dominate the runtime no matter how slow your env is. Frankly, this is largely what killed my interest in megastep. It doesn't scale, and things that don't scale go against the Bitter Lesson.

Come drop by the RL discord if you'd like to chat about this further πŸ™‚

araffin commented 3 years ago

Last but not least, if you are really seeking performance, you should consider implementing the RL algorithm and the environment in a lower-level language, like C++ or Rust. I have done some quick experiments with Rust and the PyTorch bindings (implementing both the RL agent and the env in Rust), and I could get a solid 3-4x speed boost on CPU without any optimization. Also, with such languages you could make use of true multi-threading (without the Python GIL).

MetcalfeTom commented 3 years ago

Wow! Thanks all for the discussion. Seems like I've just uncovered the tip of the iceberg πŸ€“

Heeding the advice, I decided to publish my code to a fork here for any curious future readers. No doubt I will be continuing the discussion in other places, but will close the issue from here for now :+1:

AlessandroZavoli commented 2 years ago

> Wow! Thanks all for the discussion. Seems like I've just uncovered the tip of the iceberg πŸ€“
>
> Heeding the advice, I decided to publish my code to a fork here for any curious future readers. No doubt I will be continuing the discussion in other places, but will close the issue from here for now πŸ‘

@MetcalfeTom I was trying to run your example and maybe change the dynamics to something more challenging, to see if the speedup is still interesting, but I can't find anything about how to install it... 😒

Karlheinzniebuhr commented 1 year ago

> Thanks for the pointers! I managed to get some benchmarks. I trained the PPO model on a custom VecEnv version of CartPole that vectorizes the step() and reset() methods, which eliminated the loop in VecEnv.step_wait(). Then, as you mentioned, there were some modifications to the rollout buffers to store tensors and compute advantages, as well as numerous conversions from NumPy to torch operations throughout the codebase to support it. I timed the execution of PPO.learn() with a high number of environments across a few different batch sizes (all runs were done on an NVIDIA TITAN RTX):
>
> [Screenshot: PPO.learn() wall-clock timing results]
>
> This is a nice result, but these particular hyperparameters may be uncommon (particularly for CartPole). With a smaller number of parallel environments the optimization is not as profound, but it is still quicker. The average reward of the PPO agent was roughly the same.
>
> The majority of the work is in building the tensor version of the environment; like you mentioned, stable-baselines may not be the place for it. But it seems like it would speed up policy development.

This fork is completely broken right now for me.

LeZheng-x commented 2 months ago

Is it possible to use multi-GPU for training? Multi GPU training with Pytorch