DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] I do not understand the GPU and memory usage of SB3 #1630

Open EloyAnguiano opened 1 year ago

EloyAnguiano commented 1 year ago

❓ Question

I think I do not understand the memory usage of SB3. I have a Dict observation space made of some huge matrices, so a single observation is approximately 17 MB:

(Pdb) [sys.getsizeof(v) for k, v in obs.items()]
[2039312, 2968, 12235248, 105800, 2968, 2968, 2968, 2039312, 116, 2039312, 2968, 2968, 2968]
(Pdb) sum([sys.getsizeof(v) for k, v in obs.items()])/1024/1024
17.623783111572266

I am training a PPO agent on a vectorized environment created with the make_vec_env function at n_envs = 2, and the hyperparameters of my PPO agent are n_steps = 6 and batch_size = 16. If I understood correctly, my rollout buffer will hold n_steps x n_envs = 12 transitions, so the rollout_buffer will take about 17 x 12 = 204 MB. I assume that a batch_size of 16 will be capped at that minimum, so it is equivalent to having a batch size of 12.
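As a sanity check, that estimate in numbers (a minimal sketch reusing the sizes printed above; 17.6 MB is the measured size of a single observation):

obs_size_mb = 17.6                          # measured size of one Dict observation, in MB
n_steps = 6
n_envs = 2
transitions_per_rollout = n_steps * n_envs  # 12 transitions stored per rollout
print(f"expected observation storage: {obs_size_mb * transitions_per_rollout:.0f} MB")  # ~211 MB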

The problem is that when I use a GPU device (an 80 GB A100), memory usage stabilizes at around 70 GB at the beginning, and a little later training stops because the device runs out of space. How is this even possible?


araffin commented 1 year ago

Hello, an important piece of information is missing: your network architecture. The rollout buffer stores things in RAM, not on the GPU, and most GPU memory is taken by weights and gradients.
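One hedged way to check where the GPU memory actually goes is to ask PyTorch directly (a sketch, not part of SB3; it only reports memory held by PyTorch tensors, whereas nvidia-smi additionally counts the per-process CUDA context):

import torch

def report_gpu_memory(tag, device=0):
    # Memory currently occupied by live tensors vs. memory the caching
    # allocator has reserved from the driver.
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    print(f"[{tag}] allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB")

# e.g. report_gpu_memory("before learn"); agent.learn(...); report_gpu_memory("after learn")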

Might be a duplicate of https://github.com/DLR-RM/stable-baselines3/issues/863

EloyAnguiano commented 1 year ago

Printing my model size with this:

def print_model_size(model):
    # Bytes occupied by the model parameters (weights and biases)
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    # Bytes occupied by registered buffers (e.g. running statistics)
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()

    size_all_mb = (param_size + buffer_size) / 1024**2
    print(f'model size: {size_all_mb:.3f}MB')  # noqa: T201

And calling it like: print_model_size(agent.policy)

Returns: model size: 8.369MB

Is there any part of the agent that could be bigger? I am using my own custom FeatureExtractor class, but I assume it is included in agent.policy.
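One hedged way to double-check that (a sketch; agent.policy is the same object passed to print_model_size above) is to list which submodules of the policy actually hold parameters and confirm the custom features extractor shows up among them:

for name, module in agent.policy.named_modules():
    # Count only the parameters owned directly by each submodule
    n_params = sum(p.numel() for p in module.parameters(recurse=False))
    if n_params > 0:
        print(name, n_params)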

EloyAnguiano commented 1 year ago

Also, whenever you run a batch of data on the GPU you have to transfer that data to the CUDA device, so the data are on the GPU at some point, aren't they?
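For reference, the general pattern here (a minimal sketch, not the actual SB3 code) is that only the sampled minibatch gets copied to the device right before the forward/backward pass, while the full rollout stays in CPU memory as numpy arrays:

import numpy as np
import torch

device = torch.device("cuda")

cpu_rollout = np.random.randn(1024, 64).astype(np.float32)           # stand-in for the stored rollout
batch_idx = np.random.permutation(len(cpu_rollout))[:64]             # sample one minibatch
gpu_batch = torch.as_tensor(cpu_rollout[batch_idx], device=device)   # copied to the GPU only here
loss = gpu_batch.pow(2).mean()                                       # forward/backward runs on this batch only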

EloyAnguiano commented 1 year ago

I am still unable to figure out the problem here, nor in https://github.com/DLR-RM/stable-baselines3/issues/863. There, the solution was to flatten the observation, but that does not explain anything.

EloyAnguiano commented 1 year ago

@araffin I think GPU usage could be a bit more optimal. First of all, while debugging the PPO class (the train method) I found the GPU usage confusing: if I keep every hyperparameter fixed (n_steps, batch_size, etc.) but change the number of environments in the vectorized environment, the GPU usage differs:

1 environment: 1815 MiB
16 environments: 8551 MiB

I do not understand this, as self.rollout_buffer.size() is still n_steps as before (32 in my case), so I do not know where this comes from. Indeed, the only things that should affect GPU memory usage are the policy size itself, the batch_size (this is key: the rollout_buffer should live in RAM, and whenever we want to train on a batch, only that data is moved to the GPU), and the gradients of the model for backpropagation.

Does this make any sense? Am I missing something?
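A hedged back-of-the-envelope for that expectation (a sketch assuming the Adam optimizer and ignoring intermediate activations and allocator overhead; batch_bytes stands for whatever one minibatch of tensors occupies):

def estimate_train_gpu_mb(model, batch_bytes):
    # Parameters + one gradient per parameter + Adam's two moment buffers,
    # plus the minibatch currently being processed.
    param_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
    grad_bytes = param_bytes
    adam_bytes = 2 * param_bytes
    return (param_bytes + grad_bytes + adam_bytes + batch_bytes) / 1024**2

# e.g. estimate_train_gpu_mb(agent.policy, batch_bytes=16 * 17_600_000)  # 16 obs of ~17.6 MB each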

araffin commented 1 year ago

This should answer your question:

https://github.com/DLR-RM/stable-baselines3/blob/aab545901fe331814f822060d677f22191cba419/stable_baselines3/common/buffers.py#L391-L398

EloyAnguiano commented 1 year ago

Yes, it does. Thanks a lot. This also brings up my other question: I think the rollout buffer should not be on the GPU, and GPU usage should be controlled by the batch size at each epoch. That way, you could collect a giant rollout_buffer but still train on a small but fast GPU by choosing an appropriate batch_size. Isn't that right?

araffin commented 1 year ago

https://github.com/DLR-RM/stable-baselines3/pull/1720#issuecomment-1776700957

EloyAnguiano commented 1 year ago

Sorry, I do not understand. If the rollout buffer is always on the CPU, why does the number of environments used increase GPU usage, as in https://github.com/DLR-RM/stable-baselines3/issues/1630#issuecomment-1775419847?

EloyAnguiano commented 1 year ago

Indeed, if I debug PPO training on a GPU, I get this:

(Pdb) self.rollout_buffer.device
device(type='cuda', index=2)

This should mean that the data of the rollout_buffer are allocated on the GPU.

araffin commented 1 year ago

> why does the number of environments used increase GPU usage?

Are you using subprocesses? If so, that might be due to the way Python multiprocessing works.

> This should mean that the data of the rollout_buffer are allocated on the GPU

If you look at the code (and you should), the device is only used here: https://github.com/DLR-RM/stable-baselines3/blob/aab545901fe331814f822060d677f22191cba419/stable_baselines3/common/buffers.py#L127-L139

that is, when sampling the data here: https://github.com/DLR-RM/stable-baselines3/blob/aab545901fe331814f822060d677f22191cba419/stable_baselines3/common/buffers.py#L520
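A quick, hedged way to verify this outside the debugger (agent here is assumed to be the PPO instance from earlier; rewards is a standard RolloutBuffer attribute):

import numpy as np

# The buffer's storage is plain numpy living in CPU RAM; the `device`
# attribute only matters when a sampled minibatch is converted to torch
# tensors at the line linked above.
print(type(agent.rollout_buffer.rewards))                     # <class 'numpy.ndarray'>
print(isinstance(agent.rollout_buffer.rewards, np.ndarray))   # True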

EloyAnguiano commented 1 year ago

I am creating the environment like this:

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

gym_env = make_vec_env(make_env,
                       env_kwargs=env_kwargs,
                       n_envs=args.n_envs,
                       vec_env_cls=SubprocVecEnv)

So I assume it uses some kind of multiprocessing, yes. What does this have to do with GPU usage?

EloyAnguiano commented 1 year ago

Hi again @araffin. I am still unable to figure this out: if the transfer of data from the RolloutBuffer happens at each sampling, how can the GPU usage be so big just when the code enters the train method? At that point there should not be any data on the GPU, only the model.
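One hedged way to narrow this down would be a small callback that logs what PyTorch itself has allocated at the end of each rollout, i.e. right before train() runs, to separate real tensor allocations from the per-process CUDA context and the caching-allocator reserve that nvidia-smi also counts (a sketch, not an official SB3 utility):

import torch
from stable_baselines3.common.callbacks import BaseCallback

class GPUMemoryCallback(BaseCallback):
    # Logs PyTorch's view of GPU memory after each rollout collection,
    # which is just before PPO.train() starts.
    def _on_step(self) -> bool:
        return True

    def _on_rollout_end(self) -> None:
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"t={self.num_timesteps}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

# agent.learn(total_timesteps=..., callback=GPUMemoryCallback())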