EloyAnguiano opened this issue 1 year ago
Hello, there is an important piece of information missing: your network architecture. The rollout buffer stores things in RAM, not on the GPU, and most GPU memory is taken by the weights and gradients.
Might be a duplicate of https://github.com/DLR-RM/stable-baselines3/issues/863
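To make "weights and gradients" concrete, here is a minimal sketch of a rough training-time estimate, assuming an Adam-like optimizer (`estimate_train_memory_mb` is a hypothetical helper, not part of SB3):

```python
import torch

def estimate_train_memory_mb(policy: torch.nn.Module) -> float:
    """Rough GPU training-memory estimate: weights + gradients + Adam state.

    A sketch only; it ignores activations, the CUDA context and PyTorch's
    caching allocator, which often add hundreds of MiB on top.
    """
    param_bytes = sum(p.nelement() * p.element_size() for p in policy.parameters())
    grad_bytes = param_bytes        # one gradient tensor per parameter
    adam_bytes = 2 * param_bytes    # exp_avg and exp_avg_sq kept by Adam
    return (param_bytes + grad_bytes + adam_bytes) / 1024**2
```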
Printing my model size with this:
```python
def print_model_size(model):
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    size_all_mb = (param_size + buffer_size) / 1024**2
    print(f"model size: {size_all_mb:.3f}MB")  # noqa: T201
```
And calling it like this:
```python
print_model_size(agent.policy)
```
returns:
```
model size: 8.369MB
```
Is there any part of the agent that could be bigger? I am using my custom FeatureExtractor class, but I assume it is included in the policy argument.
Also, whenever you run a batch of data on the GPU you have to transfer that data to the CUDA device, so the data is on the GPU at some point, isn't it?
I am still unable to figure out the problem here, nor in the https://github.com/DLR-RM/stable-baselines3/issues/863 issue. There, the solution was to flatten the observation, but that does not explain anything.
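For reference, the pattern being described, keeping the full buffer in RAM and copying only one minibatch to the GPU at a time, looks roughly like this in plain PyTorch (a generic sketch with made-up names, not SB3's actual code):

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical buffer kept in CPU RAM as a numpy array: (n_transitions, obs_dim)
buffer_obs = np.random.randn(2048, 64).astype(np.float32)

# Only the sampled minibatch is copied to the device; the full buffer never is.
batch_indices = np.random.randint(0, len(buffer_obs), size=64)
obs_batch = torch.as_tensor(buffer_obs[batch_indices]).to(device)
print(obs_batch.shape, obs_batch.device)
```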
@araffin I think the GPU usage could be a bit more optimal. First of all, while debugging the PPO class (the train method) I found the GPU usage confusing: if I keep every hyperparameter fixed (`n_steps`, `batch_size`, etc.) but change the number of environments in the vectorized environment, the GPU usage differs:

- 1 environment: 1815 MiB
- 16 environments: 8551 MiB

I do not understand this, as `self.rollout_buffer.size()` is still `n_steps` as before (32 in my case), so I do not know where the difference comes from. Indeed, the only things that should affect GPU memory usage are the policy size itself, the `batch_size` (this is key: the rollout_buffer should live in RAM, and whenever we want to train on a batch, only that batch is moved to the GPU), and the gradients of the model needed for backpropagation.
Does this make any sense? Am I missing something?
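For what it is worth, a back-of-the-envelope sketch of how the relevant quantities scale with `n_envs` (purely illustrative; the ~17 MB observation size is taken from the original question further down, and activations inside the features extractor, CUDA caching and optimizer state are not counted):

```python
def buffer_ram_mb(n_steps: int, n_envs: int, obs_mb: float) -> float:
    # The rollout buffer allocates arrays of shape (n_steps, n_envs, *obs_shape),
    # so its RAM footprint grows with n_envs even though its length stays n_steps.
    return n_steps * n_envs * obs_mb

def rollout_forward_gpu_mb(n_envs: int, obs_mb: float) -> float:
    # During rollout collection the policy forward pass receives all n_envs
    # observations at once on the GPU; activations come on top of this.
    return n_envs * obs_mb

print(buffer_ram_mb(32, 16, 17))       # ~8704 MB of CPU RAM
print(rollout_forward_gpu_mb(16, 17))  # ~272 MB of raw GPU input per step
```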
This should answer your question:
Yes, it does. Thanks a lot. This also brings up my other question: I think the rollout buffer should not be on the GPU, and GPU usage should be controlled by the batch size at each epoch. That way, you could collect a giant rollout_buffer but still train on a small-but-fast GPU by choosing an appropriate `batch_size`. Isn't that right?
Sorry, I do not understand. If the rollout buffer is always on the CPU, why does the number of environments increase the GPU usage in https://github.com/DLR-RM/stable-baselines3/issues/1630#issuecomment-1775419847?
Indeed, if I debug PPO training on a GPU, I get this:
```
(Pdb) self.rollout_buffer.device
device(type='cuda', index=2)
```
This should mean that the data of the rollout_buffer is allocated on the GPU.
> why does the number of environments increase the GPU usage?
Are you using subprocesses? If so, that might be due to the way Python multiprocessing works.
> This should mean that the data of the rollout_buffer is allocated on the GPU
If you look at the code (and you should), the device is only used here: https://github.com/DLR-RM/stable-baselines3/blob/aab545901fe331814f822060d677f22191cba419/stable_baselines3/common/buffers.py#L127-L139 when sampling the data, there: https://github.com/DLR-RM/stable-baselines3/blob/aab545901fe331814f822060d677f22191cba419/stable_baselines3/common/buffers.py#L520
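In other words, the arrays themselves live in CPU RAM as numpy; the device attribute is only consulted when a sampled batch is turned into a tensor. A minimal sketch of that pattern, simplified and not the actual SB3 implementation:

```python
import numpy as np
import torch


class TinyBuffer:
    """Simplified sketch of the store-in-numpy / convert-on-sample pattern."""

    def __init__(self, size: int, obs_dim: int, device: str = "cpu"):
        self.observations = np.zeros((size, obs_dim), dtype=np.float32)  # CPU RAM
        self.device = torch.device(device)

    def to_torch(self, array: np.ndarray) -> torch.Tensor:
        # Only the slice passed in ever reaches self.device.
        return torch.as_tensor(array, device=self.device)

    def sample(self, batch_size: int) -> torch.Tensor:
        idx = np.random.randint(0, len(self.observations), size=batch_size)
        return self.to_torch(self.observations[idx])
```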
I am creating the environment like this:
```python
gym_env = make_vec_env(
    make_env,
    env_kwargs=env_kwargs,
    n_envs=args.n_envs,
    vec_env_cls=SubprocVecEnv,
)
```
So I assume it uses some kind of multiprocessing, yes. What does this have to do with GPU usage?
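One thing worth checking, purely as an assumption on my side: every process that initializes CUDA holds its own CUDA context on the device (typically a few hundred MiB each), so if the env workers ever touch torch/CUDA, GPU usage would grow with `n_envs`. A hypothetical diagnostic, not something SB3 provides:

```python
import torch
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv


def make_env():
    import gymnasium as gym

    # "CartPole-v1" is just a placeholder for the real environment.
    env = gym.make("CartPole-v1")
    # If this prints True inside a worker process, that worker holds its own
    # CUDA context on the GPU, independent of the rollout buffer.
    print("CUDA initialized in this worker:", torch.cuda.is_initialized())
    return env


if __name__ == "__main__":
    vec_env = make_vec_env(make_env, n_envs=4, vec_env_cls=SubprocVecEnv)
```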
Hi again @araffin. I am still unable to figure out how, if the transfer of data from the RolloutBuffer happens only at each sampling, the GPU usage can be so big as soon as the code enters the train method: at that point there should not be any data on the GPU, only the model.
❓ Question
I think I do not understand the memory usage of SB3. I have a Dict observation space of some huge matrices, so each observation is approximately 17 MB.

I am training a PPO agent on a vectorized environment created with the `make_vec_env` function at `n_envs = 2`, and the hyperparameters of my PPO agent are `n_steps = 6` and `batch_size = 16`. If I understood correctly, my rollout buffer will hold `n_steps x n_envs = 12` transitions, so the rollout_buffer will take about 17 x 12 = 204 MB. I assume that the `batch_size` of 16 will be capped at that minimum, so it is equivalent to having a batch size of 12.

The problem is that when I use a GPU device (an 80 GB A100), memory usage stabilizes at 70 GB at the beginning, and a little later training stops for lack of space on the device. How is this even possible?
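For the record, a sketch of the expected footprint from those numbers under the stated assumptions; none of it comes anywhere near 70 GB, so the gap has to come from something not counted here (activations inside the custom features extractor, CUDA caching, optimizer state, etc.):

```python
obs_mb = 17                              # approximate size of one observation
n_steps, n_envs = 6, 2
batch_size = min(16, n_steps * n_envs)   # effectively 12, as assumed above

rollout_buffer_ram_mb = n_steps * n_envs * obs_mb   # ~204 MB, kept in CPU RAM
minibatch_obs_gpu_mb = batch_size * obs_mb          # ~204 MB of raw observations per minibatch

print(rollout_buffer_ram_mb, minibatch_obs_gpu_mb)
```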