DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] PPO exhausts memory #834

Closed. genkv closed this issue 2 years ago.

genkv commented 2 years ago

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.

Question

I'm using PPO with a robotics library called iGibson. Here's the sample code I'm having issues with:

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecMonitor

num_environments = 8
env = SubprocVecEnv([make_env(i) for i in range(num_environments)])  # make_env builds one iGibson env
env = VecMonitor(env)

...

model = PPO("MultiInputPolicy", env, verbose=1, tensorboard_log=tensorboard_log_dir, policy_kwargs=policy_kwargs)

...

model.learn(total_timesteps=1_000_000)  # 1 million time steps

After the first iteration, once it prints out the rollout information, the process tries to allocate such a large amount of memory that my 64 GB of RAM + 100 GB of swap are exhausted.

[Screenshot: the process gets killed by the OOM daemon once memory runs out.]

I noticed that decreasing n_steps mitigates this issue, but then training does not converge and the trained model is of poor quality. Reducing the number of parallel environments also helps, but that's not a good option for PPO since it's on-policy training.

What exactly is the code doing that exhausts so much memory? What other metrics should I look at to avoid the overwhelming memory usage?

Thank you

Update: Here is the custom environment that I use. The code is too long to paste here, so I will just leave a URL. I'm still new to the baselines library. When memory is exhausted the system hangs, so it's a little difficult to debug. My main questions are the ones in bold text above. Thanks.

Update 2: In my case, the observation space consists of 2 parts: a 640x480 image from an RGB camera, and a 4-dimensional task observation including goal location, current location, etc. (this is a navigation task).

The action space is a continuous Box [-1, 1] that controls the differential drive controller of the agent (robot) to move around.
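
For reference, the spaces look roughly like the sketch below (the key names and the 2-dimensional action shape are illustrative assumptions, not taken from the iGibson code):

```python
# Rough sketch of the observation/action spaces described above.
# Key names and the action dimensionality are assumptions for illustration only.
import numpy as np
from gym import spaces

observation_space = spaces.Dict({
    # 640x480 RGB camera image (height, width, channels)
    "rgb": spaces.Box(low=0, high=255, shape=(480, 640, 3), dtype=np.uint8),
    # 4-dimensional task observation (goal location, current location, ...)
    "task_obs": spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32),
})

# Continuous action in [-1, 1] for the differential drive controller
# (assuming 2 dimensions, e.g. left/right wheel velocities).
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
```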

Additional context

CPU: i7-10700
GPU: RTX A2000 12 GB
RAM: 64 GB
Swap: 100 GB
Torch: 1.10.2
Stable-Baselines3: 1.4.0


araffin commented 2 years ago

Hello, please fill up the custom gym env template.

genkv commented 2 years ago

Hello, please fill up the custom gym env template.

Hi, I updated it with a link to the custom gym environment because it's too long to paste here. But I'm mostly curious about the questions in bold text, which are more general questions about PPO. Thank you.

araffin commented 2 years ago

I updated it with a link to the custom gym environment because it's too long to paste here.

Please take a close look at the custom env issue template; we need a minimal example to reproduce the error. In your case, providing the observation space and action space is the most important part. My guess is that you have a very high-dimensional observation space that fills up your RAM when collecting data. There is also the fact that we use float32 in PPO to store everything (which may not be super efficient for images).

genkv commented 2 years ago

I updated it with a link to the custom gym environment because it's too long to paste here.

Please take a close look at the custom env issue template; we need a minimal example to reproduce the error. In your case, providing the observation space and action space is the most important part. My guess is that you have a very high-dimensional observation space that fills up your RAM when collecting data. There is also the fact that we use float32 in PPO to store everything (which may not be super efficient for images).

In my case, the observation space consists of 2 parts: a 640x480 image from an RGB camera, and a 4-dimensional task observation including goal location, current location, etc. (this is a navigation task).

The action space is a continuous Box [-1, 1] that controls the differential drive controller of the agent (robot) to move around.

One thing that confuses me is: what mechanism enables the update phase to eat up memory indefinitely? When n_steps is set to 1024, the problem is gone. But if I double n_steps to 2048, the process gets stuck at the update step (after printing out the rollout information for the first iteration) and exhausts all available memory, even the 100 GB of swap.

Since the rollout buffer size is n_steps * num_environments, I should only need twice the memory budget for the rollout when I double n_steps. But in practice it consumes far more than double the original memory usage.

Thank you.

araffin commented 2 years ago

One thing that confuses me is: what mechanism enables the update phase to eat up memory indefinitely?

You mean there is a memory leak? Does it happen with a smaller n_steps? Could you profile what is taking up most of the memory? (I would expect it to be the RolloutBuffer, but we allocate all the needed RAM at the beginning of training, so the memory used should not grow.)
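
One possible way to do that profiling (just a sketch, not an SB3 built-in) is a small callback that logs the process RSS while learn() is running:

```python
# Sketch of a memory-logging callback (not part of SB3); requires psutil.
import psutil
from stable_baselines3.common.callbacks import BaseCallback

class MemoryUsageCallback(BaseCallback):
    def __init__(self, log_every: int = 5_000, verbose: int = 0):
        super().__init__(verbose)
        self.log_every = log_every
        self.process = psutil.Process()

    def _on_step(self) -> bool:
        # Record the resident set size (in GB) every `log_every` environment steps.
        if self.n_calls % self.log_every == 0:
            self.logger.record("custom/process_rss_gb",
                               self.process.memory_info().rss / 1e9)
        return True

# Hypothetical usage: model.learn(1_000_000, callback=MemoryUsageCallback())
```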

In your case, you need at least (2048 * 8 * 640 * 480 * 3 * 4) / 1e6 ≈ 60,397 MB ≈ 60 GB to store the images for one rollout (float32 is 4 bytes; you can reduce that amount by using uint8 (1 byte), but you will need a custom version of SB3 for that).
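
The same estimate in code (just reproducing the arithmetic above with the numbers from this thread):

```python
# Back-of-the-envelope memory estimate for one PPO rollout of image observations.
n_steps, n_envs = 2048, 8
height, width, channels = 480, 640, 3

pixels_per_rollout = n_steps * n_envs * height * width * channels
print(f"float32 rollout buffer: ~{pixels_per_rollout * 4 / 1e9:.1f} GB")  # ~60.4 GB
print(f"uint8 rollout buffer:   ~{pixels_per_rollout * 1 / 1e9:.1f} GB")  # ~15.1 GB
```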

What you can do:

AidanShipperley commented 2 years ago

I'm also experiencing issues with PPO, but I actually narrowed it down to way before I even create the environment. Simply running the line from stable_baselines3 import ppo commits 2.8 gigabytes of RAM on my system.

And when creating a vectorized environment (SubprocVecEnv), it creates all environments with that same commit size, 2.8 gigabytes. However, not one of the environments ever shows more than 200 megabytes actually in use.

I've tried installing Python 3.10, 3.9, 3.8, and 3.7: same issue. I've tried restarting my computer multiple times, and I've tried 3 completely new conda environments. No change.

However, I'm quite confident the issue is actually caused by the same thing that brought up this issue on SciPy's GitHub: it looks like, on Windows, multiprocessing initializes a completely new interpreter for each child process. This means that your imports are actually repeated in each child rather than the children sharing the imported code in the memory of the parent.

I believe the issue here is that algorithms like PPO, DQN, etc. could be committing ~180 megabytes per thread, which on a 16-core machine like mine works out to about 2.8 GB. I hope this helps.
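
A minimal way to see this re-import behaviour (a standalone sketch, not tied to SB3) is to force the "spawn" start method, which is the default on Windows, and time the import in each child:

```python
# Demonstration sketch: under "spawn", every child process starts a fresh
# interpreter, so heavy imports (torch, stable_baselines3, ...) are repeated
# per worker instead of being shared with the parent.
import multiprocessing as mp
import os
import time

def worker(i: int) -> None:
    t0 = time.perf_counter()
    import torch  # re-imported in every child under "spawn"
    print(f"worker {i} (pid {os.getpid()}): importing torch took "
          f"{time.perf_counter() - t0:.1f}s")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # default start method on Windows
    procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```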

Miffyli commented 2 years ago

I do not know the details of "commit size", but if it includes everything Python has loaded, then a big part of it comes from PyTorch, which is 1-2 GB. I would guess you get the same result if you just run import torch (or create something small on CUDA after the import, e.g. x = torch.rand(5).cuda()).
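
A quick way to check that (a sketch using psutil; on Windows, memory_info().vms is roughly the commit charge shown in Task Manager):

```python
# Measure how much the torch import alone adds to this process's memory.
import psutil

proc = psutil.Process()
before = proc.memory_info()

import torch  # the import suspected to account for most of the footprint
_ = torch.rand(5)  # tiny CPU tensor; use .cuda() to also trigger CUDA init

after = proc.memory_info()
print(f"RSS grew by        {(after.rss - before.rss) / 1e6:.0f} MB")
print(f"VMS/commit grew by {(after.vms - before.vms) / 1e6:.0f} MB")
```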

Multiprocessing on Windows

Yes, this is how multiprocessing works in Python in general :). But indeed, the way processes are spawned differs between systems, and Windows has been especially tricky at times.

AidanShipperley commented 2 years ago

I found a solution to my issue, however I can't really say whether it is the same issue brought up by @genkv. It could still definitely help with memory allocated by PyTorch, though.

Thank you @Miffyli. When I noticed that simply calling import torch also committed the same absurd amount of memory, I did some digging, and it's actually an issue caused by Nvidia fatbins (.nv_fatb) being loaded into memory, not by PyTorch specifically.

The following info is from this Stack Overflow answer:

Several DLLs, such as cusolver64_xx.dll, torch_cuda_cu.dll, and a few others, have .nv_fatb sections in them. These contain tons of different variations of CUDA code for different GPUs, so it ends up being several hundred megabytes to a couple of gigabytes. When Python imports 'torch' it loads these DLLs and maps the .nv_fatb section into memory. For some reason, instead of just being a memory-mapped file, it is actually taking up memory. The section is set as 'copy on write' . . . if you look at Python using VMMap (https://docs.microsoft.com/en-us/sysinternals/downloads/vmmap) you can see that these DLLs are committing huge amounts of memory for this .nv_fatb section. The frustrating part is that it doesn't seem to be using the memory. For example, right now my Python.exe has 2.7 GB committed, but the working set is only 148 MB.

The answer also provides a Python script which is intended to be run on your Lib\site-packages\torch\lib\ directory. It scans through all DLLs matched by the input glob, and if it finds an .nv_fatb section, it backs up the DLL, disables ASLR, and marks the .nv_fatb section read-only.

ASLR is 'address space layout randomization.' It is a security feature that randomizes where a DLL is loaded in memory. We disable it for this DLL so that all Python processes will load the DLL into the same base virtual address. If all Python processes using the DLL load it at the same base address, they can all share the DLL. Otherwise each process needs its own copy.

Marking the section 'read-only' lets Windows know that the contents will not change in memory. If you map a file into memory read/write, Windows has to commit enough memory, backed by the pagefile, just in case you make a modification to it. If the section is read-only, there is no need to back it in the pagefile. We know there are no modifications to it, so it can always be found in the DLL.
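
For illustration only, here is a rough sketch of what such a patch might look like using the pefile package (this is NOT the script from the Stack Overflow answer, and you should back up the DLLs before trying anything like it):

```python
# Sketch: mark .nv_fatb sections read-only and clear the ASLR flag so Windows
# can share the mapped DLL between Python processes. Illustrative only.
import glob
import pefile

IMAGE_SCN_MEM_WRITE = 0x80000000        # "section is writable" flag
IMAGE_DLLCHAR_DYNAMIC_BASE = 0x0040     # ASLR flag in the optional header

for path in glob.glob(r"Lib\site-packages\torch\lib\*.dll"):
    pe = pefile.PE(path)
    nv_fatb_sections = [s for s in pe.sections
                        if s.Name.rstrip(b"\x00") == b".nv_fatb"]
    if not nv_fatb_sections:
        continue
    for section in nv_fatb_sections:
        section.Characteristics &= ~IMAGE_SCN_MEM_WRITE                   # read-only
    pe.OPTIONAL_HEADER.DllCharacteristics &= ~IMAGE_DLLCHAR_DYNAMIC_BASE  # no ASLR
    pe.write(path + ".patched")  # write a patched copy rather than overwriting
    print(f"patched {path}")
```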

The last important thing I can think to note is that Nvidia plans to set the .nv_fatb section to read-only in the next major CUDA release (11.7) according to the answer:

Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only; this change will be targeting the next major CUDA release, 11.7. We are not changing the ASLR, as that is considered a safety feature."

After I ran the Python script, running import torch went from committing 2.8-2.9 GB of RAM to 1.1-1.2 GB, and my vectorized environments, which would each commit 2.8-2.9 GB, now only commit 1.1-1.2 GB each.

Hopefully this helps somebody!

bdytx5 commented 1 year ago

Dividing my image size from 800x600 down to 400x300 solved the issue.
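
That kind of downscaling can also be done on the SB3 side with an observation wrapper (a sketch under the assumption of a Dict observation with an "rgb" key holding a uint8 image; the simulator itself may expose an image-resolution setting instead):

```python
# Sketch of an observation wrapper that shrinks the RGB image, reducing the
# per-step observation size (and therefore the rollout buffer) by ~4x.
import cv2
import gym
import numpy as np
from gym import spaces

class ResizeRGBWrapper(gym.ObservationWrapper):
    def __init__(self, env, width: int = 320, height: int = 240):
        super().__init__(env)
        self.width, self.height = width, height
        new_spaces = dict(env.observation_space.spaces)
        new_spaces["rgb"] = spaces.Box(low=0, high=255,
                                       shape=(height, width, 3), dtype=np.uint8)
        self.observation_space = spaces.Dict(new_spaces)

    def observation(self, obs):
        obs = dict(obs)
        obs["rgb"] = cv2.resize(obs["rgb"], (self.width, self.height),
                                interpolation=cv2.INTER_AREA)
        return obs
```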