dhoule36 commented 4 years ago

I am a student at MIT in Russ Tedrake's Robotic Manipulation course. I have been able to successfully run one of his scripts (http://manipulation.csail.mit.edu/manipulation/clutter_maskrcnn_data.py) to generate training images within the Google Colab environment, until yesterday. Google Colab now crashes after generating about 90 images saying that it has used all of the RAM, where Colab Pro has 25GB.

With Colab, I use the following lines to install drake:

# Install drake.
if 'google.colab' in sys.modules and importlib.util.find_spec('manipulation') is None:
    urlretrieve(f"http://manipulation.csail.mit.edu/scripts/setup/setup_manipulation_colab.py",
                "setup_manipulation_colab.py")
    from setup_manipulation_colab import setup_manipulation
    setup_manipulation(manipulation_sha='master', drake_version='latest', drake_build='continuous')

I am able to successfully generate 1,000 images with the script by installing a previous build: setup_manipulation(manipulation_sha='fa5bcfb6367cd0cfda0e3d11e11854d68b39478a', drake_version='20201118', drake_build='nightly')

The script itself, linked above, calls a generate_image function repeatedly, where I noticed that the inactive memory grows by about 300MB with every execution. It uses a multibody plant, adding an rgbd camera for capturing the training images. It drops random objects from the ycb dataset into a bin before grabbing the image from the camera's output port.

Let me know if you need me to provide any more information!

RussTedrake commented 3 years ago

I am working with the OP to produce a better repro. I will reassign when we have it.

dhoule36 commented 3 years ago

I was able to create the simplest case of the issue that I was running into in Colab. I found that changing the drake version from 'latest' to '20201119' was the last nightly build that successfully does not run out of RAM ('20201120' does). Additionally, it seems that there is a deprecation warning that I receive in the same versions as the error:

Deprecated: CameraProperties are being deprecated. Please use the RenderCamera variant. This will be removed from Drake on or after 2021-03-01.

import importlib
import sys
from urllib.request import urlretrieve

# Install drake.
if 'google.colab' in sys.modules and importlib.util.find_spec('pydrake') is None:
  version='latest'
  build='nightly'
  urlretrieve(f"https://drake-packages.csail.mit.edu/drake/{build}/drake-{version}/setup_drake_colab.py", "setup_drake_colab.py")
  from setup_drake_colab import setup_drake
  setup_drake(version=version, build=build)

import psutil

import os
import numpy as np
from tqdm import tqdm

import pydrake.all
from pydrake.all import RigidTransform, RollPitchYaw

ycb = ['003_cracker_box.sdf', '004_sugar_box.sdf', '005_tomato_soup_can.sdf', '006_mustard_bottle.sdf', '009_gelatin_box.sdf', '010_potted_meat_can.sdf']
path = '/tmp/clutter_maskrcnn_data'
num_images = 10000

rng = np.random.default_rng()  # this is for python
generator = pydrake.common.RandomGenerator(rng.integers(1000))  # for c++

def generate_image(image_num):
    builder = pydrake.systems.framework.DiagramBuilder()
    plant, scene_graph = pydrake.multibody.plant.AddMultibodyPlantSceneGraph(builder, time_step=0.0005)
    parser = pydrake.multibody.parsing.Parser(plant)
    parser.AddModelFromFile(pydrake.common.FindResourceOrThrow(
        "drake/examples/manipulation_station/models/bin.sdf"))
    plant.WeldFrames(plant.world_frame(), plant.GetFrameByName("bin_base"))
    inspector = scene_graph.model_inspector()

    instance_id_to_class_name = dict()

    for object_num in range(rng.integers(1,10)):
        this_object = ycb[rng.integers(len(ycb))]
        class_name = os.path.splitext(this_object)[0]
        sdf = pydrake.common.FindResourceOrThrow("drake/manipulation/models/ycb/sdf/" + this_object)
        instance = parser.AddModelFromFile(sdf, f"object{object_num}")

        frame_id = plant.GetBodyFrameIdOrThrow(
            plant.GetBodyIndices(instance)[0])
        geometry_ids = inspector.GetGeometries(
            frame_id, pydrake.geometry.Role.kPerception)
        for geom_id in geometry_ids:
            instance_id_to_class_name[int(inspector.GetPerceptionProperties(geom_id).GetProperty("label", "id"))] = class_name

    plant.Finalize()

    renderer = "my_renderer"
    scene_graph.AddRenderer(
        renderer, pydrake.geometry.render.MakeRenderEngineVtk(pydrake.geometry.render.RenderEngineVtkParams()))
    properties = pydrake.geometry.render.DepthCameraProperties(width=640,
                                        height=480,
                                        fov_y=np.pi / 4.0,
                                        renderer_name=renderer,
                                        z_near=0.1,
                                        z_far=10.0)
    camera = builder.AddSystem(
        pydrake.systems.sensors.RgbdSensor(parent_id=scene_graph.world_frame_id(),
                    X_PB=RigidTransform(
                        RollPitchYaw(np.pi, 0, np.pi/2.0),
                        [0, 0, .8]),
                    properties=properties,
                    show_window=False))

for image_num in range(num_images):
    print(f"Current Memory: {psutil.virtual_memory()}")
    generate_image(image_num)

sherm1 commented 3 years ago

This is likely related to PR #14375 or #14376 which merged on 11/20. cc @SeanCurtis-TRI @rpoyner-tri

SeanCurtis-TRI commented 3 years ago

Additionally, it seems that there is a deprecation warning that I receive in the same versions as the error.

For clarity, you're saying the deprecation warning is correlated with the leak?

Assuming that's the case, I have two thoughts:

Update your code to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera. The Drake examples have been updated.
@EricCousineau-TRI could this be related to the nature of our conservative exercise of the python binding's keep alive protocols?

EricCousineau-TRI commented 3 years ago

@EricCousineau-TRI could this be related to the nature of our conservative exercise of the python binding's keep alive protocols?

Most likely... Since we have that reference cycle, the diagram will be staying alive, including the SceneGraph and all related assets.

My money is that my PR, #14356 (b1d0617), is the main offender here.

@dhoule36 is there any chance that you have time to see if code before this commit (a) doesn't crash and (b) doesn't leak memory as you run stuff w/in a loop?

Suggested commit:

$ git log -n 1 --oneline --no-decorate b1d0617~
commit aa02a6bd9765478d6f16c448239a4e2fa9474041 (HEAD)
Author: Eric Cousineau <eric.cousineau@tri.global>
Date:   Thu Nov 19 16:41:17 2020 -0500

    py examples: Ensure manipulation_station_py.cc imports dep modules (#14370)

So I'd use:

version = '20201118'
build = 'nightly'
...
setup_drake(version=version, build=build)

EricCousineau-TRI commented 3 years ago

Trying on my end: https://colab.research.google.com/github/EricCousineau-TRI/repro/blob/3d297d22117941d773a954547fc47f673987a111/drake_stuff/issues/issue_14387/repro.ipynb

EricCousineau-TRI commented 3 years ago

Confirmed; I'm like 99.99% sure that #14356 is the offender. In @dhoule36's notebook, I see memory increase using 20201120 (after that commit), but not on 20201119 (before that commit).

Which means, we're stuck between a rock and hard place. If we revert, then it means segfaults. If we keep it, then it means memory leaks, which is really bad for data gen.

I don't have a good idea on how to easily proceed :( Ideas?

(See updated permalink for more details)

dhoule36 commented 3 years ago

@SeanCurtis-TRI

For clarity, you're saying the deprecation warning is correlated with the leak?

I don't think they're correlated. I followed your suggestion to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera, and while there is no longer a warning, I am still seeing the memory increase.

@EricCousineau-TRI I'm just seeing this now. Did you still want me to try out the code? It looks like you may have already done so.

EricCousineau-TRI commented 3 years ago

Did you still want me to try out the code? It looks like you may have already done so.

Nah, we should be good now, thank you for checking!

EricCousineau-TRI commented 3 years ago

Possible solutions:

Revert #14356 and say beware
Just keep the leak?
Get overly creative with pybind11 and see if we can somehow transmit keep alive in the specific instance of DiagramBuilder -> Diagram.

(3) is probably feasible, but will be ridden with std::function<>, type erasure, and nastiness all around... or shared_ptr (#13058).

RussTedrake commented 3 years ago

I’m honestly not sure what’s best. All three options seem very undesirable. Would appreciate thoughts from the other developers.

On Tue, Dec 1, 2020 at 4:20 PM Eric Cousineau notifications@github.com wrote:

Possible solutions:

Revert #14356 https://github.com/RobotLocomotion/drake/pull/14356 and say beware

Just keep the leak?

Get overly creative with pybind11 and see if we can somehow transmit keep alive in the specific instance of DiagramBuilder -> Diagram.

(3) is probably feasible, but will be ridden with std::function<>, type erasure, and nastiness all around... or shared_ptr (#13058 https://github.com/RobotLocomotion/drake/issues/13058).

— You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub https://github.com/RobotLocomotion/drake/issues/14387#issuecomment-736827476, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRE2NFYGJU2ERFFPEPHY33SSVMZXANCNFSM4UGE5PNA .

jwnimmer-tri commented 3 years ago

For my part, I'm missing part of the backstory here.

Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.

I assume that keep_alive is implemented in a way that does not allow python to detect the cycles. What are the details of how keep_alive is implemented, and why is doesn't play nice with the gc?

EricCousineau-TRI commented 3 years ago

Will be investigating this (in a slow-burn fashion). See: https://github.com/pybind/pybind11/issues/2761

EricCousineau-TRI commented 3 years ago

@ggould-tri ran into this, but in a different form. He had a segfault on his machine; on my machine, it was Too many open files from too many LCM instances being created and not destroyed. Workaround is to save 'em: https://github.com/EricCousineau-TRI/repro/commit/4a1c9efd

ggould-tri commented 3 years ago

Yes, I can confirm that in my case reusing a single DrakeLcm fixes my issue, implying that something in DrakeLcm scope is leaked (at least the receive thread, based on gdb thread apply all bt results on the core dump)

allenzren commented 2 years ago

I also recently encountered this issue; my setup is very similar to @dhoule36's. I was collecting rollouts in scooping veggie task and I had to re-build before every rollout to import new geometries. Memory grew 300mb every time, so I had to revert #14356.

If we don't have to re-build every time, that also partially solves the issue.

jwnimmer-tri commented 2 months ago

Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.

Having accumulated a lot more Python knowledge since I wrote that, my current hypothesis / understanding is that python only collects cycles through Python containers (list, set, dict), so we are going to need to rework our C++ containers that participate in cycles (like Diagram and Simulator) to expose their children as python lists, with the C++ storage weak-aliasing the python storage instead of the other way around.

EricCousineau-TRI commented 2 months ago

Not sure if it was mentioned on this issue, but just as FYI, reference cycles are an issue here because they seem to be declared in a way that the Python GC cannot detect / prune, i.e. https://github.com/RobotLocomotion/pybind11/blob/51d715e037386fcdbeda75ffab15f02f8e4388d8/include/pybind11/pybind11.h#L2701-L2729

EDIT: Example exported from Anzu https://github.com/RobotLocomotion/drake/compare/master...EricCousineau-TRI:drake:feature-clear-patients-example

jwnimmer-tri commented 2 months ago

Next tasks:

(1) Make a reproducer demo program that uses pydrake, with a "level" flag to choose how complicated of an example it should be. The simplest level will be something like an empty Diagram + Simulator without even advancing. The most complicated level will have render engines, etc. something like examples/hardware_sim. (Jeremy)

(2) Survey for the latest instrumentation & diagnostic tools to help give us leverage and observability. (Rico)

(3) Confirm that we see leak(s) on the reproducer demo, at what kind of rate(s) for each level.

(4) Create toy bindings (simpler than the entire systems framework & etc) that share the same architecture, and confirm that the toy likewise reproduces the leak(s).

(5) Iterate on solutions using the toy bindings. My first guess is that we will need to store owned children like "Simulator has-a Diagram" and "Diagram has-a LeafSystem(s)" in a py::list attribute of the bound parent, so that normal GC cycle detection will apply.

In the marginal dead time, work on #20491 so that our edit-test cycle on the real problem will be faster. (Rico)

In the marginal dead time, work on #20260 (or something similar, like sharding the docs) so that our edit-test cycle on the real problem will be faster. (Jeremy)

EricCousineau-TRI commented 2 months ago

Regarding (1) and (3), I have a simple example showing leakage using weakref.ref() in my above branch: https://github.com/RobotLocomotion/drake/blob/cfd1f534c15dc74d0e5438b28d535eb4382d7e20/tmp/test/pybind_lifecycle_test.py#L47-L84 Note that the "hack" in this case (removing pybind11 patients) shows that the objects are appropriately freed.

jwnimmer-tri commented 2 months ago

The goal for (1) is to make a reproducer akin to what a user would do, so import weakref cannot be part of that story.

RobotLocomotion / drake

pydrake: Memory Leak While Generating Training Images (leaking Diagrams w/ MultibodyPlant, SceneGraph, etc) #14387

22059

22075

22068