Open dhoule36 opened 4 years ago
I am working with the OP to produce a better repro. I will reassign when we have it.
I was able to create the simplest case of the issue that I was running into in Colab. I found that changing the drake version from 'latest' to '20201119' was the last nightly build that successfully does not run out of RAM ('20201120' does). Additionally, it seems that there is a deprecation warning that I receive in the same versions as the error:
Deprecated: CameraProperties are being deprecated. Please use the RenderCamera variant. This will be removed from Drake on or after 2021-03-01.
import importlib
import sys
from urllib.request import urlretrieve
# Install drake.
if 'google.colab' in sys.modules and importlib.util.find_spec('pydrake') is None:
version='latest'
build='nightly'
urlretrieve(f"https://drake-packages.csail.mit.edu/drake/{build}/drake-{version}/setup_drake_colab.py", "setup_drake_colab.py")
from setup_drake_colab import setup_drake
setup_drake(version=version, build=build)
import psutil
import os
import numpy as np
from tqdm import tqdm
import pydrake.all
from pydrake.all import RigidTransform, RollPitchYaw
ycb = ['003_cracker_box.sdf', '004_sugar_box.sdf', '005_tomato_soup_can.sdf', '006_mustard_bottle.sdf', '009_gelatin_box.sdf', '010_potted_meat_can.sdf']
path = '/tmp/clutter_maskrcnn_data'
num_images = 10000
rng = np.random.default_rng() # this is for python
generator = pydrake.common.RandomGenerator(rng.integers(1000)) # for c++
def generate_image(image_num):
builder = pydrake.systems.framework.DiagramBuilder()
plant, scene_graph = pydrake.multibody.plant.AddMultibodyPlantSceneGraph(builder, time_step=0.0005)
parser = pydrake.multibody.parsing.Parser(plant)
parser.AddModelFromFile(pydrake.common.FindResourceOrThrow(
"drake/examples/manipulation_station/models/bin.sdf"))
plant.WeldFrames(plant.world_frame(), plant.GetFrameByName("bin_base"))
inspector = scene_graph.model_inspector()
instance_id_to_class_name = dict()
for object_num in range(rng.integers(1,10)):
this_object = ycb[rng.integers(len(ycb))]
class_name = os.path.splitext(this_object)[0]
sdf = pydrake.common.FindResourceOrThrow("drake/manipulation/models/ycb/sdf/" + this_object)
instance = parser.AddModelFromFile(sdf, f"object{object_num}")
frame_id = plant.GetBodyFrameIdOrThrow(
plant.GetBodyIndices(instance)[0])
geometry_ids = inspector.GetGeometries(
frame_id, pydrake.geometry.Role.kPerception)
for geom_id in geometry_ids:
instance_id_to_class_name[int(inspector.GetPerceptionProperties(geom_id).GetProperty("label", "id"))] = class_name
plant.Finalize()
renderer = "my_renderer"
scene_graph.AddRenderer(
renderer, pydrake.geometry.render.MakeRenderEngineVtk(pydrake.geometry.render.RenderEngineVtkParams()))
properties = pydrake.geometry.render.DepthCameraProperties(width=640,
height=480,
fov_y=np.pi / 4.0,
renderer_name=renderer,
z_near=0.1,
z_far=10.0)
camera = builder.AddSystem(
pydrake.systems.sensors.RgbdSensor(parent_id=scene_graph.world_frame_id(),
X_PB=RigidTransform(
RollPitchYaw(np.pi, 0, np.pi/2.0),
[0, 0, .8]),
properties=properties,
show_window=False))
for image_num in range(num_images):
print(f"Current Memory: {psutil.virtual_memory()}")
generate_image(image_num)
This is likely related to PR #14375 or #14376 which merged on 11/20. cc @SeanCurtis-TRI @rpoyner-tri
Additionally, it seems that there is a deprecation warning that I receive in the same versions as the error.
For clarity, you're saying the deprecation warning is correlated with the leak?
Assuming that's the case, I have two thoughts:
DepthCameraProperties
but ColorRenderCamera
and DepthRenderCamera
. The Drake examples have been updated.@EricCousineau-TRI could this be related to the nature of our conservative exercise of the python binding's keep alive protocols?
Most likely... Since we have that reference cycle, the diagram will be staying alive, including the SceneGraph
and all related assets.
My money is that my PR, #14356 (b1d0617), is the main offender here.
@dhoule36 is there any chance that you have time to see if code before this commit (a) doesn't crash and (b) doesn't leak memory as you run stuff w/in a loop?
Suggested commit:
$ git log -n 1 --oneline --no-decorate b1d0617~
commit aa02a6bd9765478d6f16c448239a4e2fa9474041 (HEAD)
Author: Eric Cousineau <eric.cousineau@tri.global>
Date: Thu Nov 19 16:41:17 2020 -0500
py examples: Ensure manipulation_station_py.cc imports dep modules (#14370)
So I'd use:
version = '20201118'
build = 'nightly'
...
setup_drake(version=version, build=build)
Confirmed; I'm like 99.99% sure that #14356 is the offender.
In @dhoule36's notebook, I see memory increase using 20201120
(after that commit), but not on 20201119
(before that commit).
Which means, we're stuck between a rock and hard place. If we revert, then it means segfaults. If we keep it, then it means memory leaks, which is really bad for data gen.
I don't have a good idea on how to easily proceed :( Ideas?
(See updated permalink for more details)
@SeanCurtis-TRI
For clarity, you're saying the deprecation warning is correlated with the leak?
I don't think they're correlated. I followed your suggestion to not use DepthCameraProperties but ColorRenderCamera and DepthRenderCamera, and while there is no longer a warning, I am still seeing the memory increase.
@EricCousineau-TRI I'm just seeing this now. Did you still want me to try out the code? It looks like you may have already done so.
Did you still want me to try out the code? It looks like you may have already done so.
Nah, we should be good now, thank you for checking!
Possible solutions:
pybind11
and see if we can somehow transmit keep alive in the specific instance of DiagramBuilder -> Diagram
.(3) is probably feasible, but will be ridden with std::function<>
, type erasure, and nastiness all around... or shared_ptr
(#13058).
I’m honestly not sure what’s best. All three options seem very undesirable. Would appreciate thoughts from the other developers.
On Tue, Dec 1, 2020 at 4:20 PM Eric Cousineau notifications@github.com wrote:
Possible solutions:
- Revert #14356 https://github.com/RobotLocomotion/drake/pull/14356 and say beware
- Just keep the leak?
- Get overly creative with pybind11 and see if we can somehow transmit keep alive in the specific instance of DiagramBuilder -> Diagram.
(3) is probably feasible, but will be ridden with std::function<>, type erasure, and nastiness all around... or shared_ptr (#13058 https://github.com/RobotLocomotion/drake/issues/13058).
— You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub https://github.com/RobotLocomotion/drake/issues/14387#issuecomment-736827476, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRE2NFYGJU2ERFFPEPHY33SSVMZXANCNFSM4UGE5PNA .
For my part, I'm missing part of the backstory here.
Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.
I assume that keep_alive
is implemented in a way that does not allow python to detect the cycles. What are the details of how keep_alive is implemented, and why is doesn't play nice with the gc?
Will be investigating this (in a slow-burn fashion). See: https://github.com/pybind/pybind11/issues/2761
@ggould-tri ran into this, but in a different form. He had a segfault on his machine; on my machine, it was Too many open files
from too many LCM instances being created and not destroyed. Workaround is to save 'em:
https://github.com/EricCousineau-TRI/repro/commit/4a1c9efd
Yes, I can confirm that in my case reusing a single DrakeLcm fixes my issue, implying that something in DrakeLcm scope is leaked (at least the receive thread, based on gdb thread apply all bt
results on the core dump)
I also recently encountered this issue; my setup is very similar to @dhoule36's. I was collecting rollouts in scooping veggie task and I had to re-build before every rollout to import new geometries. Memory grew 300mb every time, so I had to revert #14356.
If we don't have to re-build every time, that also partially solves the issue.
Python has reference cycle garbage collection. Cycles are not, in general, a disaster. Reference cycles might increase the delay until the memory is reclaimed, but they should not cause unbounded growth.
Having accumulated a lot more Python knowledge since I wrote that, my current hypothesis / understanding is that python only collects cycles through Python containers (list, set, dict), so we are going to need to rework our C++ containers that participate in cycles (like Diagram and Simulator) to expose their children as python list
s, with the C++ storage weak-aliasing the python storage instead of the other way around.
Not sure if it was mentioned on this issue, but just as FYI, reference cycles are an issue here because they seem to be declared in a way that the Python GC cannot detect / prune, i.e. https://github.com/RobotLocomotion/pybind11/blob/51d715e037386fcdbeda75ffab15f02f8e4388d8/include/pybind11/pybind11.h#L2701-L2729
EDIT: Example exported from Anzu https://github.com/RobotLocomotion/drake/compare/master...EricCousineau-TRI:drake:feature-clear-patients-example
Next tasks:
(1) Make a reproducer demo program that uses pydrake
, with a "level" flag to choose how complicated of an example it should be. The simplest level will be something like an empty Diagram + Simulator without even advancing. The most complicated level will have render engines, etc. something like examples/hardware_sim
. (Jeremy)
(2) Survey for the latest instrumentation & diagnostic tools to help give us leverage and observability. (Rico)
(3) Confirm that we see leak(s) on the reproducer demo, at what kind of rate(s) for each level.
(4) Create toy bindings (simpler than the entire systems framework & etc) that share the same architecture, and confirm that the toy likewise reproduces the leak(s).
(5) Iterate on solutions using the toy bindings. My first guess is that we will need to store owned children like "Simulator has-a Diagram" and "Diagram has-a LeafSystem(s)" in a py::list
attribute of the bound parent, so that normal GC cycle detection will apply.
In the marginal dead time, work on #20491 so that our edit-test cycle on the real problem will be faster. (Rico)
In the marginal dead time, work on #20260 (or something similar, like sharding the docs) so that our edit-test cycle on the real problem will be faster. (Jeremy)
Regarding (1) and (3), I have a simple example showing leakage using weakref.ref()
in my above branch:
https://github.com/RobotLocomotion/drake/blob/cfd1f534c15dc74d0e5438b28d535eb4382d7e20/tmp/test/pybind_lifecycle_test.py#L47-L84
Note that the "hack" in this case (removing pybind11 patients) shows that the objects are appropriately freed.
The goal for (1) is to make a reproducer akin to what a user would do, so import weakref
cannot be part of that story.
Progress:
I am a student at MIT in Russ Tedrake's Robotic Manipulation course. I have been able to successfully run one of his scripts (http://manipulation.csail.mit.edu/manipulation/clutter_maskrcnn_data.py) to generate training images within the Google Colab environment, until yesterday. Google Colab now crashes after generating about 90 images saying that it has used all of the RAM, where Colab Pro has 25GB.
With Colab, I use the following lines to install drake:
I am able to successfully generate 1,000 images with the script by installing a previous build:
setup_manipulation(manipulation_sha='fa5bcfb6367cd0cfda0e3d11e11854d68b39478a', drake_version='20201118', drake_build='nightly')
The script itself, linked above, calls a generate_image function repeatedly, where I noticed that the inactive memory grows by about 300MB with every execution. It uses a multibody plant, adding an rgbd camera for capturing the training images. It drops random objects from the ycb dataset into a bin before grabbing the image from the camera's output port.
Let me know if you need me to provide any more information!