RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

pydrake: Permit granular Bazel targets (linking analysis, or providing granular shared libs) #6760

Closed EricCousineau-TRI closed 2 years ago

EricCousineau-TRI commented 7 years ago

UPDATE (2019/08/22):

Per this comment below, I've renamed this to address the real problem, which is that //bindings/pydrake is (properly) encapsulated due to the linking of monolithic libdrake.so.

At some point, I (and others) want to consume parts of Drake for Python (e.g. just //math). I hate having to build all of libdrake.so just so I can import RigidTransform. This also makes ODR issues a pain when doing work in Anzu or other downstream projects (like what Mark mentioned below).


OLD Discussion:

The goal of PR #6465 was to mitigate duplicate global variable definitions by consolidating everything into a single libdrake.so and relying on compile-time linking to resolve duplicates. (The original issue was caused by the original pydrake bindings, which dynamically loaded individual *.so files -- each with redundant dependencies -- that could not see each other's internal symbol linkages.)

There may be a viable workaround: using dlopen's RTLD_GLOBAL flag to prevent duplicate symbols (only for the Drake libraries, and possibly the upstream dependencies).
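As a rough illustration (a minimal sketch, not Drake's actual loader code), Python can opt into RTLD_GLOBAL process-wide before importing extension modules:

```python
# Promote each subsequently-loaded extension module's symbols to the global
# scope, so later dlopen'd *.so files resolve duplicates against the same
# definitions instead of binding their own private copies.
import os
import sys

_default_flags = sys.getdlopenflags()
sys.setdlopenflags(os.RTLD_GLOBAL | os.RTLD_NOW)
try:
    import pydrake.symbolic  # example: any pydrake extension module
finally:
    sys.setdlopenflags(_default_flags)  # restore the lazy/local default
```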

Proofs of concept:

NOTE: Producers are singletons, consumers are, uh, things calling the producers from different *.so files.

Motivation: I'd like to be able to test out Python bindings (for bindings development and algorithm development) without having to recompile the entirety of libdrake.so, especially when tinkering with core components that have tons of downstream dependencies :(

Potential caveats:

\cc @soonho-tri @jwnimmer-tri

EricCousineau-TRI commented 7 years ago

Related to #5752 and #7185:

I've been tinkering with a more modular Python setup to avoid the present monolith workaround and the RTLD_GLOBAL hack, and I think I've encountered the caveats to shared libraries on Mac OSX El Capitan that others have hit: specifically, the clash between the constraints that SIP (introduced in El Capitan) places on relative RPATHs and Bazel's desire to make everything relocatable via osx_cc_wrapper.sh.

@jwnimmer-tri @jamiesnape Just to check that I'm on the right track: behind bazelbuild/bazel#492, is this also one of the constraints that makes developing with more modular shared libraries a headache? (I faintly remember a mention of shared libraries on Mac being an issue, and see the mention in drake_cc_library of mysterious dyld errors...)

I ask because I was able to get some toy Python pybind libraries to play more nicely within Bazel on Linux by just using linkshared = 1 on cc_libraries (which seemed to be fine, at least with Bazel 0.6.1): https://github.com/EricCousineau-TRI/repro/blob/8327ce6ac1e51b61799b5486fc6dbb8eaee1599c/python/bindings/pymodule/global_check/BUILD#L21
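For reference, a minimal sketch of that pattern (the target and file names here are assumptions, not the actual repro's contents): on Linux, a pybind11 module can be built as a cc_binary whose name ends in .so.

```starlark
# Hypothetical BUILD excerpt: build a pybind11 extension as a shared object.
cc_binary(
    name = "global_check.so",  # Python imports this as `global_check`
    srcs = ["global_check_py.cc"],
    linkshared = 1,  # produce a loadable *.so instead of an executable
    deps = ["@pybind11"],  # assumes a pybind11 external repository
)
```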

It seems that Tensorflow's solution is to isolate things in a container, handcraft your Python setup, or disable SIP (which does not seem like a great solution unless you're air-gapped...). OpenCV also encountered this issue, and one of the contributors seemed to work around it by doing the opposite of osx_cc_wrapper.sh: https://github.com/opencv/opencv/issues/5447. I wouldn't mind this, as we could just repurpose this script to play nicely within the build's own execroot - replacing some paths to bypass temporaries - and then perhaps figure out how to reconfigure upon installation.

Of course, that's contingent upon how bazelbuild/bazel#492 also affects this. (On Linux, at least for Bazel 0.6.1-produced libraries, it seems fine, so perhaps externals can be re-written to support this workflow?)

jwnimmer-tri commented 7 years ago

... is this also one of the constraints that makes developing with more modular shared libraries a headache? (I faintly remember a mention of shared libraries on Mac being an issue, and see the mention in drake_cc_library of mysterious dyld errors...)

Unfortunately, I don't have any details on what those challenges were. I just remember David fighting with it for days, from which my lesson was "test all shared-library-related PRs on Mac before merging".

jamiesnape commented 7 years ago

Honestly, linkshared = 1 on Mac should be fine, though how we package it may be an open question (I have not thought about this; I am hoping bazelbuild/bazel#492 helps). We have been manipulating RPATHs on Mac for the install scripts, so I think we have a reasonable understanding. osx_cc_wrapper.sh is really a solution to a platform-level problem rather than a desire of Bazel. CMake does basically the same thing (though not implemented as a wrapper, fortunately). Certainly, we do not want (or need) to do what OpenCV has done.

FYI @fbudin69500.

EricCousineau-TRI commented 7 years ago

Gotcha, thank y'all for the explanation!

Certainly, we do not want (or need) to do what OpenCV has done.

At least for Python / dlopen workflows, it seems like we may want something akin to this. When I try to build the above global_check_fail_test Python module under Mac OSX Sierra, I get the SIP error regarding relative paths: gist: bazel_pybind11_mac_sip_error

I did try linkshared = 1 on Mac for a simple cc_binary, and that did work as you mentioned - thanks!

osx_cc_wrapper.sh is really a solution to a platform-level problem rather than a desire of Bazel. CMake does basically the same thing (though not implemented as a wrapper, fortunately).

Does this mean we'd run into the same SIP-related error for dlopen as well?

jamiesnape commented 7 years ago

At least for Python / dlopen workflows, it seems like we may want something akin to this. When I try to build the above global_check_fail_test Python module under Mac OSX Sierra, I get the SIP error regarding relative paths: gist: bazel_pybind11_mac_sip_error

This is with the Homebrew Python?

EricCousineau-TRI commented 7 years ago

I guess not. Doing which python shows it at /usr/bin/python. In $(brew --prefix)/bin/, there's python2 and python3, but no python. Will try shadowing this on my $PATH and see what it does.

EDIT: That worked! Thanks! Is this documented anywhere? (That Mac users should ensure that python will point to their homebrew version?)

jamiesnape commented 7 years ago

Is this documented anywhere? (That Mac users should ensure that python will point to their homebrew version?)

It must be new in Bazel 0.6.1 that prepending the Homebrew Python to the PATH actually works: https://github.com/RobotLocomotion/drake/issues/6686

EricCousineau-TRI commented 7 years ago

Ah, that makes more sense now, thank you for pointing that out!

Also, looking more at the Bazel issue on transitive linking, it looks like one option for development would be to always use shared linking, and then once we want to start installing artifacts, force linkstatic = 1, which can still consume the same targets. We could thus use that on something like libdrake.so in the interim, until that issue is somehow resolved (or until we manually make more granular shared libraries).

For Python, a *.bzl macro file can offer a simple switch, or even a config setting, that by default builds everything as-is, linking against libdrake.so. When turned on, it would (a) link Python against the granular shared objects and (b) most likely break installation of these artifacts, which is fine by me as it would be a development-only workflow, and it would pave the way to breaking up the Python dependencies once we have something more granular.

The main caveat is installed binary dependencies, but meh. For drake_cc_library, we could have linkstatic = 0 by default, and for drake_cc_binary, we could have linkstatic = 1 by default. Will briefly test this out and see how it affects Python development.
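For concreteness, a minimal sketch of what such a switch could look like (the PYDRAKE_DEVEL define and all target names here are assumptions, not actual Drake code):

```starlark
# Hypothetical BUILD excerpt: a --define flag toggles devel mode.
config_setting(
    name = "devel_mode",
    define_values = {"PYDRAKE_DEVEL": "ON"},  # enable via --define=PYDRAKE_DEVEL=ON
)

cc_binary(
    name = "math_py.so",  # hypothetical pybind11 module for drake/math
    srcs = ["math_py.cc"],
    linkshared = 1,
    deps = select({
        # Devel mode: link directly against the granular shared target.
        ":devel_mode": ["//math"],
        # Default (installable): link against the monolithic shared library.
        "//conditions:default": ["//:drake_shared_library"],
    }),
)
```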

EricCousineau-TRI commented 7 years ago

I have prototyped two things: (a) a devel switch that allows the Python bindings to use the Bazel cc_library intermediates while ensuring that install targets maintain minimal-target-count integrity (i.e., they still link against libdrake.so), and (b) a new layout for the pybind11 modules, specifically looking towards (i) putting the bindings in-line with the existing source (if we wish to go that way) and (ii) making the structure more like the rest of the Bazel code base: less monolithic and more modular.

I will test this on Mac to ensure that it does not mess anything up there.

Prototype WIP branch: https://github.com/RobotLocomotion/drake/compare/master...EricCousineau-TRI:feature/py_system_bindings-wip?expand=1 (NOTE: I removed some of the Python bindings to quickly test out a few things. If this is something to be taken to the next step, I would reinstate them of course.)

EricCousineau-TRI commented 7 years ago

After tuning some of the Bazel-built / -linked external libraries to use more consistent linking options, this approach now works on Ubuntu and on Mac, in both non-devel (including installation) and devel modes.

My last item is to ensure that this does not violate ODR in either development or installation mode. I will test this using pydrake.symbolic, given that it has an easily visible global: the symbolic::Variable ID counter.

If this is successful, I would like to start using this setup as soon as possible in master, and will sequence out the possible PR train.

EDIT: The ODR test passes and works as expected. When using "development mode" but with static libraries (the present state), the ODR test fails: creating two Variables from two separate modules increments two separate instances of Variable::get_next_id(). Using shared libraries (as in the prototype) Does the Right Thing and calls the same instance of Variable::get_next_id().
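For concreteness, a minimal sketch of that kind of check (the second extension module here is hypothetical; the real test lives on the WIP branch):

```python
# Two extension modules that both link the symbolic library should share one
# Variable::get_next_id() counter; under an ODR violation, each module
# increments its own private copy, so IDs come from separate sequences.
from pydrake.symbolic import Variable

import other_bindings  # hypothetical second *.so that also links symbolic

v1 = Variable("x")                      # drawn from the counter
v2 = other_bindings.make_variable("y")  # should come from the *same* counter
assert v2.get_id() == v1.get_id() + 1, "two counters => ODR violation"
```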

EricCousineau-TRI commented 7 years ago

Proposed PRs (numbers are separate steps; letters indicate possible parallelism). Note that these address the general organization of the Python bindings as well, but with a focus on modularity.

1.a. Merge drake/bindings/python/pydrake and drake/bindings/pybind11, and move them to drake/bindings/pydrake.
1.b. Teach drake.bzl to default to using shared libraries.

2. Separate pydrake into more modular components, still consuming libdrake.so. Have each submodule of pydrake generate its own install directive to make installation less hacky.

3. Teach pydrake an ENABLE_DEVEL_MODE mode, which will directly consume Drake internal targets as shared objects, for quicker turnaround when developing leaf portions, without needing to break libdrake.so apart.

Based off of the following WIP: branch | commit

fbudin69500 commented 7 years ago

@EricCousineau-TRI Concerning the SIP error caused by the relative path: I am about to submit a patch which I think will solve this issue. It will remove the relative RPATHs that were added at build time and are therefore irrelevant after install. I'll add the PR# below.

EricCousineau-TRI commented 7 years ago

Awesome, thanks!

fbudin69500 commented 7 years ago

Here is the PR I was talking about: #7301. It may not solve all your problems, but it may help.

soonho-tri commented 6 years ago

(I faintly remember a mention of shared libraries on Mac being an issue, and see the mention in drake_cc_library of mysterious dyld errors...)

FYI, I think the problem on Mac is fixed. See https://github.com/bazelbuild/bazel/issues/507 and https://github.com/bazelbuild/bazel/issues/3450. I think the patch was applied to bazel-0.6.0.

I've made the following changes to test it on my Mac. I've checked that bazel test //solvers/... -c dbg ran without an error.

```diff
diff --git a/tools/skylark/drake_cc.bzl b/tools/skylark/drake_cc.bzl
index e42302a8e..1c45948a9 100644
--- a/tools/skylark/drake_cc.bzl
+++ b/tools/skylark/drake_cc.bzl
@@ -310,7 +310,7 @@ def drake_cc_library(
         deps = [],
         copts = [],
         gcc_copts = [],
-        linkstatic = 1,
+        linkstatic = 0,
         install_hdrs_exclude = [],
         **kwargs):
     """Creates a rule to declare a C++ library.
diff --git a/tools/workspace/gtest/gtest.BUILD.bazel b/tools/workspace/gtest/gtest.BUILD.bazel
index 4cf13b543..239567be7 100644
--- a/tools/workspace/gtest/gtest.BUILD.bazel
+++ b/tools/workspace/gtest/gtest.BUILD.bazel
@@ -42,7 +42,7 @@ cc_library(
         # This is a bazel-default rule, and does not need @drake//
         "@//conditions:default": [],
     }),
-    linkstatic = 1,
+    # linkstatic = 1,
     visibility = ["//visibility:public"],
 )
```

P.S. I used bazel-0.9.0-homebrew in the test.

mpetersen94 commented 6 years ago

I've been writing code that uses Drake as an external and creating Python bindings alongside the C++ code. I keep running into weird bugs when building the project due to linking with @drake//:shared_library (e.g., a cc_test fails because the library it is testing depends on a library that depends on @drake//:shared_library). The example code in drake-external-examples suggests that I need to link against it for any library that I want to create Python bindings of, but doing so has caused more compiler errors than excluding it from the dependencies has.
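Roughly, the layout that triggers this looks like the following (a hypothetical sketch; the target names are made up), where the same Drake object code can end up linked both through the shared library and through static targets:

```starlark
# Hypothetical downstream BUILD excerpt showing the diamond.
cc_library(
    name = "my_controller",
    srcs = ["my_controller.cc"],
    deps = ["@drake//:shared_library"],  # pulls in monolithic libdrake.so
)

cc_test(
    name = "my_controller_test",
    srcs = ["my_controller_test.cc"],
    deps = [
        ":my_controller",
        # If another dep reaches the same Drake code via static targets,
        # a second copy of that object code (and its globals) gets linked in.
        "@drake//common:essential",  # hypothetical static target
    ],
)
```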

EricCousineau-TRI commented 5 years ago

Perhaps to tackle this from a different angle: @jwnimmer-tri had suggested a ways back that we let Bazel analyze what would need to be stripped from linking to avoid the dreaded diamond dependency problem.

I may file another issue as an alternative route, or rename this to be more general, like Make pydrake more granular.

I'm motivated by this because it's painful in Anzu to have C++ code that could be bound in Python quite easily, but we don't because there's overhead in having to avoid ODR violations from linking both static and shared lib portions of Drake.

\cc @kunimatsu-tri @thduynguyen @siyuanfeng-tri

EDIT: Yeah, I'm just gonna rename this issue.

EDIT 2: For diamond deps, we could also do whatever is necessary to make a re-compiled version of the library intended for shared linking (i.e., recompiling the C++ sources).

jwnimmer-tri commented 4 years ago

I would like to see https://github.com/bazelbuild/rules_cc/blob/master/examples/experimental_cc_shared_library.bzl become supported (non-experimental) before we delve too deeply into more than one shared library inside Drake.

EricCousineau-TRI commented 4 years ago

Can I ask what the technical benefits are of waiting on experimental_cc_shared_library? If they're a huge win, like solving the diamond dependency bits and being able to access the raw .a files, then I'm down.

But if it's just a moderate improvement in semantics (e.g. fewer wrapper rules), could we perhaps timebox how long we wait? Ideally, I'd like to start carving up our shared library parts starting at the end of January / start of Feb.

EricCousineau-TRI commented 4 years ago

Looking at the code, it seems like it does solve the diamond dep problem? (or at least, fail fast if there's overlap?)

I would like to start tinkering with it, even if it's experimental, starting at that time. (Assuming that it works, of course.)

jwnimmer-tri commented 4 years ago

Correct, it (I think) deals with diamond deps / ODR. I am loath to have us try to deal with ODR bespokely -- we need Starlark tooling for that, and I'm hoping we can just use upstream tooling instead of writing our own.

We are also going to have to do something about circular package dependencies inside libdrake before we can finish carving it up.

EricCousineau-TRI commented 4 years ago

As an alternative to solving ODR, we should consider defining the public API for our Bazel targets and stop using a default visibility of public (if that is possible).
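A minimal sketch of that idea (package layout and target names are hypothetical):

```starlark
# Hypothetical package: private by default, with a curated public surface.
package(default_visibility = ["//visibility:private"])

cc_library(
    name = "math",
    hdrs = ["math.h"],
    srcs = ["math.cc"],
    visibility = ["//visibility:public"],  # explicitly exported API target
)

cc_library(
    name = "math_internal",
    srcs = ["math_internal.cc"],
    # No visibility attribute: stays private to this package.
)
```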

jwnimmer-tri commented 2 years ago

One thing we could consider is trying to accelerate local development: during bazel test //bindings/pydrake/..., we could link the Drake object code differently than in a wheel release, i.e., somehow more incrementally. For example, each unit test could link and load only and exactly the object code it needs. This might make the iteration cycle shorter for small changes (to, e.g., drake/math vs pydrake's math_test.py). But we'd want to do this without changing the more monolithic nature of the final distribution, in order to avoid all of the headaches noted above.

However, note that per #11802, we are going to have fewer, larger modules overall. Trying to tie that back into C++ dependencies is probably not a lot of gain anymore. To accelerate development, I'd much rather invest in our remote build infra, which ships all of the work to the cloud anyway. That would fix both build latency and test latency, allow for shared caching, etc. Much higher payoff than diddling with BUILD deps that are difficult to tease apart and easy to get wrong in obscure ways.

The other related idea is that we could try to modularize Drake itself (by having a few large C++ shared libraries, and having pydrake then align with those). The discussions in #15735 relate to this issue somewhat. I'd rather just leave that to whatever is best for C++, though, and then Python can ride that wave if we like. I don't think the Python build system alone is enough to justify a C++ library rework.

Given all of that, I'd like to advocate that we close this issue. I don't see any good bang for the buck on the horizon here. WDYT?

EricCousineau-TRI commented 2 years ago

Yup, SGTM to close it - thanks for the analysis and related issues!