Closed: EricCousineau-TRI closed this issue 2 years ago.
Related to #5752 and #7185:
I've been tinkering with a more modular Python setup to avoid the present monolith workaround and the `RTLD_GLOBAL` hack, and I think I've encountered the shared-library caveats on Mac OSX El Capitan that others have encountered: specifically, the clash between El Capitan's introduction of SIP's constraints on relative RPATHs and Bazel's desire to make everything relocatable via `osx_cc_wrapper.sh`.
@jwnimmer-tri @jamiesnape Just to check if I'm on the right track: behind bazelbuild/bazel#492, is this also one of the constraints that makes developing with more modular shared libraries a headache? (I faintly remember a mention of shared libraries on Mac being an issue, and see the mention in `drake_cc_library` of mysterious `dyld` errors...)
I ask because I was able to get some toy Python pybind libraries to play more nicely within Bazel on Linux by just using `linkshared = 1` on `cc_library` targets (which seemed to be fine, at least with Bazel 0.6.1):
https://github.com/EricCousineau-TRI/repro/blob/8327ce6ac1e51b61799b5486fc6dbb8eaee1599c/python/bindings/pymodule/global_check/BUILD#L21
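For reference, a minimal sketch of that arrangement (all target and file names here are illustrative, not taken from the linked repro): in Bazel, `linkshared` is an attribute of `cc_binary`, which emits a shared object that Python code can then load.

```starlark
# Hypothetical BUILD fragment: build C++ code as a shared object via
# cc_binary(linkshared = 1), then make it available at runtime to a
# py_library that imports / dlopens it. Names are illustrative.
cc_binary(
    name = "libexample.so",
    srcs = ["example.cc"],
    linkshared = 1,
)

py_library(
    name = "example_py",
    data = [":libexample.so"],  # shipped as a runtime data dependency
)
```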
It seems that Tensorflow's solutions are to isolate it in a container, handcraft your Python setup, or disable SIP (which does not seem like a great option unless you're air-gapped...)
OpenCV also encountered this issue, and one of the contributors seemed to work around it by doing the opposite of `osx_cc_wrapper.sh`:
https://github.com/opencv/opencv/issues/5447
I wouldn't mind this, as we could just repurpose this script to play nicely within the build's own execroot - replacing some paths to bypass temporaries - and then perhaps figure out how to reconfigure upon installation.
Of course, that's contingent upon how #492 also affects this. (On Linux, at least for Bazel 0.6.1-produced libraries, it seems fine, so perhaps externals can be rewritten to support this workflow?)
> ... is this also one of the constraints that makes developing with more modular shared libraries a headache? (I faintly remember a mention of shared libraries on Mac being an issue, and see the mention in `drake_cc_library` of mysterious `dyld` errors...)
Unfortunately, I don't have any details on what those challenges were. I just remember David fighting with it for days, from which my lesson was "test all shared-library-related PRs on Mac before merging".
Honestly, `linkshared = 1` on Mac should be fine, though how we package it may be an open question (I have not thought about this; hoping bazelbuild/bazel#492 helps). We have been manipulating RPATHs on Mac for the install scripts, so I think we have a reasonable understanding. `osx_cc_wrapper.sh` is really a solution to a platform-level problem rather than a desire of Bazel. CMake does basically the same thing (though not implemented as a wrapper, fortunately). Certainly, we do not want (or need) to do what OpenCV has done.
FYI @fbudin69500.
Gotcha, thank y'all for the explanation!
> Certainly, we do not want (or need) to do what OpenCV has done.
At least for Python / `dlopen` workflows, it seems like we may want something akin to this. When I try to build the above `global_check_fail_test` Python module under Mac OSX Sierra, I get the SIP error regarding relative paths:
gist: bazel_pybind11_mac_sip_error
I did try `linkshared = 1` on Mac for a simple `cc_binary`, and that did work as you mentioned - thanks!
> `osx_cc_wrapper.sh` is really a solution to a platform-level problem rather than a desire of Bazel. CMake does basically the same thing (though not implemented as a wrapper, fortunately).
Does this mean we'd run into the same SIP-related error for `dlopen` as well?
> At least for Python / `dlopen` workflows, it seems like we may want something akin to this. When I try to build the above `global_check_fail_test` Python module under Mac OSX Sierra, I get the SIP error regarding relative paths: gist: bazel_pybind11_mac_sip_error
This is with the Homebrew Python?
I guess not. Doing `which python` shows it at `/usr/bin/python`.
In `$(brew --prefix)/bin/`, there's `python2` and `python3`, but no `python`. Will try shadowing this on my `$PATH` and see what it does.
EDIT: That worked! Thanks!
Is this documented anywhere? (That Mac users should ensure that `python` will point to their Homebrew version?)
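As an aside, the `$PATH` shadowing mentioned above can be illustrated in pure Python (the directory and fake interpreter here are fabricated purely for demonstration):

```python
import os
import shutil
import stat
import tempfile

# Sketch: prepending a directory to PATH changes which "python" is resolved,
# mimicking shadowing /usr/bin/python with a Homebrew interpreter.
d = tempfile.mkdtemp()
fake = os.path.join(d, "python")
with open(fake, "w") as f:
    f.write("#!/bin/sh\n")  # stand-in for a real interpreter
os.chmod(fake, 0o755)  # must be executable to be found by which()

# Resolve "python" against a PATH with our directory prepended.
path = d + os.pathsep + os.environ.get("PATH", "")
print(shutil.which("python", path=path))  # resolves to the prepended copy
```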
> Is this documented anywhere? (That Mac users should ensure that `python` will point to their Homebrew version?)
It must be new in Bazel 0.6.1 that (prepending the Homebrew Python to the PATH) actually works: https://github.com/RobotLocomotion/drake/issues/6686
Ah, that makes more sense now, thank you for pointing that out!
Also, looking more at the Bazel issue on transitive linking, it looks like one option for development would be to always use shared linking, but once we want to start installing artifacts, we force `linkstatic = 1`, which can still consume the same targets; thus we could use it on something like `libdrake.so` in the interim until that issue is somehow resolved (or we manually make more granular shared libraries).
For Python, in a `*.bzl` macro file, there can be a simple switch, or even a config setting, that by default will build everything as-is, linking against `libdrake.so`; but if turned on, it will (a) link Python to shared objects, and (b) most likely break installation of these artifacts, which is fine by me as it'd be a development-only workflow, and pave the way to breaking up the Python dependencies once we have something more granular.
The main caveat is installed binary dependencies, but meh.
For `drake_cc_library`, we could have `link_static = 0` by default, and for `drake_cc_binary`, we could have `link_static = 1` by default.
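A minimal sketch of what those defaults might look like in a `*.bzl` macro file (signatures heavily simplified; the real Drake macros take many more arguments, and the actual Bazel attribute is spelled `linkstatic`):

```starlark
# Hypothetical macro defaults (simplified): libraries default to shared
# linking, binaries default to static linking.
def drake_cc_library(name, linkstatic = 0, **kwargs):
    native.cc_library(name = name, linkstatic = linkstatic, **kwargs)

def drake_cc_binary(name, linkstatic = 1, **kwargs):
    native.cc_binary(name = name, linkstatic = linkstatic, **kwargs)
```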
Will briefly test this out and see how it affects Python development.
I have prototyped two things: (a) adding some sort of `devel` switch to allow the Python bindings to use the Bazel `cc_library` intermediates, while also ensuring that `install` targets maintain minimal-target-count integrity (i.e. still link against `libdrake.so`), and (b) a new layout in the `pybind11` modules, specifically looking towards (i) putting the bindings in-line with the existing source (if we wish to go that way) and (ii) making the structure more like the rest of the Bazel code base by making it less monolithic and more modular.
I will test this on Mac to ensure that it does not mess anything up there.
Prototype WIP branch: https://github.com/RobotLocomotion/drake/compare/master...EricCousineau-TRI:feature/py_system_bindings-wip?expand=1 (NOTE: I removed some of the Python bindings to quickly test out a few things. If this is something to be taken to the next step, I would of course reinstate them.)
After tuning some of the Bazel-built / -linked external libraries to use more consistent linking options, this approach now works on Ubuntu and on Mac in both non-devel (including installation) and devel mode.
My last item is to ensure that this does not violate ODR in either development or installation mode; I will test this using `pydrake.symbolic`, given that it has an easily visible global, the `symbolic::Variable` ID.
If this is successful, I would like to start using this setup as soon as possible in `master`, and will sequence out the possible PR train.
EDIT: The ODR test passes and works as expected. When using "development mode" but with static libraries (the present state), the ODR test fails - creating two Variables from two separate modules increments two separate instances of `Variable::get_next_id()`; using shared libraries (in the prototype) Does the Right Thing and calls the same instance of `Variable::get_next_id()`.
Proposed PRs: (numbers are separate steps; letters indicate possible parallelism) Note that these address the general organization of the Python bindings as well, but with a focus on modularity.
1.a. Merge `drake/bindings/python/pydrake` and `drake/bindings/pybind11`, and move them to `drake/bindings/pydrake`.
1.b. Teach `drake.bzl` to default to using shared libraries.
1.c. `drake_cc_external_library()`, to prevent externals from leaking (see below). ~This can pave the way for #7294, by providing something like `cc_shared_library()` (the workaround mentioned in bazelbuild/bazel#492), which would allow for internal and external consumption with transitive linking, and should be robust for installation. (Example)~ (Too basic, misses the diamond trap. However, good for simple externals like `fmt`.)
1.d. `//:install`, similar to Francois's recent PR, and ensure that `ldd` / `otool -L` for all / some targets have only installed targets as dependencies, e.g. no intermediate odd Bazel targets, for both internal and external libraries (like `_solib_k8/_U@vtk _S_S_Cvtkglew___Uexternal_Svtk_Slib/libvtkglew-8.0.so.1`).
2. Separate `pydrake` into more modular components, still consuming `libdrake.so`. Have each submodule of `pydrake` generate its own install directive to make installation less hacky.
3. Teach `pydrake` an `ENABLE_DEVEL_MODE` mode, which will directly consume Drake internal targets as shared objects, for quicker turnaround when developing leaf portions, without the need to break into `libdrake.so` and cut it up.
@EricCousineau-TRI Concerning the SIP error because of the relative path, I am about to submit a patch which I think will solve this issue. It will remove the relative rpaths that have been added at build time and therefore are irrelevant after install. I'll add the PR# below.
Awesome, thanks!
Here is the PR I was talking about: #7301 . It may not solve all your problems, but it may help.
> (I faintly remember a mention of shared libraries on Mac being an issue, and see the mention in `drake_cc_library` of mysterious `dyld` errors...)
FYI, I think the problem on Mac is fixed. See https://github.com/bazelbuild/bazel/issues/507 and https://github.com/bazelbuild/bazel/issues/3450. I think the patch was applied to `bazel-0.6.0`.
I've made the following changes to test it on my Mac. I've checked that `bazel test //solvers/... -c dbg` ran without an error.
```diff
diff --git a/tools/skylark/drake_cc.bzl b/tools/skylark/drake_cc.bzl
index e42302a8e..1c45948a9 100644
--- a/tools/skylark/drake_cc.bzl
+++ b/tools/skylark/drake_cc.bzl
@@ -310,7 +310,7 @@ def drake_cc_library(
         deps = [],
         copts = [],
         gcc_copts = [],
-        linkstatic = 1,
+        linkstatic = 0,
         install_hdrs_exclude = [],
         **kwargs):
     """Creates a rule to declare a C++ library.
diff --git a/tools/workspace/gtest/gtest.BUILD.bazel b/tools/workspace/gtest/gtest.BUILD.bazel
index 4cf13b543..239567be7 100644
--- a/tools/workspace/gtest/gtest.BUILD.bazel
+++ b/tools/workspace/gtest/gtest.BUILD.bazel
@@ -42,7 +42,7 @@ cc_library(
         # This is a bazel-default rule, and does not need @drake//
         "@//conditions:default": [],
     }),
-    linkstatic = 1,
+    # linkstatic = 1,
     visibility = ["//visibility:public"],
 )
```
P.S. I used `bazel-0.9.0-homebrew` in the test.
I've been writing code that uses Drake as an external, and I have been creating Python bindings alongside the C++ code. I keep running into weird bugs when building the project due to linking with `@drake//:shared_library` (i.e. a `cc_test` fails because the library it is testing depends on a library that depends on `@drake//:shared_library`). The example code on drake-external-examples suggests I need to link against it for any library that I want to create bindings of, but doing so has caused more compiler errors than excluding it from the dependencies has.
Perhaps to tackle this from a different angle: @jwnimmer-tri had suggested a while back that we perhaps let Bazel analyze what would need to be stripped from linking to avoid the dreaded diamond dependency problem.
I may file another issue as an alternative route, or rename this to be more general, like "Make pydrake more granular".
I'm motivated by this because it's painful in Anzu to have C++ code that could be bound in Python quite easily, but we don't because there's overhead in having to avoid ODR violations from linking both static and shared lib portions of Drake.
\cc @kunimatsu-tri @thduynguyen @siyuanfeng-tri
EDIT: Yeah, I'm just gonna rename this issue.
EDIT 2: For diamond deps, could also do whatever is necessary to make a re-compiled version of the library that is for shared linking (Cc sources recompilation).
I would like to see https://github.com/bazelbuild/rules_cc/blob/master/examples/experimental_cc_shared_library.bzl become supported (non-experimental) before we delve too deeply into more than one shared library inside Drake.
Can I ask what the technical benefits are of waiting on `experimental_cc_shared_library`? If they're a huge win, like solving the diamond dependency bits and being able to access the raw `.a` files, then I'm down.
But if it's just a moderate improvement in semantics (e.g. fewer wrapper rules), could we perhaps timebox how long we wait? Ideally, I'd like to start carving up our shared library parts starting at the end of January / start of Feb.
Looking at the code, it seems like it does solve the diamond dep problem? (or at least, fail fast if there's overlap?)
I would like to start tinkering with it, even if it's experimental, starting at that time. (Assuming that it works, of course.)
Correct, it (I think) deals with diamond deps / ODR. I am loath to have us try to deal with ODR bespokely -- we need Starlark tooling for that, and I'm hoping we can just use upstream tooling instead of writing our own.
We are also going to have to do something about circular package dependencies inside libdrake before we can finish carving it up.
As the alternative to solving ODR, we should consider defining the public API for our Bazel targets, and stop using a default visibility of public (if that is possible).
One thing we could consider is trying to accelerate local development -- during `bazel test //bindings/pydrake/...`, we could link the Drake object code differently than in a wheel release, i.e., somehow more incrementally. For example, each unit test could link and load only and exactly the object code it needed. This might make the iteration cycle shorter for small changes (to, e.g., drake/math vs pydrake's math_test.py). But we'd want to do this without changing the more monolithic nature of the final distribution, in order to avoid all of the headaches noted above. However, note that per #11802, we are going to have fewer, larger modules overall. Trying to tie that back into C++ dependencies is probably not a lot of gain anymore. To accelerate development, I'd much rather invest in our remote build infra, which ships all of the work to the cloud anyway. That would fix both build latency and test latency, allow for shared caching, etc. Much higher payoff than diddling with BUILD deps that are difficult to tease apart and easy to get wrong in obscure ways.
The other related idea is that we could try to modularize Drake itself (by having a few large C++ shared libraries, and having pydrake then align with those). The discussions #15735 relate to this issue somewhat. I'd rather just leave that to whatever is best for C++ though, and then Python can ride that wave if we like. I don't think the Python build system alone is enough to justify a C++ library rework.
Given all of that, I'd like to advocate that we close this issue. I don't see any good bang for the buck on the horizon here. WDYT?
Yup, SGTM to close it - thanks for the analysis and related issues!
UPDATE (2019/08/22):
Per this comment below, I've renamed this to address the real problem, which is that `//bindings/pydrake` is (properly) encapsulated due to the linking of monolithic `libdrake.so`.
At some point, I (and others) want to consume parts of Drake for Python (e.g. just `//math`). I hate having to build all of `libdrake.so` just so I can import `RigidTransform`. This also makes ODR issues a pain when doing work in Anzu or other downstream projects (like what Mark mentioned below).

OLD Discussion:
The goal of PR #6465 was to mitigate duplicate global variable definitions by relying on compile-time linking to resolve duplicates, consolidating everything to link against `libdrake.so`. (The original issue was caused by the original `pydrake` bindings, which dynamically loaded individual `*.so`s -- which would not see each other's internal symbol linkages -- that had redundant dependencies.)
There may be a viable workaround, using `dlopen`'s `RTLD_GLOBAL` flag to prevent duplicate symbols (only for the `drake` libraries, and possibly the upstream dependencies).
Proofs of concept:
- `-rdynamic`, `RTLD_GLOBAL`, and potentially reloading a `*.so` to be global (if we kludge up too much by switching Python's import mechanisms)
- `pybind11` Case: example output in code directory
NOTE: Producers are singletons; consumers are, uh, things calling the producers from different `*.so` files.
Motivation: I'd like to be able to test out Python bindings (for development and algorithm development) without having to recompile the entirety of `libdrake.so`, especially if tinkering with core components with tons of downstream dependencies :(
Potential caveats:
- ... when `RTLD_GLOBAL` is enabled, possibly due to symbol collision. This may cause those errors if we are not conservative with how we load this.
\cc @soonho-tri @jwnimmer-tri
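For context, on CPython the `RTLD_GLOBAL` workaround is usually applied by toggling the interpreter's dlopen flags around an import; a minimal sketch (the module to import is omitted here, since which modules are safe to promote globally is exactly the open question above):

```python
import os
import sys

# Sketch of the RTLD_GLOBAL workaround: temporarily set RTLD_GLOBAL so that
# an extension module imported inside the block exposes its symbols to the
# global namespace, letting later-loaded *.so files resolve against the same
# definitions instead of duplicating them.
flags = sys.getdlopenflags()
sys.setdlopenflags(flags | os.RTLD_GLOBAL)
try:
    # e.g. `import some_extension_module` would go here (omitted; which
    # modules this is safe for is the open question in the discussion)
    pass
finally:
    # Restore the original flags so unrelated imports are not made global,
    # limiting the symbol-collision risk noted under "Potential caveats".
    sys.setdlopenflags(flags)
```

Being conservative about which imports run with the modified flags is what limits the symbol-collision caveat noted above.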