RussTedrake opened 2 years ago
I did an experiment on my Ubuntu 22.04 Jammy workstation, with our default BLAS (Netlib) installed. I ran the benchmark program as of today (c8a004d7) with and without `EIGEN_USE_BLAS`.
The benchmark command:
```
time bazel-bin/systems/benchmarking/multilayer_perceptron_performance --batch_size=1000 --iterations=10
```
To turn on BLAS, I added this patch to my BUILD file:
```diff
--- a/systems/primitives/BUILD.bazel
+++ b/systems/primitives/BUILD.bazel
@@ -201,6 +201,8 @@ drake_cc_library(
 drake_cc_library(
     name = "multilayer_perceptron",
+    copts = ["-DEIGEN_USE_BLAS"],
+    linkopts = ["-lblas"],
     srcs = ["multilayer_perceptron.cc"],
     hdrs = ["multilayer_perceptron.h"],
     deps = [
```
With current master, it completed in approx 1.3s. With EIGEN_USE_BLAS, it completed in approx 3.0s.
There's likely some measurement error in there with no googlebench, but calling it a >2x slowdown with Netlib BLAS seems like a safe bet.
If I install Atlas BLAS instead (`sudo apt install libatlas-base-dev`), then the time goes back to approx 1.3s, the same as without BLAS.
Given that, I'd say for our default Ubuntu (and cloud) users, `EIGEN_USE_BLAS` does not seem to be worth enabling.
Perhaps there are different benchmark flags (different batch sizes) where BLAS still shows a benefit? I didn't try any different command-line options.
It's probable that the macOS Accelerate framework gives a much better boost here than we see on Ubuntu, by specializing for the hardware. (That's the BLAS we link to on macOS.) Perhaps the non-BLAS source builds also end up better optimized for Ubuntu in the first place (more mature tools / platform), compared with macOS.
@RussTedrake were your earlier measurements during #16683 on an x86_64 mac or an arm64 mac (last February)? I'm not particularly keen on trying to optimize Drake on the legacy x86_64 macs, but it might still be worth exploring this ticket for arm64 macs. (We could make the flag conditional on platform.)
On the plus side, I tried building with `-march=broadwell` and no BLAS on Ubuntu, and the time dropped to approx 0.9s. I'll revisit Eigen vectorization / CPU options in a different ticket.
I put some time into researching how to change various Eigen settings for speed but still remain safe to compile.
There are two main points to consider: layout and ODR.
The first question is so-called "layout" -- how the structured data is stored in memory: whether all of the TUs agree on how to access it from their low-level machine code, what size it's aligned to, and how to allocate / reallocate / free it. By default, changing SIMD flags (e.g., using AVX on x86_64) will affect Eigen's layout (see #14603), but turning on BLAS does not affect layout.
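To make that concrete, here's a toy program (my illustration, not Drake code) that reports the layout-relevant constants; build it once with `-mavx` and once without, and the two resulting TUs disagree about the very same type:

```cpp
#include <Eigen/Core>
#include <iostream>

// Illustration only: compiled without AVX this typically prints 16 for both
// values; compiled with -mavx it prints 32. Two such TUs disagree on the
// alignment (and allocator behavior) of the same Eigen type, which is
// exactly the cross-TU layout hazard described above.
int main() {
  std::cout << "EIGEN_MAX_ALIGN_BYTES = " << EIGEN_MAX_ALIGN_BYTES << "\n";
  std::cout << "alignof(Eigen::Vector4d) = " << alignof(Eigen::Vector4d)
            << "\n";
  return 0;
}
```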
The way around the layout problem is to firewall the layout-differing TUs with header files that only ever mention layout-safe Eigen types, such as this one:
```cpp
#include <Eigen/Core>

/// A matrix of dynamic size, templated on scalar type, that opts out of any
/// memory-alignment guarantees.
template <typename Scalar, int Cols = Eigen::Dynamic>
using AbiSafeMatrix =
    Eigen::Matrix<Scalar, Eigen::Dynamic, Cols, Eigen::DontAlign>;
```
With that, we define helper functions like this one:
```cpp
#include <type_traits>

/* Calculates Wₙxₙ+bₙ (notably, without any activation).
@tparam_default_scalar */
template <typename T>
void Calc_WX_plus_b(
    const Eigen::Map<const AbiSafeMatrix<T>>& W,
    const AbiSafeMatrix<T>& X,
    const Eigen::Map<const AbiSafeMatrix<T, 1>>& b,
    AbiSafeMatrix<T>* Wx,
    AbiSafeMatrix<T>* Wx_plus_b) {
  if constexpr (std::is_same_v<T, double>) {
    const Eigen::Index Wx_rows = W.rows();
    const Eigen::Index Wx_cols = X.cols();
    Wx->resize(Wx_rows, Wx_cols);
    Wx_plus_b->resize(Wx_rows, Wx_cols);
    // This is a function call into a different TU; it will use SIMD operations.
    double_operator_WX_plus_b{W, X, b, Wx, Wx_plus_b}.Calc();
  } else {
    Wx->noalias() = W * X;
    Wx_plus_b->noalias() = Wx->colwise() + b;
  }
}
```
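As a hedged sketch (the variable names here are mine, not Drake's), a caller in a default-flags TU could then wrap plain buffers in Maps of the layout-safe types:

```cpp
#include <vector>
// ...plus the AbiSafeMatrix and Calc_WX_plus_b declarations above.

// Caller sketch (hypothetical names): the raw storage is plain
// malloc-backed, and the Map types promise no over-alignment, so the
// SIMD/BLAS TU on the other side must use unaligned loads & stores.
void CallerExample() {
  const int rows = 64, cols = 32, batch = 1000;
  std::vector<double> W_data(rows * cols, 0.5);
  std::vector<double> b_data(rows, 0.1);
  const AbiSafeMatrix<double> X = AbiSafeMatrix<double>::Random(cols, batch);
  AbiSafeMatrix<double> Wx;
  AbiSafeMatrix<double> Wx_plus_b;
  Calc_WX_plus_b<double>(
      Eigen::Map<const AbiSafeMatrix<double>>(W_data.data(), rows, cols),
      X,
      Eigen::Map<const AbiSafeMatrix<double, 1>>(b_data.data(), rows),
      &Wx, &Wx_plus_b);
}
```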
By turning on `Eigen::DontAlign`, we ensure that the layout seen from SIMD and non-SIMD TUs is the same -- it always uses `malloc` instead of `handmade_aligned_malloc`, and packet math operations that access the data know that they should use unaligned load & store instructions (since the non-SIMD caller hasn't necessarily over-aligned the allocated data).
It's somewhat awkward to spell everything using those kinds of matrix types, but it's a plausible amount of boilerplate to live with for the few core modules where we need a lot of flops throughput.
So that solves the layout problem (remember: only for SIMD; BLAS doesn't have layout problems).
For both SIMD and BLAS, we definitely do have real and problematic ODR violations.
The essential outcome we want is that for TUs that use non-default flags for Eigen (e.g., enabling SIMD or BLAS), the object code should behave as if Eigen had been declared inside an anonymous namespace, i.e., `namespace { namespace Eigen { ... } }` -- its mangled symbol names differ from the standard ones, and they are private to the TU.
Ideally, things like the global variables for the CPUID cache readouts (`_ZZN5Eigen8internal20manage_caching_sizesENS_6ActionEPlS2_S2_E12m_cacheSizes`) would be neither anonymous nor mangled, but it's probably fine either way.
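In other words, the as-if transformation we want looks roughly like this (illustrative only; nobody would literally write it this way):

```cpp
// Illustrative only: with internal linkage, every Eigen symbol in this TU
// gets a distinct mangled name and is private, so the specially-compiled
// copy of Eigen cannot collide with the default copy at link time.
namespace {
namespace Eigen {
// ... the Eigen headers, as seen by this one TU ...
}  // namespace Eigen
}  // namespace
```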
I tried a few things to get around this with preprocessor funny business, but I'm pretty convinced all of those are destined to fail and be brittle over time. I think the only plausible way to fix this in source code is to work with Eigen upstream to add `EIGEN_NAMESPACE_BEGIN` preprocessor hooks, so that we could officially swap the namespace for something different.
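A sketch of what such a hook could look like (this is a proposal, not an existing Eigen macro):

```cpp
// Hypothetical upstream hook (proposed, not existing Eigen API). Eigen's
// headers would open and close their namespace through these macros:
#ifndef EIGEN_NAMESPACE_BEGIN
#define EIGEN_NAMESPACE_BEGIN namespace Eigen {
#define EIGEN_NAMESPACE_END }
#endif

// A downstream build could then predefine them to hide its private copy,
// e.g. on the compiler command line:
//   -DEIGEN_NAMESPACE_BEGIN='namespace { namespace Eigen {'
//   -DEIGEN_NAMESPACE_END='} }'
```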
The other way to fix ODR is down near the linker. It looks like (at least on Ubuntu) doing `objcopy --redefine-sym` and `--rename-section` is enough to swap out the namespace. We'd also need to demote a few things to local visibility, but that's common and easy. The other trick is to mark the functions defined in these "special" TUs with `__attribute__((flatten))` so that they have as small a symbol footprint as possible. I have a plausible prototype of this in hand, but it would require additional work to make it something safe to land on master.
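For illustration, the flatten trick looks like this (the function name and body here are hypothetical):

```cpp
#include <Eigen/Core>

// Hypothetical example: flatten asks the compiler to inline the entire
// Eigen call tree into this definition, so the special TU exports as few
// out-of-line Eigen symbols as possible for objcopy to rename.
__attribute__((flatten))
void ScaleInPlace(double* data, int size, double factor) {
  Eigen::Map<Eigen::VectorXd> vec(data, size);
  vec *= factor;
}
```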
We would also need to figure out the Darwin / macOS work-alike of `objcopy`, and whether it has as good a set of options for renaming symbols (and their relocations!).
Once we do all of that, it would be safe to use BLAS for those special TUs.
The upshot is that I think we have a narrow path forward to use Eigen+SIMD and/or Eigen+BLAS in Drake files. We'd need to do it one file at a time, refactoring the files that matter into low-level helpers.
So the question is, are there any files where this investment would pay off? If people have benchmarks to share, that would be a good starting point.
As before (way upthread) -- developers can turn on SIMD and/or BLAS globally in their whole Drake build and make a benchmark comparison. That would be a best-case measurement of how far we could go. If that shows a big enough payoff, we can look at turning the flag on locally for just the one or two TUs that are the hot spot(s) in those benchmarks, and take it from there.
Wow, I just found clang-change-namespace at `/usr/bin/clang-change-namespace-14`. This might even be able to supplant our `vendor_cxx`?
\cc @rpoyner-tri FYI
I have forgotten why we can't use Eigen/SIMD for all of Drake. Is that because we need to be compatible with user code?
Yes, turning on any compiler SIMD option (`-march=...` or `-mavx`) that enables >= 32-byte packet alignment (AVX or greater) changes the memory layout as well as how malloc/free match up, so it has to be a global setting for the whole build, and therefore generally isn't available to us as library code.
(Users who build Drake from source for themselves are welcome to enable whatever flags they want. The limitation is on what we can ship in our official pre-compiled binary releases.)
@RussTedrake I've refined my build system prototype that safely compiles with different Eigen options to the point where I think it's going to be OK to use it.
The next step on this issue is to refactor the MLP benchmarking program to use googlebenchmark so that we can trust its results, and easily sweep over different operations and configurations of the MLP. Are you up for doing that, or could you recruit someone who can?
Thanks for pushing on this. I'll update the MLP benchmarking to use googlebench.
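For reference, a minimal sketch of such a harness (the workload here is a stand-in Eigen matmul, not Drake's actual MLP code):

```cpp
#include <Eigen/Core>
#include <benchmark/benchmark.h>

// Sketch only: sweep batch size as a benchmark argument; googlebenchmark
// handles warm-up, repetition, and statistics that `time` does not.
static void BM_ForwardPass(benchmark::State& state) {
  const int batch = static_cast<int>(state.range(0));
  const Eigen::MatrixXd W = Eigen::MatrixXd::Random(256, 256);
  const Eigen::VectorXd b = Eigen::VectorXd::Random(256);
  const Eigen::MatrixXd X = Eigen::MatrixXd::Random(256, batch);
  Eigen::MatrixXd Y(256, batch);
  for (auto _ : state) {
    Y.noalias() = W * X;   // One layer's Wx...
    Y.colwise() += b;      // ...plus b (no activation), as in the MLP.
    benchmark::DoNotOptimize(Y.data());
  }
}
BENCHMARK(BM_ForwardPass)->Arg(100)->Arg(1000)->Arg(10000);
BENCHMARK_MAIN();
```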
We should enable this for all of Drake if we can confirm with the Eigen developers that it is ABI compatible (will not cause ODR violations).
This was initially PR'd in #16683 with a local change to `multilayer_perceptron.cc`, but after discussing with @jwnimmer-tri we've decided to pull it out of that PR and either enable it for all of Drake or decide that we will not enable it.