mattip opened 4 months ago
I wonder what else we could do to tweak the dynamic kernels we ship. For instance, can we raise the minimum x86_64 target on Linux from PRESCOTT?
It would be worth summarizing what we are actually building in this repo, I think. It looks like we are using `DYNAMIC_ARCH=1 TARGET=PRESCOTT`, but it's not exactly clear to me from the OpenBLAS README what that does. I am guessing "all architectures from PRESCOTT up" - but if so, that seems a little excessive?
Other questions I'd have:
`TARGET=PRESCOTT` in combination with `DYNAMIC_ARCH` means "use compiler options for Prescott when compiling the common code (thread setup, interfaces, LAPACK)". `DYNAMIC_ARCH` on x86_64 covers a list of about 15 Intel and AMD CPUs unless you specify your own subset via `DYNAMIC_LIST`. I can't give an exact answer for the per-model overhead, but it is something like 50 BLAS kernel objects plus parameters and function-pointer table setup. Any non-included target gets supported by the next best available - again no exact figures, but I'd guess at most 10 percent performance loss, unless the fallback incurs restrictions like going from AVX to SSE.
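The "next best available" fallback can be pictured as a small selection routine. Here is a minimal sketch in Python, with a hypothetical ordered target table (illustrative only - the real OpenBLAS dispatcher keys off detected CPU models, not feature sets):

```python
# Hypothetical, simplified model of DYNAMIC_ARCH core selection:
# targets are ordered oldest -> newest; the runtime picks the newest
# *built* core whose required ISA features the CPU supports.

# Ordered subset of x86_64 cores, each with the features it needs
# (abbreviated; not OpenBLAS's actual table).
TARGETS = [
    ("PRESCOTT", {"sse3"}),
    ("NEHALEM",  {"sse3", "sse4_2"}),
    ("HASWELL",  {"sse3", "sse4_2", "avx2"}),
    ("SKYLAKEX", {"sse3", "sse4_2", "avx2", "avx512f"}),
]

def select_core(cpu_features, built_targets):
    """Return the newest built target the CPU can run (fallback = next best)."""
    chosen = None
    for name, required in TARGETS:
        if name in built_targets and required <= cpu_features:
            chosen = name
    return chosen

# An AVX2-capable CPU without AVX-512, running a build whose DYNAMIC_LIST
# omitted HASWELL: it falls all the way back to the SSE3 Prescott kernels,
# which is exactly the "avx to sse" worst case mentioned above.
print(select_core({"sse3", "sse4_2", "avx2"}, {"PRESCOTT", "SKYLAKEX"}))
# -> PRESCOTT
```

With `HASWELL` included in the built set, the same CPU would get the AVX2 kernels instead, which is why the choice of `DYNAMIC_LIST` entries matters.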
We discussed this a bit in the NumPy optimization team meeting yesterday. It seemed reasonable to everyone to build for fewer target architectures. When selecting the ones to ship, the expectation was that Haswell (first AVX2) and SkylakeX (first AVX512) would be important.
With `TARGET=PRESCOTT` (current default build flags on Linux):

```sh
$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=1 USE_OPENMP=0 NUM_THREADS=64 \
  OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 \
  TARGET=PRESCOTT
$ ls -lh libscipy_openblas.so
35M
```
With `DYNAMIC_ARCH` disabled:

```sh
$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=0 USE_OPENMP=0 NUM_THREADS=64 \
  OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1
$ ls -lh libscipy_openblas.so
15M
```
With a custom selection (Prescott baseline, plus 3 newer architectures):

```sh
$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=1 USE_OPENMP=0 NUM_THREADS=64 \
  OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 \
  TARGET=PRESCOTT DYNAMIC_LIST="HASWELL SKYLAKEX SAPPHIRERAPIDS"
$ ls -lh libscipy_openblas.so
21M
```
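Those three measurements are enough to estimate the per-architecture cost. A quick back-of-the-envelope (assuming the default list has ~14 cores beyond the baseline, and that the custom build adds 4 kernel sets over the single-target build - both assumptions, since the exact core counts aren't shown above):

```python
# Measured shared-library sizes from the three builds above, in MB
size_full_dynamic = 35  # DYNAMIC_ARCH=1, default target list (~15 cores)
size_single = 15        # DYNAMIC_ARCH=0, one core
size_custom = 21        # PRESCOTT + HASWELL + SKYLAKEX + SAPPHIRERAPIDS

# Incremental cost per extra architecture, derived two ways
per_arch_default = (size_full_dynamic - size_single) / 14
per_arch_custom = (size_custom - size_single) / 4

print(f"{per_arch_default:.1f} MB, {per_arch_custom:.1f} MB per architecture")
# -> 1.4 MB, 1.5 MB per architecture
```

Both estimates land near 1.5 MB, consistent with the figure quoted below.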
Note: see Makefile.system for details on how DYNAMIC_CORE/DYNAMIC_LIST select the architectures to build.
So it's about 1.5 MB of extra shared-library size per architecture built. The compression factor for the shared library is about 3.5x, meaning that the current contribution of libscipy_openblas.so to x86-64 numpy/scipy wheels is ~9.5 MB, and if we went from 15 to, say, 5 architectures, we'd reduce wheel sizes by about 4 MB.
Given the current traffic for numpy/scipy on PyPI, such a 4 MB reduction would save about 17 PB/year (petabytes - how often do you get to use those :)) of download volume. I think we should make a selection of architectures based on what we know, then do some performance testing, and ship the reduced-size wheels unless we find a serious performance concern.
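The 17 PB/year figure checks out under an assumed download rate. Here is the arithmetic, where the combined numpy+scipy download count (~4.25 billion/year) is my own rough assumption, not a figure from the thread:

```python
# Back-of-the-envelope for the download-volume saving (all figures rough).
wheel_saving_mb = 4          # compressed size reduction per wheel, from above
downloads_per_year = 4.25e9  # assumed combined numpy+scipy PyPI downloads

saved_bytes = wheel_saving_mb * 1e6 * downloads_per_year
print(f"{saved_bytes / 1e15:.0f} PB/year")
# -> 17 PB/year
```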
Makes sense to me. We should keep the PRESCOTT target for low-end processors, and add a few others covering mid-range and high-end processors. The aarch64 wheels, with only a few kernels shipped, are much smaller than the x86_64 ones.
Something I only recently learned about is psABI levels (https://gitlab.com/x86-psABIs/x86-64-ABI). This is what Linux distros have been using recently to select and deploy different optimization levels. The levels are:

- x86-64 (baseline): up to SSE2
- x86-64-v2: adds SSE3, SSSE3, SSE4.1/4.2, POPCNT, CMPXCHG16B, LAHF/SAHF
- x86-64-v3: adds AVX, AVX2, FMA, BMI1/BMI2, F16C, LZCNT, MOVBE, XSAVE
- x86-64-v4: adds AVX-512F/BW/CD/DQ/VL
For NumPy, SSE3 has been part of the baseline for quite a while now, so we're kinda halfway to x86-64-v2. v2 is still "very old machines", v3 probably roughly lines up with Haswell, and v4 with SkylakeX.
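As an illustration, classifying a CPU's feature set into the highest psABI level it satisfies might look like this (feature lists abbreviated - the psABI document specifies more requirements per level):

```python
# Abbreviated psABI micro-architecture levels (representative features only).
PSABI_LEVELS = [
    ("x86-64",    {"sse2"}),
    ("x86-64-v2", {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt"}),
    ("x86-64-v3", {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt",
                   "avx", "avx2", "fma", "bmi2"}),
    ("x86-64-v4", {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt",
                   "avx", "avx2", "fma", "bmi2",
                   "avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]

def psabi_level(cpu_features):
    """Return the highest psABI level whose required features are all present."""
    level = None
    for name, required in PSABI_LEVELS:
        if required <= cpu_features:
            level = name
    return level

# A Haswell-class CPU satisfies v3 but not v4:
haswell = {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt",
           "avx", "avx2", "fma", "bmi2"}
print(psabi_level(haswell))
# -> x86-64-v3
```

This is the same kind of check glibc-hwcaps does when picking a library directory such as `glibc-hwcaps/x86-64-v3`.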
For NumPy we pick different levels per dispatched set of functions, making the performance/size tradeoff per set based on benchmarking or knowledge of which instructions are used:
```
Generating multi-targets for "argfunc.dispatch.h"
Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_argsort.dispatch.h"
Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort.dispatch.h"
Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort_16bit.dispatch.h"
Enabled targets: AVX512_SPR, AVX512_ICL
Generating multi-targets for "highway_qsort.dispatch.h"
Enabled targets:
Generating multi-targets for "highway_qsort_16bit.dispatch.h"
Enabled targets:
Generating multi-targets for "loops_arithm_fp.dispatch.h"
Enabled targets: FMA3__AVX2, baseline
Generating multi-targets for "loops_arithmetic.dispatch.h"
Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE41, baseline
...
Generating multi-targets for "_simd.dispatch.h"
Enabled targets: SSE42, AVX2, FMA3, FMA3__AVX2, AVX512F, AVX512_SKX, baseline
```
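The runtime side of this works per dispatch unit: for each source, the most specialized enabled target the CPU supports wins, with baseline as the final fallback. A simplified sketch (the feature mapping is hypothetical, not NumPy's actual dispatcher):

```python
# Rough model of per-source runtime dispatch: enabled targets are listed
# from most to least specialized; the first one the CPU supports is chosen.
TARGET_FEATURES = {
    "AVX512_SKX": {"avx512f", "avx512vl", "avx2", "sse4_2", "sse3"},
    "AVX512F":    {"avx512f", "avx2", "sse4_2", "sse3"},
    "AVX2":       {"avx2", "sse4_2", "sse3"},
    "SSE42":      {"sse4_2", "sse3"},
    "SSE41":      {"sse4_1", "sse3"},
    "baseline":   set(),  # the SSE3 baseline, assumed always available
}

def dispatch(enabled_targets, cpu_features):
    """Pick the first enabled target whose features the CPU provides."""
    for target in enabled_targets:
        if TARGET_FEATURES[target] <= cpu_features:
            return target
    return None

# The loops_arithmetic unit above, on an AVX2 machine without AVX-512:
cpu = {"sse3", "sse4_1", "sse4_2", "avx2"}
print(dispatch(["AVX512_SKX", "AVX512F", "AVX2", "SSE41", "baseline"], cpu))
# -> AVX2
```

The key contrast with OpenBLAS's `DYNAMIC_ARCH` is granularity: OpenBLAS selects one core for the whole library, while NumPy makes this choice separately for each dispatch unit.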
Baseline is SSE3 as the highest level, which matches Prescott.
We should hopefully have some more benchmarking results to decide on this soon.
Is there any resolution on a recommended set of DYNAMIC_LIST=??? that we can add for x86_64?
As @martin-frbg says in a comment elsewhere