mattip opened 4 months ago
I wonder what else we could do to tweak the dynamic kernels we ship. For instance, can we raise the minimum x86_64 target on Linux from PRESCOTT?
It would be worth summarizing what we are actually building in this repo, I think. It looks like we are using `DYNAMIC_ARCH=1 TARGET=PRESCOTT`, but it's not exactly clear to me from the OpenBLAS README what that does. I am guessing "all architectures from PRESCOTT up" - but if so, that seems a little excessive?
Other questions I'd have:
`TARGET=PRESCOTT` in combination with `DYNAMIC_ARCH` means "use compiler options for Prescott when compiling the common code (thread setup, interfaces, LAPACK)". `DYNAMIC_ARCH` on x86_64 covers a list of about 15 Intel and AMD CPUs unless you specify your own subset via `DYNAMIC_LIST`. I can't give an exact answer for the per-model overhead, but it is something like 50 BLAS kernel objects plus parameters and function-pointer table setup. Any non-included target gets supported by the next best available - again no exact figures, but I'd guess at most 10 percent performance loss, unless the fallback incurs restrictions like going from AVX to SSE.
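The "next best available" fallback can be pictured as a small selection routine. Here is a minimal sketch in Python, with a hypothetical ordered target table (illustrative only - the real OpenBLAS dispatcher keys off detected CPU models, not feature sets):

```python
# Hypothetical, simplified model of DYNAMIC_ARCH core selection:
# targets are ordered oldest -> newest; the runtime picks the newest
# *built* core whose required ISA features the CPU supports.

# Ordered subset of x86_64 cores, each with the features it needs
# (abbreviated; not OpenBLAS's actual table).
TARGETS = [
    ("PRESCOTT", {"sse3"}),
    ("NEHALEM",  {"sse3", "sse4_2"}),
    ("HASWELL",  {"sse3", "sse4_2", "avx2"}),
    ("SKYLAKEX", {"sse3", "sse4_2", "avx2", "avx512f"}),
]

def select_core(cpu_features, built_targets):
    """Return the newest built target the CPU can run (fallback = next best)."""
    chosen = None
    for name, required in TARGETS:
        if name in built_targets and required <= cpu_features:
            chosen = name
    return chosen

# An AVX2-capable CPU without AVX-512, running a build whose DYNAMIC_LIST
# omitted HASWELL: it falls all the way back to the SSE3 Prescott kernels,
# which is exactly the "avx to sse" worst case mentioned above.
print(select_core({"sse3", "sse4_2", "avx2"}, {"PRESCOTT", "SKYLAKEX"}))
# -> PRESCOTT
```

With `HASWELL` included in the built set, the same CPU would get the AVX2 kernels instead, which is why the choice of `DYNAMIC_LIST` entries matters.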
We discussed this a bit in the NumPy optimization team meeting yesterday. It seemed reasonable to everyone to build for fewer target architectures. When selecting the ones to ship, the expectation was that Haswell (first AVX2) and SkylakeX (first AVX512) would be important.
With `TARGET=PRESCOTT` (current default build flags on Linux):

```sh
$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=1 USE_OPENMP=0 NUM_THREADS=64 \
  OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 \
  TARGET=PRESCOTT
$ ls -lh libscipy_openblas.so
35M
```
With `DYNAMIC_ARCH` disabled:

```sh
$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=0 USE_OPENMP=0 NUM_THREADS=64 \
  OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1
$ ls -lh libscipy_openblas.so
15M
```
With a custom selection (Prescott baseline, plus 3 newer architectures):

```sh
$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=1 USE_OPENMP=0 NUM_THREADS=64 \
  OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 \
  TARGET=PRESCOTT DYNAMIC_LIST="HASWELL SKYLAKEX SAPPHIRERAPIDS"
$ ls -lh libscipy_openblas.so
21M
```
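Those three measurements are enough to estimate the per-architecture cost. A quick back-of-the-envelope (assuming the default list has ~14 cores beyond the baseline, and that the custom build adds 4 kernel sets over the single-target build - both assumptions, since the exact core counts aren't shown above):

```python
# Measured shared-library sizes from the three builds above, in MB
size_full_dynamic = 35  # DYNAMIC_ARCH=1, default target list (~15 cores)
size_single = 15        # DYNAMIC_ARCH=0, one core
size_custom = 21        # PRESCOTT + HASWELL + SKYLAKEX + SAPPHIRERAPIDS

# Incremental cost per extra architecture, derived two ways
per_arch_default = (size_full_dynamic - size_single) / 14
per_arch_custom = (size_custom - size_single) / 4

print(f"{per_arch_default:.1f} MB, {per_arch_custom:.1f} MB per architecture")
# -> 1.4 MB, 1.5 MB per architecture
```

Both estimates land near 1.5 MB, consistent with the figure quoted below.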
Note: see Makefile.system for details on how DYNAMIC_CORE/DYNAMIC_LIST select the architectures to build.
So it's about 1.5 MB of extra shared-library size per architecture built. The compression factor for the shared library is about 3.5x, meaning that the current contribution of libscipy_openblas.so to x86-64 numpy/scipy wheels is ~9.5 MB, and if we went from 15 to, say, 5 architectures, we'd reduce wheel sizes by about 4 MB.
Given the current traffic for numpy/scipy on PyPI, such a 4 MB reduction would save about 17 PB/year (petabytes - how often do you get to use those :)) of download volume. I think we should make a selection of architectures based on what we know, then do some performance testing, and ship the reduced-size wheels unless we find a serious performance concern.
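The 17 PB/year figure checks out under an assumed download rate. Here is the arithmetic, where the combined numpy+scipy download count (~4.25 billion/year) is my own rough assumption, not a figure from the thread:

```python
# Back-of-the-envelope for the download-volume saving (all figures rough).
wheel_saving_mb = 4          # compressed size reduction per wheel, from above
downloads_per_year = 4.25e9  # assumed combined numpy+scipy PyPI downloads

saved_bytes = wheel_saving_mb * 1e6 * downloads_per_year
print(f"{saved_bytes / 1e15:.0f} PB/year")
# -> 17 PB/year
```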
Makes sense to me. We should keep the PRESCOTT target for low-end processors, and add a few others covering mid-range and high-end processors. The aarch64 wheels, with only a few kernels shipped, are much smaller than the x86_64 ones.
Something I only recently learned about is psABI levels (https://gitlab.com/x86-psABIs/x86-64-ABI). This is what Linux distros have been using recently to select and deploy different optimization levels. The levels are:

- x86-64 (baseline): up to SSE2
- x86-64-v2: adds SSE3, SSSE3, SSE4.1/4.2, POPCNT, CMPXCHG16B, LAHF/SAHF
- x86-64-v3: adds AVX, AVX2, FMA, BMI1/BMI2, F16C, LZCNT, MOVBE, XSAVE
- x86-64-v4: adds AVX-512F/BW/CD/DQ/VL
For NumPy, SSE3 has been part of the baseline for quite a while now, so we're kinda halfway to x86-64-v2. v2 is still "very old machines", v3 probably roughly lines up with Haswell, and v4 with SkylakeX.
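As an illustration, classifying a CPU's feature set into the highest psABI level it satisfies might look like this (feature lists abbreviated - the psABI document specifies more requirements per level):

```python
# Abbreviated psABI micro-architecture levels (representative features only).
PSABI_LEVELS = [
    ("x86-64",    {"sse2"}),
    ("x86-64-v2", {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt"}),
    ("x86-64-v3", {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt",
                   "avx", "avx2", "fma", "bmi2"}),
    ("x86-64-v4", {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt",
                   "avx", "avx2", "fma", "bmi2",
                   "avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]

def psabi_level(cpu_features):
    """Return the highest psABI level whose required features are all present."""
    level = None
    for name, required in PSABI_LEVELS:
        if required <= cpu_features:
            level = name
    return level

# A Haswell-class CPU satisfies v3 but not v4:
haswell = {"sse2", "sse3", "ssse3", "sse4_1", "sse4_2", "popcnt",
           "avx", "avx2", "fma", "bmi2"}
print(psabi_level(haswell))
# -> x86-64-v3
```

This is the same kind of check glibc-hwcaps does when picking a library directory such as `glibc-hwcaps/x86-64-v3`.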
For NumPy we pick different levels per dispatched set of functions, making the performance/size tradeoff per set based on benchmarking or knowledge of which instructions are used:
```
Generating multi-targets for "argfunc.dispatch.h"
Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_argsort.dispatch.h"
Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort.dispatch.h"
Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort_16bit.dispatch.h"
Enabled targets: AVX512_SPR, AVX512_ICL
Generating multi-targets for "highway_qsort.dispatch.h"
Enabled targets:
Generating multi-targets for "highway_qsort_16bit.dispatch.h"
Enabled targets:
Generating multi-targets for "loops_arithm_fp.dispatch.h"
Enabled targets: FMA3__AVX2, baseline
Generating multi-targets for "loops_arithmetic.dispatch.h"
Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE41, baseline
...
Generating multi-targets for "_simd.dispatch.h"
Enabled targets: SSE42, AVX2, FMA3, FMA3__AVX2, AVX512F, AVX512_SKX, baseline
```
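The runtime side of this works per dispatch unit: for each source, the most specialized enabled target the CPU supports wins, with baseline as the final fallback. A simplified sketch (the feature mapping is hypothetical, not NumPy's actual dispatcher):

```python
# Rough model of per-source runtime dispatch: enabled targets are listed
# from most to least specialized; the first one the CPU supports is chosen.
TARGET_FEATURES = {
    "AVX512_SKX": {"avx512f", "avx512vl", "avx2", "sse4_2", "sse3"},
    "AVX512F":    {"avx512f", "avx2", "sse4_2", "sse3"},
    "AVX2":       {"avx2", "sse4_2", "sse3"},
    "SSE42":      {"sse4_2", "sse3"},
    "SSE41":      {"sse4_1", "sse3"},
    "baseline":   set(),  # the SSE3 baseline, assumed always available
}

def dispatch(enabled_targets, cpu_features):
    """Pick the first enabled target whose features the CPU provides."""
    for target in enabled_targets:
        if TARGET_FEATURES[target] <= cpu_features:
            return target
    return None

# The loops_arithmetic unit above, on an AVX2 machine without AVX-512:
cpu = {"sse3", "sse4_1", "sse4_2", "avx2"}
print(dispatch(["AVX512_SKX", "AVX512F", "AVX2", "SSE41", "baseline"], cpu))
# -> AVX2
```

The key contrast with OpenBLAS's `DYNAMIC_ARCH` is granularity: OpenBLAS selects one core for the whole library, while NumPy makes this choice separately for each dispatch unit.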
Baseline is SSE3 as the highest level, which matches Prescott.
We should hopefully have some more benchmarking results to decide on this soon.
Is there any resolution on a recommended set of DYNAMIC_LIST=??? that we can add for x86_64?
As @martin-frbg says in a comment elsewhere