OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Propose optional CPUARCH target AlderLakeAVX512, as alias of SapphireRapids #3490

Closed FCLC closed 2 years ago

FCLC commented 2 years ago

Context

It's been reasonably well documented that on Alder Lake, if, and only if, the Gracemont E-cores are disabled, it becomes possible to enable all of the AVX512 instructions available on the Golden Cove P-cores, the same cores used in Sapphire Rapids.

For relevant workloads, many of which OpenBLAS has AVX512-accelerated code paths for, this can lead to a significant performance uplift.

As of release 0.3.19, even if the E-cores are disabled and AVX512 is available, the build system will not make use of it automatically.

Current workaround

The user has the option of either passing:

CFLAGS='-O3 -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16'
CXXFLAGS='-O3 -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16'
FFLAGS='-O3 -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16'

or alternatively passing

CFLAGS='-O3 -march=native -mavx512f -mavx512dq -mavx512ifma -mavx512cd -mavx512bw -mavx512vl -mavx512bf16 -mavx512vbmi -mavx512vbmi2 -mavx512vnni -mavx512bitalg -mavx512vpopcntdq -mavx512vp2intersect'
CXXFLAGS='-O3 -march=native -mavx512f -mavx512dq -mavx512ifma -mavx512cd -mavx512bw -mavx512vl -mavx512bf16 -mavx512vbmi -mavx512vbmi2 -mavx512vnni -mavx512bitalg -mavx512vpopcntdq -mavx512vp2intersect'
FFLAGS='-O3 -march=native -mavx512f -mavx512dq -mavx512ifma -mavx512cd -mavx512bw -mavx512vl -mavx512bf16 -mavx512vbmi -mavx512vbmi2 -mavx512vnni -mavx512bitalg -mavx512vpopcntdq -mavx512vp2intersect'

NOTE: It is preferred to use the -march=sapphirerapids option, as GCC, Clang, the LLVM-based ICX and ICC will then preserve the relevant AVX512 cost functions for auto-vectorization, instead of generic costs meant as a catch-all across all supported architectures.

NOTE: We must disable the AMX tile and AMX instructions, as that additional hardware was not built into Alder Lake.

Proposed solution/Request

I'd like to propose that, if the user supplies an architecture flag AlderLakeAVX512, it be aliased to enable all features of Sapphire Rapids that do not explicitly require the AMX tile.

Sources on AVX512 support on alder lake:

Anandtech article

Phoronix article and OpenBenchmarking results of disabling 8 E-cores to enable AVX512 on 8 P-cores

Testing of AVX512 per-instruction cost and pipeline on Alder Lake

Discussion of AVX512 support and performance

martin-frbg commented 2 years ago

You should be able to achieve just that by building with TARGET=SAPPHIRERAPIDS already. The autodetection code will default to Haswell as it does not "know" if the E-cores are disabled (and that is going to stay that way)

FCLC commented 2 years ago

TARGET=SAPPHIRERAPIDS

Unless I'm mistaken, this would mean that AMX instructions would also be compiled into the binary; these are not supported on Alder Lake and would cause a crash if/when any routines call them.

Specifically, any calls that can be accelerated by int8 or bf16 are expected to be offloaded to the AMX tile instead of being done in AVX512, because of the much higher throughput of the AMX matrix instructions vs the AVX vector instructions.

see this slide from intel for example https://pbs.twimg.com/media/E9LHMgwWQAUlrdD?format=jpg&name=large

or alternatively this preproduction intel example video (length of video ~1:05 minutes total) https://www.youtube.com/watch?v=sFzg4zmIlc8

martin-frbg commented 2 years ago

Ah sorry, I replied too fast - should have written "TARGET=SKYLAKEX" (though an OpenBLAS built for SapphireRapids would probably work just as well, provided one does not actually throw any bfloat16 data at it). EDITED to add: ... but I do not have access to Alder Lake hardware, so no idea how much performance would suffer from having the compiler use cost functions for the older SkylakeX architecture

FCLC commented 2 years ago

"TARGET=SKYLAKEX"

Yeah, I'd thought of doing this as a workaround, but unfortunately this would also produce a sub-optimal build, for 2 reasons:

  1. it would not take advantage of the additional AVX512 instructions added over Skylake
  2. It would use the wrong cost functions (saw your edit; including this for the sake of documenting for others in this thread, but there's a little bit more to it, see below)

2a. This is also the reason we don't want to use -march=native or -march=alderlake plus -mavx[list of instructions to include]: the compiler will then select cost functions that balance between both the Golden Cove P-cores and the Gracemont E-cores, leaving performance on the table due to things like different ALUs, cache sizes, having SMT, etc.

2b. We'd also end up not using all available AVX instructions, since SKYLAKEX supports significantly fewer.

SkylakeX:

Intel Skylake Server CPU with 64-bit extensions, MOVBE, MMX,
SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, PKU,
AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI,
BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT,
XSAVEC, XSAVES, AVX512F, CLWB, AVX512VL, AVX512BW,
AVX512DQ and AVX512CD instruction set support.

SapphireRapids:

Intel sapphirerapids CPU with 64-bit extensions, MOVBE, MMX,
SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, PKU,
AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI,
BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT,
XSAVEC, XSAVES, AVX512F, CLWB, AVX512VL, AVX512BW,
AVX512DQ, AVX512CD, AVX512VNNI, AVX512BF16,
MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, EN-
QCMD, CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE,
TSXLDTRK, UINTR, AMX-BF16, AMX-TILE, AMX-INT8 and
AVX-VNNI instruction set support.

of which the new instructions relevant for Alder Lake are:

AVX512VNNI, AVX512BF16,
MOVDIRI, MOVDIR64B, AVX512VP2INTERSECT, EN-
QCMD, CLDEMOTE, PTWRITE, WAITPKG, SERIALIZE,
TSXLDTRK, UINTR, and
AVX-VNNI 

Which means we do have BF16 support in AVX512, the same way as Sapphire Rapids, but don't have AMX int8 at all.

Edit: source for above: GCC-11.2 manual https://gcc.gnu.org/onlinedocs/gcc-11.2.0/gcc.pdf

Edit2: tangentially related - as part of testing and profiling different suites I created a script to detect and generate compiler flags for any supported AVX512 instructions detected on a host system. It can be piped into other tools to force the use of instructions, but doesn't include the cost file, as observed above. It's a workaround, but it does help. It can be found in my repository here: https://github.com/FCLC/Choosing-a-compiler-performance-testing-GCC_ICC_ICPX_NVCC_CLANG_HIP/blob/main/Binary_tree/detect_avx.sh

martin-frbg commented 2 years ago

Hmmm. As you appear to have the hardware, can you confirm that switching AVX512 on/off by whatever means gets accurately reflected in the capabilities it reports e.g. through /proc/cpuinfo (or more precisely through the cpuid bits that OpenBLAS already knows to check via the support_avx512() and support_avx512_bf16() helper functions in cpuid_x86.c (build-time) and driver/others/dynamic.c (run-time in DYNAMIC_ARCH builds) ? And is it safe to assume by now that the temporarily orphaned AVX512 hardware is producing correct results ?

brada4 commented 2 years ago

seems to be big-small CPUs where half of cores have no avx-512.

martin-frbg commented 2 years ago

seems to be big-small CPUs where half of cores have no avx-512.

Err, yes, but the point is that AVX512 appears to be completely disabled by default, until one disables the little guys in the BIOS - at which point it transforms into something in between a SkylakeX and a SapphireRapids cpu. If this is accurately reflected in the cpuid bits it would simplify checking for this particular (non-default) operating mode, and probably warrant a dedicated TARGET.

brada4 commented 2 years ago

We would need at least /proc/cpuinfo from both modes? If any of stepping codes change, it is already instrumented, and if CPUID bits change, then too, the only danger is asymmetric CPUID-s on same system?

FCLC commented 2 years ago

Hmmm. As you appear to have the hardware, can you confirm that switching AVX512 on/off by whatever means gets accurately reflected in the capabilities it reports e.g. through /proc/cpuinfo (or more precisely through the cpuid bits that OpenBLAS already knows to check via the support_avx512() and support_avx512_bf16() helper functions in cpuid_x86.c (build-time) and driver/others/dynamic.c (run-time in DYNAMIC_ARCH builds) ? And is it safe to assume by now that the temporarily orphaned AVX512 hardware is producing correct results ?

Yes, I can confirm that results are correct (tested against known good results on each core over the course of multiple days, with no results that deviated) and that, after a reboot and enabling or disabling the instructions in the BIOS, cpuid reports the relevant instructions as present or absent, depending on configuration.

The sources in the original post have some screenshots of the CPUID bits being changed for AVX512 enablement, but for convenience here's a system output:

processor   : 15
vendor_id   : GenuineIntel
cpu family  : 6
model       : 151
model name  : 12th Gen Intel(R) Core(TM) i7-12700K
stepping    : 2
microcode   : 0x15
cpu MHz     : 3600.000
cache size  : 25600 KB
physical id : 0
siblings    : 16
core id     : 28
cpu cores   : 8
apicid      : 57
initial apicid  : 57
fpu     : yes
fpu_exception   : yes
cpuid level : 32
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni avx512_bf16 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear serialize pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
vmx flags   : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling usr_wait_pause
bugs        : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips    : 7219.20
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

The procedure for enabling AVX512 support varies slightly by mainboard vendor, but Asus, MSI, Gigabyte and ASRock have all committed (and delivered) BIOS support for enabling AVX512 once the E-cores are all disabled. This is exclusive to the Z690 chipset as of now.

I'll have to pull the latest source for a clean build environment, but once completed I'll verify that OpenBLAS can properly detect the instructions at both build and run time (how this then affects build flags and compiler costs would then need to be profiled, I think?)

seems to be big-small CPUs where half of cores have no avx-512.

Correct, it's a hybrid-architecture x86 CPU with different ISA support between the two core types. Still awaiting confirmation from Intel, but it also seems that only the K-class chips will have the ability to enable AVX-512.

Functionally that means that we have 3 chips available:

8-core i9 12900K with 30MB of L3
8-core i7 12700K with 25MB of L3
6-core i5 12600K with 20MB of L3

The relevance of the cache difference is that the L3 is shared and remains enabled even when the E-cores are disabled.

between a SkylakeX and a SapphireRapids cpu.

This is correct, though per-core performance would actually partially outpace that of the Sapphire Rapids version, due to a higher available power budget per core (and therefore frequency) and slightly increased L3 per core, since the disabled E-cores still contribute their shared L3.

Edit:

We would need at least /proc/cpuinfo from both modes? If any of stepping codes change, it is already instrumented, and if CPUID bits change, then too, the only danger is asymmetric CPUID-s on same system?

Would you like me to provide both?

FCLC commented 2 years ago

Find attached both cpuinfo outputs, produced via cat /proc/cpuinfo >> [file].txt: cpu_info_ecores_enabled.txt cpu_info_avx512.txt

martin-frbg commented 2 years ago

very quick and dirty patch (Sapphire Rapids is currently treated like Cooper Lake and I just noticed autodetection does not yet have it as a separate type):

--- cpuid_x86.c 2021-12-19 14:07:28.243783197 +0100
+++ cpuid_x86.cnew      2021-12-20 22:50:44.629040294 +0100
@@ -1495,6 +1495,10 @@
         switch (model) {
         case 7: // Alder Lake desktop
         case 10: // Alder Lake mobile
+         if(support_avx512_bf16())
+            return CPUTYPE_COOPERLAKE;
+          if(support_avx512())
+            return CPUTYPE_SKYLAKEX;
           if(support_avx2())
             return CPUTYPE_HASWELL;
           if(support_avx())
wjc404 commented 2 years ago

How about the peak throughput of AVX512 instructions comparing to their AVX2 counterparts on Alder Lake? On some architectures, the dedicated avx512 unit on port 5 is disabled, leaving the theoretical GFLOPS of AVX512 the same as that of AVX2.

brada4 commented 2 years ago

Also it is documented that AVX-512 anywhere switches off turbo boost, so 1-core performance also should be compared, going back even to sandybridge (AVX1 only)

FCLC commented 2 years ago

How about the peak throughput of AVX512 instructions comparing to their AVX2 counterparts on Alder Lake? On some architectures, the dedicated avx512 unit on port 5 is disabled, leaving the theoretical GFLOPS of AVX512 the same as that of AVX2.

I've gotten as high as 5x the throughput of AVX2 on Alder Lake for code that is very AVX512-optimized, such as particle simulations. In part this is due to how, when the E-cores are disabled (a requirement to use AVX512), the ring bus then clocks relative to the P-core frequency. Typically ring frequency = (E-core multiplier - 3) * base clock (aka BCLK). For the i9 12900K:

E-cores on: max ring bus frequency of 3.6GHz, which also drives L3 cache for all cores, memory access etc.
E-cores off: max ring bus frequency of 4.9GHz, which also drives L3 cache for all cores, memory access etc. (you also get to keep the L3 of the disabled E-cores, meaning your per-core L3 cache size goes up)

The per-instruction, per-clock throughput with AVX512 enabled has been formally measured and can be found here: https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel/GenuineIntel0090672_AlderLake_BC_AVX512_InstLatX64.txt As compared to the standard configuration without AVX512 and with the 8 E-cores enabled: https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel/GenuineIntel0090672_AlderLake_BC_InstLatX64.txt

Also it is documented that AVX-512 anywhere switches off turbo boost, so 1-core performance also should be compared, going back even to sandybridge (AVX1 only)

This isn't quite right, it's more that it makes turbo boost less effective.

This is dependent on generation, but I think you're referring to the AVX offset. Typically a core that is executing AVX code and using the YMM or ZMM registers generates more heat/draws more power. As a way of balancing the power budget, Intel applies a net offset to the current turbo boost frequency. This applies to Turbo Boost 2.0 and 3.0.

i.e. if a chip has a base of 3.5GHz, is within temperature and power limits, and a load is applied, it will boost to, say, (base multiplier + 11) * base clock = 4.6GHz.

Turbo Boost 2.0 behavior (before Alder Lake): as it executes, let's say the program then calls AVX2 code using the YMM registers (XMM is 128-bit, YMM is 256-bit). This will typically offset the current clock by -1 on the multiplier, or (35 + 11 (turbo boost) - 1 (AVX2 offset)) * base clock = 4.5GHz.

Let's say that same program then executes AVX512 code, which uses the ZMM registers. Typically AVX512 carries either a -2 or -3 multiplier offset; as above, we end up at, say, 4.3GHz.

Now, if we then approach or exceed the power/temperature limits, turbo boost will back off until it stabilizes. In workstations with proper cooling this will be near the maximum frequency minus 1-2 multiplier steps, so we end up with a sustained AVX512 frequency of 4.1GHz.

Alder Lake is a little more interesting, but suffice to say, in my testing so far I've noticed that AVX2 has no offset from the standard base multiplier, and AVX512 does not directly apply an offset; rather, it informs the internal frequency governor to be more "careful" with maxing out clocks. This means that, with a proper cooling setup, I can happily run 8 cores of AVX512 code on the i7 12700K at a sustained 5GHz for a tested 24+ hours.

Note: AVX2 and AVX1 are pretty much the same in terms of actual usage/performance; you could easily refer to AVX2 as AVX1.1 given how small the differences are.

Note: AVX512, on the other hand, is a slight pain to deal with from a low-level/asm point of view. AVX512 is not a single instruction set but rather a family of instruction sets that only guarantees a "common" core of instructions; vendors can choose to implement other parts of the AVX512 family as they see fit.

The AVX512 Wikipedia page has a quite good chart showing what family of instructions are supported on what hardware

https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

wjc404 commented 2 years ago

Thank you for providing the links. It looks like the Alder Lake processor is still in half-throughput avx512 design (from comparing the throughput data of avx512/avx2 vfmadd231ps and vfmadd231pd), so probably no significant performance gain will be observed in SGEMM/DGEMM of large matrices when enabling avx512...

brada4 commented 2 years ago

This isn't quite right, it's more that it makes turbo boost less effective.

Nope, Haswell clocks down to base frequency when AVX2 is engaged on any core; the same applies to that modern Lake with AVX-512. In a desktop/laptop context that makes the previous generation sometimes slightly better.

FCLC commented 2 years ago

avx512/avx2 vfmadd231ps and vfmadd231pd)

For those specific instructions you're right, the gain on the instruction itself is not massive, but we do save the 4-clock fetch latency from L1:

2447 AVX512VL :{EVEX} VFMADD231PS xmm, xmm, xmm L: 1.26ns= 4.0c T: 0.16ns= 0.50c
2448 AVX512VL :{EVEX} VFMADD231PS ymm, ymm, ymm L: 1.26ns= 4.0c T: 0.16ns= 0.50c
2449 AVX512F :VFMADD231PS zmm, zmm, zmm L: 1.26ns= 4.0c T: 0.31ns= 1.00c

for 2048 bits

AVX1 (XMM, 128-bit registers): 16*(4.0c+0.5c) -> 72c or ~22.32ns
AVX2 (YMM, 256-bit registers): 8*(4.0c+0.5c) -> 36c or ~11.16ns
AVX512 (ZMM, 512-bit registers): 4*(4.0c+1.0c) -> 20c or ~6.2ns

Granted, it won't be quite this extreme because of prefetching, but you do still end up making more efficient use of cache space and memory with AVX512; the latency penalty is what dominates on Alder Lake.

Dug through my archive, here's some actual data on performance:

note: compiled using -march=native on GCC 10.3, which does not have a target for Alder Lake or Sapphire Rapids.

The cost functions were therefore not quite right (as discussed/mentioned here https://github.com/xianyi/OpenBLAS/issues/3490#issuecomment-998015564 )

note: GCC 12 and Clang 14 are showing promising results as of now, and the "optimal" use case is to declare the architecture as -march=sapphirerapids -mno-amx-tile

wjc404 commented 2 years ago

You can try testing 1-thread DGEMM of OpenBLAS with various problem sizes. I suspect that the speedup will be significant for MNK<=1000000 (direct matmul kernels for SkylakeX, not yet for Haswell) but negligible for MNK>=100000000 when avx512 is enabled :)

FCLC commented 2 years ago

sure, sounds fine. Unless I'm mistaken in my interpretation of the above, if so please let me know!

brada4 commented 2 years ago

GEMM uses assembly or compiler intrinsics, compiler scheduler is bypassed.

brada4 commented 2 years ago

There is no CPUID flag to tell whether a CPU has one or two AVX512 units; in the former case there is no convincing performance gain from AVX-512 over AVX2 on the same core.

FCLC commented 2 years ago

There is no CPUID flag to tell whether a CPU has one or two AVX512 units; in the former case there is no convincing performance gain from AVX-512 over AVX2 on the same core.

To the question of notable performance gains, please see the attached tests from the opening issue: https://openbenchmarking.org/result/2111077-TJ-ALDERAVX567

Or if preferred, here's a battery of tests from my own current testing environment: personal link to OpenBenchmarking

The P-cores on the chip have been confirmed by Intel to be the same Golden Cove cores as are present on the upcoming Sapphire Rapids.

This includes the branch predictor, the number of ALUs, the AVX units, etc.

The unofficial word from Intel is that AVX-512 is disabled on the Alder Lake platform when E-cores are enabled due to the Gracemont cores not supporting the same extended ISA.

In turn, because current-generation OS schedulers and the hardware thread director are not up to the task of directing programs to run certain instructions (such as AVX-512 in our case) only on certain cores/threads, the easier solution was taken.

That solution was to brute-force disable the cores in order to enable the extended instructions, thereby bypassing the thread director and falling back to "normal" hardware-assisted scheduling.

martin-frbg commented 2 years ago

I plan to merge the AVX512-specific autodetection change I outlined earlier, but I'm not convinced of the need for a specific build TARGET (at least until the small gap in supported instruction sets w.r.t SapphireRapids becomes relevant). (BTW of your battery of tests I assume the NumPy one will be the most relevant - assuming that it is using a NumPy built to use OpenBLAS ?)

brada4 commented 2 years ago

https://www.phoronix.com/scan.php?page=article&item=rocket-lake-avx512&num=2 lc0 is OpenBLAS, avx512 advantage does not justify less cores.

FCLC commented 2 years ago

https://www.phoronix.com/scan.php?page=article&item=rocket-lake-avx512&num=2 lc0 is OpenBLAS, avx512 advantage does not justify less cores.

The linked tests are of the 8-core 11900K from the Rocket Lake generation, whose AVX-512 is equivalent to the Ice Lake Xeons. That is not the architecture in question.

If the idea is illustration/comparison against the 10th-series Comet Lake 10900K(S), where we went down from 10 cores with AVX2 only, and how that was detrimental in some workloads, I agree with you.

Where this differs is that, especially in the case of the i7, the E-cores can often be a net detriment to performance.

This is due to a few factors, excluding AVX-512.

The largest of these factors is that the ring bus on Alder Lake is driven at a net offset from the lowest-clocked active core, and the ring is shared by both the E-cores and the P-cores.

When the E-cores are enabled, the ring bus, which drives DRAM, L3 cache and other shared subsystems, is downclocked relative to them.

Typically this means the ring will run at no higher than 3.2 GHz.

When they are disabled, the ring clocks at an offset from the lowest-clocked P-core, with boost frequencies typically closer to 4.8-4.9 GHz.

It also frees up the shared L3 cache normally given priority to the E-cores, increasing the effective amount of per-core cache available.

Beyond the above, there's the issue of modern schedulers and iterative solvers not being built to deal with vastly different per-core performance on the same host.

On a different distributed host, sure; but when core A, a P-core with SMT, can compute certain tasks 40+% faster than core L, an E-core without SMT, and the solver's scheduler doesn't know to apply different weights per queue per core, you end up with a net loss in performance, even if on the i7 you're "dropping" from 12 cores (8P+4E) to 8P.

(Then comes the question of, since we have AVX-512 built into the chip, and there's a net benefit, why not enable it?)

Now, as for the specifics of the target being aliased to Sapphire Rapids: it comes down to SR and Alder Lake both being built on the same Golden Cove core. In the case of SR it's the only core on board, whereas for AL it's used for the P-cores.

This means that any core architecture specific optimizations found for SR should also then be portable to AL.

On the GCC, kernel and LLVM mailing lists, among others, we've already seen some optimizations come in that use identical code paths for AL and SR in all non-AVX-related matters.

Any code that is targeted for SR and has SR-specific AVX-512 optimizations courtesy of Intel provides a net execution-time benefit over being left to use default instruction cost tables or falling back on -march=icelake/cannonlake etc.

In short, as with most optimized libraries, I want an option available for workstation users who know what they have available to them, so they can squeeze every last drop of performance out of their system prior to sending work to a cluster.

Beyond an auto-detection mechanism, the request for an alias to SR specifically is to guarantee the use of the SR-optimized code paths and instruction cost tables on the compiler side while building. Over time, if/when SR-specific optimizations are contributed, those few of us running an AVX-512 AL configuration would see the benefit.

Part of the reason for this is to preview and begin optimizing different suites for the new core architecture of SR, using AL as an analogue while we await its release to market near the end of Q2.

I'm more than happy to build and serve as a guinea pig if that's deemed useful

martin-frbg commented 2 years ago

Why the need for an alias, rather than let the user specify TARGET=SAPPHIRERAPIDS (or not specify a TARGET and have the build script figure out on its own that it is appropriate to build for SR) ? Do you expect or know the cost tables to be dramatically different, requiring -march=alderlake rather than sapphirerapids ?

FCLC commented 2 years ago

Why the need for an alias, rather than let the user specify TARGET=SAPPHIRERAPIDS (or not specify a TARGET and have the build script figure out on its own that it is appropriate to build for SR) ? Do you expect or know the cost tables to be dramatically different, requiring -march=alderlake rather than sapphirerapids ?

The issue I expect to see is that, if and when OpenBLAS begins to implement AMX-specific instructions in its routines for accelerating things like int8 and bf16, those would as of now produce a binary that could not run on Alder Lake.

Alder Lake and SR both support int8 and bf16 via AVX-512 targets, but SR incorporates what Intel is calling a "tile" on board, which acts as a dedicated matrix engine attached via an on-die interconnect. It's similar to the way AMD connects "chiplets" to each other via their "Infinity Fabric".

The reason for wanting/needing to specify -march=sapphirerapids over -march=alderlake is that the alderlake target has no cost tables associated with AVX-512 at all, and will actually disable use of the instructions in GCC 11-12, Clang 11-14 and Intel ICC/ICX.

For now the workaround for most of my testing has been to use

CFLAGS='-O3 -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -$ExtraCFlags'
CXXFLAGS='-O3 -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -$ExtraCXXFlags'
FFLAGS='-O3 -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -$ExtraFFlags'

or the equivalent for a given compiler.

Where -$ExtraCFlags, for certain builds using GCC 11.2 or a GCC 12 pre-release on certain memory-sensitive applications, will be something along the lines of -fallow-store-data-races -fgcse-las -fgcse-after-reload -fdevirtualize-at-ltrans -fdevirtualize-speculatively -fsched-spec-load-dangerous -fsched-spec-load -fsemantic-interposition -fgraphite-identity -floop-nest-optimize -ftree-loop-im -ftree-loop-ivcanon -fivopts -ftree-vectorize -flto -fwhole-program -fuse-linker-plugin -funroll-loops -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -ltcmalloc_minimal -std=c17

to maximize throughput while also making use of the TCMalloc implementation of malloc instead of glibc's. (Different topic, but there seem to be some regressions in glibc's memory allocation subroutines and how they interact with heavily vectorized workloads across most architectures I've tested against.) (Hoard and jemalloc are also viable, but that's very much OT.)

martin-frbg commented 2 years ago

So -march the same as SR, and the only concern is a possible later addition of AMX to the current SR kernels ? In that case I think this can be handled if and when that happens - at that point there'd either be a need to split off the "pre-AMX" versions for Alder Lake, or guarding the additions with an ifdef HAVE_AMX. (Possibly the latter would be preferable, if Intel starts to market lower-end SR spin-offs with disabled AMX tiles (due to manufacturing defects or power budget) at some point)

FCLC commented 2 years ago

So -march the same as SR, and the only concern is a possible later addition of AMX to the current SR kernels ? In that case I think this can be handled if and when that happens - at that point there'd either be a need to split off the "pre-AMX" versions for Alder Lake, or guarding the additions with an ifdef HAVE_AMX.

So for now keep as we are and use -SR?

That's fine by me.

Part of this is me trying to get ahead, in some projects/libraries, of weirdness in hardware that can/will be coming down the line, including workstation chips on the Alder Lake platform/socket that will not have AMX tiles but will otherwise present themselves as SR.

The designated chipset is W680 but not much other information is currently public.

martin-frbg commented 2 years ago

Right - with the (equivalent of the) above patch also applied to the runtime cpu detection, the avx512 Alder Lake will also be treated as SR in DYNAMIC_ARCH builds (as usually included with Linux distributions, NumPy, etc.) And I'll add the HAVE_AMX property (based on cpuid flags) for future use. Sometime this weekend...

ZhennanWu commented 1 year ago

Just dropping by. I also have a big-core only Alder Lake machine (i7-12700k with Asrock bios). When comparing gcc flags of native vs sapphirerapids without amx, I found:

$ gcc -march=native -Q --help=target > gcc_target_native
$ gcc -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -Q --help=target > gcc_target_spr_no_amx
$ diff gcc_target_native gcc_target_spr_no_amx
12c12
<   -mabm                               [enabled]
---
>   -mabm                               [disabled]
27c27
<   -march=                             alderlake
---
>   -march=                             sapphirerapids
49c49
<   -mavx512vp2intersect                [enabled]
---
>   -mavx512vp2intersect                [disabled]
59c59
<   -mcldemote                          [disabled]
---
>   -mcldemote                          [enabled]
71c71
<   -menqcmd                            [disabled]
---
>   -menqcmd                            [enabled]
92,93c92,93
<   -mhle                               [disabled]
<   -mhreset                            [enabled]
---
>   -mhle                               [enabled]
>   -mhreset                            [disabled]
143c143
<   -mprefer-vector-width=              none
---
>   -mprefer-vector-width=              256
163c163
<   -msgx                               [disabled]
---
>   -msgx                               [enabled]
165c165
<   -mshstk                             [enabled]
---
>   -mshstk                             [disabled]
191c191
<   -mtsxldtrk                          [disabled]
---
>   -mtsxldtrk                          [enabled]
193c193
<   -mtune=                             alderlake
---
>   -mtune=                             sapphirerapids
195c195
<   -muintr                             [disabled]
---
>   -muintr                             [enabled]
202c202
<   -mwbnoinvd                          [disabled]
---
>   -mwbnoinvd                          [enabled]

This suggests that, on top of AMX, there are more instruction incompatibilities between sapphirerapids and alderlake. It seems to suggest we should use

-march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mabm -mavx512vp2intersect -mno-cldemote -mno-enqcmd -mno-hle -mhreset -mno-sgx -mshstk -mno-uintr -mno-wbnoinvd

However, even with the corrections above, the -mhle inconsistency seems to be unfixable: -mno-hle fails to disable hardware lock elision if -march specifies a platform with HLE support. I'm not sure if it is by design in gcc or a gcc bug; please correct me if anyone knows more about the implications of gcc's HLE optimization.

$ gcc -march=sapphirerapids -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mabm -mavx512vp2intersect -mno-cldemote -mno-enqcmd -mno-hle -mhreset -mno-sgx -mshstk -mno-uintr -mno-wbnoinvd -Q --help=target > gcc_target_spr_edited
$ diff gcc_target_native gcc_target_spr_edited
27c27
<   -march=                             alderlake
---
>   -march=                             sapphirerapids
92c92
<   -mhle                               [disabled]
---
>   -mhle                               [enabled]
143c143
<   -mprefer-vector-width=              none
---
>   -mprefer-vector-width=              256
193c193
<   -mtune=                             alderlake
---
>   -mtune=                             sapphirerapids

The above testing is done on a Gentoo Linux with big-core only i7-12700k and gcc version 12.3.1 20230526 (Gentoo 12.3.1_p20230526 p2). If anyone is curious, the optimization flag I use for my setup was

-march=native -mtune=sapphirerapids --param=l1-cache-size=48 --param=l2-cache-size=25600

Which would fix the HLE issue.

brada4 commented 1 year ago

AVX-512 is disabled completely by later microcode as reflected in public specs. https://www.intel.com/content/www/us/en/products/sku/134594/intel-core-i712700k-processor-25m-cache-up-to-5-00-ghz/specifications.html

ZhennanWu commented 1 year ago

AVX-512 is disabled completely by later microcode as reflected in public specs.

True. I'm still using the old microcode, and the OpenBLAS AVX-512 patch for Alder Lake still works perfectly (and I hope it will continue to). I'm just replying to OP's compiler flag suggestions, which seem to leave some inconsistencies behind.

brada4 commented 1 year ago

HLE needs to be supported in pthreads via an alternative code path, invisibly to OpenBLAS; it is not something that OpenBLAS could call directly. It could be left enabled for the sake of completeness, but it will not improve numeric calculation by any means.