loveshack opened this issue 5 years ago
@loveshack Thanks for your contribution, Dave. I was not even aware of this W line of parts. (I was confused at first because "desktop SKX" seemed contradictory; up until now we knew all desktop Skylakes to be sans AVX-512. But it seems the W is for workstation, which makes sense in that it's targeted at configurations that want AVX-512 but don't necessarily have space for the server-grade part, and/or don't need as many cores.)
@devinamatthews Could you give your stamp of approval to this patch? @dnparikh seems to recall there being an issue of 1 VPU vs 2 VPUs, but I don't have any memory of this one way or another.
Comments on #351
up until now we knew all desktop Skylakes to be sans AVX-512
Except for Skylake-X i7 and i9! And Cannon Lake! And Cascade Lake!
Except for Skylake-X i7 and i9! And Cannon Lake! And Cascade Lake!
Ugh. I know nothing, then.
Thanks for your comments, Devin.
My Skylake-X i9-9980XE is also misidentified as Haswell.
@mratsim can you send the output of /proc/cpuinfo or the equivalent on your platform?
Here you go:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz
stepping : 4
microcode : 0x2000043
cpu MHz : 1406.852
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 6002.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
The full output is at https://gist.github.com/mratsim/419062e11ee1f66daa62c7fe4c13dc5d
If that helps: for my own BLAS implementation purposes (see https://github.com/pytorch/pytorch/issues/26534#issuecomment-537269042), I only test whether the CPU implements AVX512F (via the CPUID instruction: leaf 7, EBX bit 16 set); see my CPU detection code.
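For readers who want to replicate that check, here is a minimal sketch in C (not mratsim's actual detection code, which is linked above), assuming a GCC or Clang toolchain that provides <cpuid.h>:

#include <cpuid.h>
#include <stdio.h>

// Returns 1 if CPUID leaf 7 (subleaf 0) reports AVX512F in EBX bit 16.
static int has_avx512f(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;               // leaf 7 not supported on this CPU
    return (ebx >> 16) & 1;     // EBX bit 16 = AVX512F
}

int main(void)
{
    printf("AVX512F: %s\n", has_avx512f() ? "yes" : "no");
    return 0;
}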
@mratsim Can you run https://github.com/jeffhammond/vpu-count to see if that detects AVX-512 2x FMA correctly?
@fgvanzee @devinamatthews @loveshack I have to wonder if addressing my comment here (https://github.com/flame/blis/pull/351/files/94b34d38f6dffc074a4f12a1936c0ddba51f47ee..5597169dda1c4cb0c447cb589f8d5c2a5418a259#diff-81ef49aa7330af78381263dcf0acbea8) would solve this problem.
@mratsim For context, the reason BLIS detects your CPU as Haswell is that it incorrectly thinks it has only 1 FMA unit. The handful of SKX processors that have only 1 FMA should be treated as Haswell by BLIS, for reasons discussed in https://github.com/flame/blis/pull/351.
That's a good point. In my own code I assumed that anyone who cares about numerical computing and bought an AVX-512 CPU was informed enough to get one with 2 FMA units.
Here is the output of the test.x script:
$ ./test.x
0x0: 16,756e6547,6c65746e,49656e69
Intel? yes
0x1: 50654,16400800,7ffefbbf,bfebfbff
signature: 0x050654
model: 0x55=85
family: 0x06=6
ext model: 0x05=5
Skylake server? yes
0x7: 0,d39ffffb,0,c000000
Skylake AVX-512 detected
cpu_name = Intel(R) Core(TM) i9-9980XE CPU
cpu_name[9] = C
cpu_name[17] =
CPU has 2 AVX-512 VPUs
The empirical script, however, sometimes gives me 1 and sometimes 2. My CPU is overclocked (4.1 GHz all-core turbo, 4.0 GHz all-core AVX turbo, 3.5 GHz all-core AVX-512 turbo; see perf profile), so overclocking plus a non-performance CPU governor might throw off the script.
My Skylake-X i9-9980XE is also misidentified as Haswell.
That should be fixed by the changes I submitted, which don't seem to be wanted. (If the CPU isn't recognized, the cpuid code might take a fraction of a second to run measurement code, but that needs a suitably-licensed implementation; Gromacs' version compiles with GCC, but is GPL. Otherwise, it's probably best just to default to the normal case of 2×FMA.)
For what it's worth, it looks as if OpenBLAS now has decent (but unreleased) skylakex support, so if you just want a good BLAS, it will probably be the best option.
@loveshack Your suggestion of OpenBLAS here is total garbage. Unless there has been a rewrite for SKX, it's nowhere near as fast on SKX. See SkylakeX on the Performance Wiki page for details.
Anyone can trivially fix the problem in this issue by setting the configuration name explicitly, which is how I just built a SKX binary on my HSW workstation. Using the build system options effectively is much easier than switching BLAS libraries.
./configure skx
I don't know what you are talking about with the licensing issues. vpu-count.c uses the MIT license. I am not going to ask for a license clarification on empirical.c because that method is worse in every way that matters to us. It is shown here once again giving wrong answers to @mratsim due to system noise.
@mratsim You are right that people buy the 2 FMA parts when they are building HPC systems, but there are a lot of academics and software developers at small firms who buy the low-end server CPUs in their workstations. I too was surprised but my activities on this front were motivated by reports of strange DGEMM performance with Skylake Xeon 4xxx SKUs on GitHub.
@loveshack Your suggestion of OpenBLAS here is total garbage. Unless there has been a rewrite for SKX, it's nowhere near as fast on SKX.
Aggressively addressing something different when confronted with uncomfortable facts is the sort of tactic I expect from disreputable politicians. BLIS certainly knows how to drive off outside contributors.
I have actually compared development BLIS and OpenBLAS -- believe it or not -- rather than talking through my hat. (The releases both use avx2 on my W-series box, but OpenBLAS does rather better.) The OpenBLAS author also claims to be able to outperform MKL on avx2.
Anyone can trivially fix the problem in this issue by setting the configuration name explicitly. Using the build system options effectively is much easier than switching BLAS libraries.
First you have to understand the undocumented issue, then on heterogeneous systems you need to build N copies of the library and ensure they're correctly used at run time. I see that sort of mess, and the consequences. [Switching ELF dynamically-linked BLAS is trivial, and is supported by my packaging.]
On the other hand, I contributed code to fix this issue in all the cases I could find, to diagnose it, and to override the micro-arch selection dynamically.
I don't know what you are talking about with the licensing issues. vpu-count.c uses the MIT license.
https://github.com/jeffhammond/vpu-count/blob/master/empirical.c is not MIT licensed according to the header. It also won't compile with GCC.
I am not going to ask for a license change on empirical.c because that method is worse in every way that matters to us. It is shown here once again giving wrong answers to @mratsim due to system noise.
If it's junk it would be helpful to warn potential users. The Gromacs version appears more robust; it uses a different timer. Anyhow, all bets are off for performance under such conditions, and it would only be a fallback if you're going to default to assuming one FMA unit.
Since you raise variance, note that I'm entitled to ignore measurements without error bars, like the published BLIS ones.
@loveshack Your suggestion of OpenBLAS here is total garbage. Unless there has been a rewrite for SKX, it's nowhere near as fast on SKX.
Aggressively addressing something different when confronted with uncomfortable facts is the sort of tactic I expect from disreputable politicians. BLIS certainly knows how to drive off outside contributors.
Please note that I am an outside contributor.
I have actually compared development BLIS and OpenBLAS -- believe it or not -- rather than talking through my hat. (The releases both use avx2 on my W-series box, but OpenBLAS does rather better.) The OpenBLAS author also claims to be able to outperform MKL on avx2.
Please post the data.
Anyone can trivially fix the problem in this issue by setting the configuration name explicitly. Using the build system options effectively is much easier than switching BLAS libraries.
First you have to understand the undocumented issue, then on heterogeneous systems you need to build N copies of the library and ensure they're correctly used at run time. I see that sort of mess, and the consequences. [Switching ELF dynamically-linked BLAS is trivial, and is supported by my packaging.]
What is not documented? Are you suggesting that the auto and skx configuration options are not properly documented?
I don't know what you are talking about with the licensing issues. vpu-count.c uses the MIT license.
https://github.com/jeffhammond/vpu-count/blob/master/empirical.c is not MIT licensed according to the header. It also won't compile with GCC.
True, but I am not telling anyone to use it, so why does it matter?
I am not going to ask for a license change on empirical.c because that method is worse in every way that matters to us. It is shown here once again giving wrong answers to @mratsim due to system noise.
If it's junk it would be helpful to warn potential users.
It is not junk. It just isn't recommended for most users.
https://github.com/jeffhammond/vpu-count/blob/master/README.md#usage has been modified to address this.
The Gromacs version appears more robust; it uses a different timer. Anyhow, all bets are off for performance under such conditions, and it would only be a fallback if you're going to default to assuming one FMA unit.
As I've said many times in the past, the default should be 2 FMA units on server platforms. The server parts with 1 FMA unit are the exception.
Since you raise variance, note that I'm entitled to ignore measurements without error bars, like the published BLIS ones.
This comment is not made in good faith and has been ignored.
For what it's worth, it looks as if OpenBLAS now has decent (but unreleased) skylakex support, so if you just want a good BLAS, it will probably be the best option.
Since we are veering off-topic (but the original problem was yours, and it's understood, with a potential fix underway), allow me to expand on my use of BLAS libraries.
I interact with BLAS / BLIS wearing 3 different hats.
I focus on data science workloads. While much of the compute-intensive work is offloaded to the GPU, there are still many cases where a CPU BLAS is needed, for example Principal Component Analysis.
As I have an Intel CPU, linking to MKL gives me the best performance. I recompiled the latest OpenBLAS from source and it gave me 2.75 TFlops on my machine; MKL reached 3.37 TFlops, and the theoretical peak is about 4.03 TFlops (at the 3.5 GHz all-core AVX-512 turbo).
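(For reference, a rough back-of-the-envelope check of that peak, assuming single precision and 2 VPUs per core: 18 cores × 3.5 GHz × 2 FMA units × 16 FP32 lanes × 2 flops per FMA = 4032 GFlops ≈ 4.03 TFlops.)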
There is one library that requires BLIS, spaCy (https://github.com/explosion/spaCy), which is the industry standard in Natural Language Processing. The reason is the flexibility in strides that other BLAS libraries don't provide; see https://github.com/explosion/cython-blis. spaCy is at 15k GitHub stars and used everywhere in NLP. I'd like to see BLIS correctly detect my CPU so that NLP workloads, which are becoming huge in size (e.g. Wikipedia is in the terabytes, although it's usually processed on GPU), use the full extent of my CPU.
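As a minimal sketch of what that stride flexibility looks like (this assumes the BLIS typed API, e.g. bli_dgemm, and is not spaCy's or cython-blis's actual code): every operand takes an explicit row stride and column stride, so a non-contiguous view can be multiplied without first making it contiguous.

#include "blis.h"

// Hypothetical helper (illustrative only): C += A * B where each operand is a
// possibly non-contiguous view described by a (row stride, column stride) pair.
void gemm_strided_view( dim_t m, dim_t n, dim_t k,
                        double* a, inc_t rsa, inc_t csa,
                        double* b, inc_t rsb, inc_t csb,
                        double* c, inc_t rsc, inc_t csc )
{
    double alpha = 1.0, beta = 1.0;

    bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
               m, n, k,
               &alpha, a, rsa, csa,
                       b, rsb, csb,
               &beta,  c, rsc, csc );
}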
I develop Arraymancer, a tensor library for the Nim programming language, think of it as Numpy + Sklearn + PyTorch but only for data science in terms of scope. Like many other libraries, I stand on top of BLAS and users can compile in any library they desire. I even provide a specialized BLIS backend and compilation flag that avoids making a tensor contiguous before doing a matrix multiplication. https://github.com/mratsim/Arraymancer/blob/v0.5.2/src/tensor/backend/blis_api.nim
I encountered the following difficulties with BLAS libraries. Note that many issues are not in the hands of BLAS developers, but ultimately, as the user-facing library developer, I am the one who has to deal with them:
All of the composability and deployment woes led me to 2 things:
The goal behind developing my own BLAS is to understand their intricacies.
Like many others, I am using the BLIS approach instead of the GotoBLAS/OpenBLAS approach for its ease of use:
The performance is also there.
On my CPU, my own BLAS reaches 2.7~2.8 TFlops, similar to Intel MKL + GNU OpenMP or OpenBLAS. There is a caveat though: this is only with GCC OpenMP; with Clang/ICC I only reach 2.1 TFlops, probably because the libkomp underneath doesn't properly support #pragma omp taskloop, which I found necessary to parallelize both the ic and jr loops. I couldn't find the trick that OpenBLAS / BLIS are using to parallelize multiple loops.
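For readers unfamiliar with the construct, here is a minimal, hypothetical sketch in C (not Arraymancer's or BLIS's actual code) of using #pragma omp taskloop on both an outer ic loop and an inner jr loop of a blocked GEMM; MC and NR are placeholder block sizes and the micro-kernel call is elided:

#include <omp.h>

// Hypothetical blocked-GEMM skeleton: the ic loop walks row blocks of C and the
// jr loop walks micro-panels within a block; taskloop lets both levels generate
// tasks, which is one way to get nested parallelism across the two loops.
void macro_kernel( int m, int nc, int MC, int NR )
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop
        for ( int ic = 0; ic < m; ic += MC )        // ic loop: row blocks of C
        {
            #pragma omp taskloop
            for ( int jr = 0; jr < nc; jr += NR )   // jr loop: micro-panels
            {
                // micro-kernel call on block (ic, jr) would go here
            }
        }
    }
}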
In short, even if BLIS usage is lower than that of OpenBLAS or MKL, it is the leading learning platform and introduction to high-performance computing.
I could also extend BLIS to prepacked GEMM with minimal effort.
To tackle composability issues, vectorization, optimization, and also autodifferentiation, I started to write my own linear algebra and deep learning compiler, implemented as an embedded DSL in Nim, so that it can be seamlessly used in user code. However, I quickly hit the limits of OpenMP again.
As I think OpenMP's limits are fundamental, and also given the poor state of some features in one runtime or the other (no work-stealing in GCC, no taskloop in Clang/ICC), I developed my own multithreading runtime from scratch, Weave, with the goal of being the backend of my high-performance libraries.
The runtime is now pretty solid; I ported my own BLAS to it and can reach 2.65 TFlops with nested parallelism. There is work-stealing overhead that GCC OpenMP doesn't have, but in contrast I don't suffer from the load-imbalance, threshold, or grain-size issues that are plaguing PyTorch: https://github.com/zy97140/omp-benchmark-for-pytorch.
So now I want to benchmark that runtime against the BLIS approach to parallelization, to check what state-of-the-art speedup (time parallel / time serial) the BLIS approach brings beyond my core kernel. As a comparison, the Intel MKL + Intel OpenMP speedup is 15~16x and OpenBLAS is at 14x, while my runtime is at 15~15.5x (if I allow workers to back off when they can't steal work) or 16.9x (if I don't allow them to back off).
@mratsim w.r.t. threading I have been meaning for some time to port BLIS to my TCI threading library that I use in TBLIS. This library can use either thread-based (OpenMP, pthreads) or task-based (TBB, GCD, PPL) back-ends. On Haswell TBLIS+TBB can beat MKL+TBB quite handily (perf. only a few % lower than OpenMP), although KNL had some teething issues. Haven't tested SKX but I would be hopeful. I would also be interested in seeing if Weave is something that could be used as a back-end in TCI.
w.r.t. the rest, I really don't see BLIS as a BLAS library (role 1)--MKL is free, so why the heck wouldn't people use that? What is unique about BLIS is a) for library developers (role 2) you get great interface extensions like general stride and now mixed-domain and mixed-precision, 1m, and more to come in the future, and b) for low-level developers (role 3) you get a nice toolbox of linear algebra pieces to build new operations with (this isn't the easiest thing right now, we are working on making this much more powerful in the future). For example, I don't really care at all about GEMM; I care about tensor contraction, row- and column-scaled multiplication, three-matrix SYRK-like multiplication, GEMM-then-cwise-dot, etc. that don't even have standard interfaces or existing high-performance implementations.
Folks,
Let me second that. With BLIS, we have always been willing to give up 5% for flexibility, maintainability, and extendability. We are always delighted when people take building blocks or ideas from BLIS and build their own. We don’t feel threatened when others opt for other solutions or roll their own. One of our greatest delights comes from people realizing that providing BLAS-like functionality is not just for experts. Indeed, we have a MOOC for that (which will start again on Jan. 15: https://www.edx.org/course/laff-on-programming-for-high-performance). Competition is a wonderful thing.
And on quite a few occasions, BLIS is the fastest, as a bonus.
Have a BLISful New Year,
Robert
@mratsim is your CPU still misidentified? If so please send the full output of configure.
Both test.x and empirical.x properly detect 2 VPUs (as of commit https://github.com/jeffhammond/vpu-count/commit/b20db6dd60679d3ae310069a1a571b29312bcc1c)
@mratsim The code in BLIS is slightly different from @jeffhammond's code. Can you test with BLIS? Configuring with configure auto should show that it selects the skx2 sub-configuration.
As of commit 9c5b485d356367b0a1288761cd623f52036e7344, this is my ./configure auto output:
configure: detected Linux kernel version 5.7.12-arch1-1.
configure: python interpeter search list is: python python3 python2.
configure: using 'python' python interpreter.
configure: found python version 3.8.5 (maj: 3, min: 8, rev: 5).
configure: python 3.8.5 appears to be supported.
configure: C compiler search list is: gcc clang cc.
configure: using 'gcc' C compiler.
configure: C++ compiler search list is: g++ clang++ c++.
configure: using 'g++' C++ compiler (for sandbox only).
configure: found gcc version 10.1.0 (maj: 10, min: 1, rev: 0).
configure: checking for blacklisted configurations due to gcc 10.1.0.
configure: checking gcc 10.1.0 against known consequential version ranges.
configure: found assembler ('as') version 2.34.0 (maj: 2, min: 34, rev: 0).
configure: checking for blacklisted configurations due to as 2.34.0.
configure: reading configuration registry...done.
configure: determining default version string.
configure: found '.git' directory; assuming git clone.
configure: executing: git describe --tags.
configure: got back 0.7.0-38-g9c5b485d.
configure: truncating to 0.7.0-38.
configure: starting configuration of BLIS 0.7.0-38.
configure: configuring with official version string.
configure: found shared library .so version '3.0.0'.
configure: .so major version: 3
configure: .so minor.build version: 0.0
configure: automatic configuration requested.
configure: hardware detection driver returned 'skx'.
configure: checking configuration against contents of 'config_registry'.
configure: configuration 'skx' is registered.
configure: 'skx' is defined as having the following sub-configurations:
configure: skx
configure: which collectively require the following kernels:
configure: skx haswell zen
configure: checking sub-configurations:
configure: 'skx' is registered...and exists.
configure: checking sub-configurations' requisite kernels:
configure: 'skx' kernels...exist.
configure: 'haswell' kernels...exist.
configure: 'zen' kernels...exist.
configure: no install prefix option given; defaulting to '/usr/local'.
configure: no install exec_prefix option given; defaulting to PREFIX.
configure: no install libdir option given; defaulting to EXECPREFIX/lib.
configure: no install includedir option given; defaulting to PREFIX/include.
configure: no install sharedir option given; defaulting to PREFIX/share.
configure: final installation directories:
configure: prefix: /usr/local
configure: exec_prefix: ${prefix}
configure: libdir: ${exec_prefix}/lib
configure: includedir: ${prefix}/include
configure: sharedir: ${prefix}/share
configure: NOTE: the variables above can be overridden when running make.
configure: no preset CFLAGS detected.
configure: no preset LDFLAGS detected.
configure: debug symbols disabled.
configure: disabling verbose make output. (enable with 'make V=1'.)
configure: disabling ARG_MAX hack.
configure: building BLIS as both static and shared libraries.
configure: exporting only public symbols within shared library.
configure: threading is disabled.
configure: requesting slab threading in jr and ir loops.
configure: internal memory pools for packing blocks are enabled.
configure: internal memory pools for small blocks are enabled.
configure: memory tracing output is disabled.
configure: libmemkind not found; disabling.
configure: compiler appears to support #pragma omp simd.
configure: the BLAS compatibility layer is enabled.
configure: the CBLAS compatibility layer is disabled.
configure: mixed datatype support is enabled.
configure: mixed datatype optimizations requiring extra memory are enabled.
configure: small matrix handling is enabled.
configure: the BLIS API integer size is automatically determined.
configure: the BLAS/CBLAS API integer size is 32-bit.
configure: configuring for conventional gemm implementation.
configure: creating ./config.mk from ./build/config.mk.in
configure: creating ./bli_config.h from ./build/bli_config.h.in
configure: creating ./obj/skx
configure: creating ./obj/skx/config/skx
configure: creating ./obj/skx/kernels/skx
configure: creating ./obj/skx/kernels/haswell
configure: creating ./obj/skx/kernels/zen
configure: creating ./obj/skx/ref_kernels/skx
configure: creating ./obj/skx/frame
configure: creating ./obj/skx/blastest
configure: creating ./obj/skx/testsuite
configure: creating ./lib/skx
configure: creating ./include/skx
configure: mirroring ./config/skx to ./obj/skx/config/skx
configure: mirroring ./kernels/skx to ./obj/skx/kernels/skx
configure: mirroring ./kernels/haswell to ./obj/skx/kernels/haswell
configure: mirroring ./kernels/zen to ./obj/skx/kernels/zen
configure: mirroring ./ref_kernels to ./obj/skx/ref_kernels
configure: mirroring ./ref_kernels to ./obj/skx/ref_kernels/skx
configure: mirroring ./frame to ./obj/skx/frame
configure: creating makefile fragments in ./obj/skx/config/skx
configure: creating makefile fragments in ./obj/skx/kernels/skx
configure: creating makefile fragments in ./obj/skx/kernels/haswell
configure: creating makefile fragments in ./obj/skx/kernels/zen
configure: creating makefile fragments in ./obj/skx/ref_kernels
configure: creating makefile fragments in ./obj/skx/frame
configure: configured to build within top-level directory of source distribution.
When grepping the repo, I don't see skx2 anywhere. What branch should I use?
I have a desktop SKX, model W-2123, which the cpuid code identifies as haswell (obvious with configure auto). It turns out that it doesn't report AVX-512 VPUs due to not parsing the model name. I fixed it with #351.