Closed: devinamatthews closed this issue 7 years ago
Don't put the register zeroing in the ukernel. Put it at library entry/exit. Not sure why you'd need it in the ukernel. Dynamic config should assume AVX on x86 for sanity. What relevant processor doesn't support it? -- Jeff Hammond jeff.science@gmail.com http://jeffhammond.github.io/
@jeffhammond true, perhaps at the end of the macrokernel would be a good compromise (but isn't `vzeroupper` just one cycle anyways?). The appeal of putting it somewhat deeper is that the framework, if compiled for SSE only, could trigger a transition before exiting the ABI.
Re assuming AVX (and/or `vzeroupper`) for x86, we still support Dunnington, and of course on AMD `vzeroupper` is actively harmful.
OK, so this does advise using `vzeroupper` on AMD, for both directions of the transition. Since it is 1 cycle on Intel and only 6 cycles on AMD, putting one at the beginning and end of the macrokernel should be inconsequential. We could also use the `__AVX__` macro for detection (assuming the macrokernel is compiled with proper flags for each arch).
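A minimal sketch of the `__AVX__`-based detection suggested above, assuming each macrokernel is compiled with the flags for its own architecture. The names `MACROKERNEL_VZEROUPPER`, `MACROKERNEL_HAS_VZEROUPPER`, and `toy_macrokernel` are hypothetical and not part of BLIS; `_mm256_zeroupper()` is the standard intrinsic that emits `vzeroupper`.

```c
#ifdef __AVX__
#include <immintrin.h>
/* Built with AVX enabled: clear the upper halves of the ymm registers. */
#define MACROKERNEL_VZEROUPPER() _mm256_zeroupper()
#define MACROKERNEL_HAS_VZEROUPPER 1
#else
/* Built without AVX (e.g. for Dunnington): there is nothing to clear. */
#define MACROKERNEL_VZEROUPPER() ((void)0)
#define MACROKERNEL_HAS_VZEROUPPER 0
#endif

/* Hypothetical macrokernel skeleton; the loops over microkernel calls
   are elided. Returns 1 if vzeroupper was compiled in, 0 otherwise. */
static int toy_macrokernel(void)
{
    MACROKERNEL_VZEROUPPER();   /* on entry: covers the SSE -> AVX direction */

    /* ... loops over microkernel calls would go here ... */

    MACROKERNEL_VZEROUPPER();   /* on exit: covers the AVX -> SSE direction */
    return MACROKERNEL_HAS_VZEROUPPER;
}
```

With this approach a Dunnington build pays nothing, while an AVX build pays two `vzeroupper` instructions per macrokernel call.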
But you don't have to support Dunnington in a library compiled for multiple dispatch. That's my point. Have an x86 library for pre-AVX that supports whatever SSE you use and a post-AVX library that dispatches as appropriate. Do you really think it makes sense to invest in code for pre-AVX processors for years to come? -- Jeff Hammond jeff.science@gmail.com http://jeffhammond.github.io/
So, I did an experiment (with TBLIS) where I timed:
- `sandybridge`- and `dunnington`-based DGEMMs, and
- `sandybridge`-only DGEMMs

I then added a single `vzeroupper` at the end of the `sandybridge` ukernel and timed again. The results are quite surprising. The effect is uniformly positive for both, but much more pronounced for the AVX-only timing! This may be due to SSE instructions in the framework since it is compiled for a generic x86-64 arch.
Intel compiler assumes SSE2 baseline. So "generic" code will use SSE instructions and xmm registers for floating point.
It would be interesting to run the same test with the rest of the framework compiled for AVX.
SDE has options specifically for measuring SSE/AVX transition, by the way:
    -ast   Run the Intel(R) AVX/SSE transition checker
    -oast  Set the output file name for the Intel AVX/SSE
           transition checker. Implies -ast
           Default is "avx-sse-transition.out"
I think a `vzeroupper` at the end of the `sandybridge` and `haswell` ukernels is probably the best move at this point. I can also test Skylake and maybe an AMD machine as well. @fgvanzee?
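To illustrate what "a `vzeroupper` at the end of the ukernel" means, here is a rough sketch using intrinsics rather than the inline assembly of the real kernels. Everything named `toy_*` is hypothetical, and the kernel itself (a 4-element `y += alpha * x`) is only a stand-in for a DGEMM microkernel. It assumes GCC or Clang, since it relies on the `target` attribute and `__builtin_cpu_supports`.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Toy AVX "microkernel": y[0..3] += alpha * x[0..3].
   The target attribute lets this one function use AVX even when the
   rest of the file is built for a generic (SSE2) baseline. */
__attribute__((target("avx")))
static void toy_ukernel_avx(double alpha, const double *x, double *y)
{
    __m256d vx = _mm256_loadu_pd(x);
    __m256d vy = _mm256_loadu_pd(y);
    vy = _mm256_add_pd(vy, _mm256_mul_pd(_mm256_set1_pd(alpha), vx));
    _mm256_storeu_pd(y, vy);

    /* Clear the upper ymm halves before returning to code that may
       execute non-VEX SSE instructions. */
    _mm256_zeroupper();
}
#endif

/* Scalar reference for CPUs (or architectures) without AVX. */
static void toy_ukernel_ref(double alpha, const double *x, double *y)
{
    for (int i = 0; i < 4; i++)
        y[i] += alpha * x[i];
}

/* Run-time dispatch: only call the AVX path when the CPU supports it. */
static void toy_ukernel(double alpha, const double *x, double *y)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) {
        toy_ukernel_avx(alpha, x, y);
        return;
    }
#endif
    toy_ukernel_ref(alpha, x, y);
}
```

Note that a compiler may already insert its own `vzeroupper` at the exit of an intrinsics-based function like this one; the explicit call matters most for hand-written assembly kernels, where nothing is inserted automatically.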
"AVX only no VZEROUPPER" seems to be a big upside. I will still be able to build that way, right?
No point: "AVX only no VZEROUPPER" (when built with AVX in the framework) == "AVX only with VZEROUPPER" (with or without AVX in the framework). The second plot shows speedup due to compiling the framework with AVX and is not a global comparison.
In fact, adding `vzeroupper` still improves performance for m,n,k < 100.
Interestingly there is no effect from either `vzeroupper` or compiling the framework with AVX on Skylake.
Intel processors (except Xeon Phi) have a transition penalty when going from AVX to non-VEX-encoded SSE code. This can be avoided by issuing a `vzeroupper` instruction before returning from the AVX-using function. See e.g. here for some interesting discussion. Some potential issues re BLIS:
- If we should call `vzeroupper`, where should it be? Perhaps we could add a configuration-dependent "cleanup" function that gets called before exit from each API function?
- Can we guarantee that there are no AVX<->SSE transitions within the BLIS framework? As of right now the answer is "yes", but with dynamic configuration, the bulk of the framework would need to be compiled with lowest common denominator architecture requirements (SSE!). So perhaps it would be necessary to include additional `vzeroupper`s even inside the framework...
- The AVX<->SSE penalty doesn't exist on Bulldozer, but what about Ryzen? (@fgvanzee?)
We will also need to investigate to what extent the compiler inserts `vzeroupper` in non-inline-assembly kernels (e.g. packing). Perhaps a `vzeroupper` at the end of the microkernel is all it takes?
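For comparison, the configuration-dependent "cleanup" function floated in the first question could look roughly like the sketch below. This is an assumption about one possible design, not how BLIS is structured: `toy_config_t`, `toy_select_config`, `toy_api_gemm`, and `cleanup_calls` are all hypothetical names, and the code assumes GCC or Clang for the `target` attribute and `__builtin_cpu_supports`.

```c
#include <stddef.h>

/* Hypothetical per-configuration record; not the actual BLIS context. */
typedef struct {
    const char *name;
    void (*cleanup)(void);   /* run before leaving an API function; may be NULL */
} toy_config_t;

/* Counter used only to make the behavior observable in this sketch. */
static int cleanup_calls = 0;

/* SSE-only configuration (e.g. Dunnington): no cleanup needed. */
static const toy_config_t cfg_sse = { "sse-only", NULL };

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Cleanup for AVX configurations: issue vzeroupper. */
__attribute__((target("avx")))
static void avx_cleanup(void)
{
    _mm256_zeroupper();
    cleanup_calls++;
}

static const toy_config_t cfg_avx = { "avx", avx_cleanup };
#endif

/* Pick a configuration at run time, as dynamic configuration would. */
static const toy_config_t *toy_select_config(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return &cfg_avx;
#endif
    return &cfg_sse;
}

/* Hypothetical API entry point: do the work, then run the cleanup hook. */
static void toy_api_gemm(const toy_config_t *cfg)
{
    /* ... packing and macrokernel calls for this configuration ... */

    if (cfg->cleanup != NULL)
        cfg->cleanup();
}
```

The hook costs one indirect call per API call, but it only protects the caller. It does not address SSE instructions executed by the framework itself between microkernel calls, which is the concern raised in the second question.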