VZEROUPPER before return?

devinamatthews commented 7 years ago

Intel processors (except Xeon Phi) have a transition penalty when going from AVX to non-VEX-encoded SSE code. This can be avoided by issuing a vzeroupper instruction before returning from the AVX-using function. See e.g. here for some interesting discussion. Some potential issues re BLIS:

If we should call vzeroupper, where should it be? Perhaps we could add a configuration-dependent "cleanup" function that gets called before exit from each API function?
Can we guarantee that there are no AVX<->SSE transitions within the BLIS framework? As of right now the answer is "yes", but with dynamic configuration, the bulk of the framework would need to be compiled with lowest common denominator architecture requirements (SSE!). So perhaps it would be necessary to include additional vzerouppers even inside the framework...
The AVX<->SSE penalty doesn't exist on Bulldozer, but what about Ryzen? (@fgvanzee?)
We will also need to investigate to what extent the compiler inserts vzeroupper in non-inline-assembly kernels (e.g. packing). Perhaps a vzeroupper at the end of the microkernel is all it takes?
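
The configuration-dependent "cleanup" idea from the first item could be sketched roughly as follows. This is a minimal sketch, not BLIS code: every `example_*` name is hypothetical, and it assumes GCC or Clang on x86 (for `__builtin_cpu_supports` and the `target` attribute). On other compilers or architectures it falls back to a no-op.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is BLIS API. */
typedef void (*example_cleanup_fn)(void);

static void example_cleanup_noop(void) { }

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define EXAMPLE_HAVE_X86_DISPATCH 1

/* AVX is enabled for this one function only, so the rest of the file
   can still be compiled for a baseline (SSE2) target. */
__attribute__((target("avx")))
static void example_cleanup_avx(void)
{
    _mm256_zeroupper();
}
#endif

/* Pick the cleanup routine that matches the CPU we are running on. */
static example_cleanup_fn example_select_cleanup(void)
{
#ifdef EXAMPLE_HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx"))
        return example_cleanup_avx;
#endif
    return example_cleanup_noop;
}

/* Stand-in for an API function: do the work, then clean up before returning.
   The lazy initialization here is not thread-safe; it is only a sketch. */
static int example_api_call(int x)
{
    static example_cleanup_fn cleanup = NULL;
    if (cleanup == NULL)
        cleanup = example_select_cleanup();
    assert(cleanup != NULL);

    int result = 2 * x;   /* placeholder for the real computation */

    cleanup();
    return result;
}
```

Selecting the hook once and calling it through a pointer keeps the API functions themselves free of architecture-specific code, which is the property the dynamic-configuration concern above is about.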

jeffhammond commented 7 years ago

Don't put the register zeroing in the ukernel. Put it at library entry/exit. Not sure why you'd need it in the ukernel. Dynamic config should assume AVX on x86 for sanity. What relevant processor doesn't support it?

devinamatthews commented 7 years ago

@jeffhammond true, perhaps at the end of the macrokernel would be a good compromise (but isn't vzeroupper just one cycle anyways?). The appeal of putting it somewhat deeper is that the framework, if compiled for SSE only, could trigger a transition before exiting the ABI.

Re assuming AVX (and/or vzeroupper) for x86, we still support Dunnington, and of course on AMD vzeroupper is actively harmful.

devinamatthews commented 7 years ago

OK, so this does advise using vzeroupper on AMD, for both directions of the transition. Since it is 1 cycle on Intel and only 6 cycles on AMD, putting one at the beginning and end of the macrokernel should be inconsequential. We could also use the __AVX__ macro for detection (assuming the macrokernel is compiled with proper flags for each arch).
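
The `__AVX__`-based detection could look something like the sketch below. `EXAMPLE_VZEROUPPER` and `example_macrokernel` are hypothetical names, and the loop is only a stand-in for the real macrokernel body. The macro expands to the `_mm256_zeroupper()` intrinsic only when the file is compiled with AVX enabled, which is the "proper flags for each arch" assumption above.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>
/* Only emitted when this file is compiled with AVX enabled (e.g. -mavx). */
#define EXAMPLE_VZEROUPPER() _mm256_zeroupper()
#else
/* Pre-AVX configurations (e.g. SSE-only): expands to nothing. */
#define EXAMPLE_VZEROUPPER() ((void)0)
#endif

/* Stand-in for a macrokernel; the loop body represents the microkernel calls. */
static double example_macrokernel(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    EXAMPLE_VZEROUPPER();   /* on entry: caller may have been running SSE code */

    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];

    EXAMPLE_VZEROUPPER();   /* on exit: caller may run SSE code next */
    return acc;
}
```

Because the check happens at compile time, this only works if each architecture's copy of the macrokernel is built with its own flags; a single baseline build would always take the empty branch.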

jeffhammond commented 7 years ago

But you don't have to support Dunnington in a library compiled for multiple dispatch. That's my point. Have an x86 library for pre-AVX that supports whatever SSE you use and a post-AVX library that dispatches as appropriate. Do you really think it makes sense to invest in code for pre-AVX processors for years to come?

devinamatthews commented 7 years ago

So, I did an experiment (with TBLIS) where I timed:

Interleaved sandybridge- and dunnington-based DGEMMs, and
An equal number of sandybridge-only DGEMMs

I then added a single vzeroupper at the end of the sandybridge ukernel and timed again. The results are quite surprising. The effect is uniformly positive for both, but much more pronounced for the AVX-only timing! This may be due to SSE instructions in the framework since it is compiled for a generic x86-64 arch.
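
For readers who want to try a similar comparison, one possible shape for the timing loop is sketched below. It is not the TBLIS test that produced these results: the `example_*` names are hypothetical, the stub stands in for a real DGEMM call, and `timespec_get` (C11) is used for portability even though it reads wall-clock rather than monotonic time.

```c
#include <assert.h>
#include <time.h>

typedef void (*example_gemm_fn)(void);

/* Placeholder kernel so the sketch is self-contained; a real test would
   call an AVX-based or SSE-based DGEMM here instead. */
static int example_call_count = 0;
static void example_stub_kernel(void) { example_call_count++; }

/* Time `pairs` iterations of first() followed by second(), in seconds.
   Interleaved run: first = AVX-based DGEMM, second = SSE-based DGEMM.
   AVX-only run:    first = second = AVX-based DGEMM, so both runs make
                    the same total number of calls. */
static double example_time_pairs(example_gemm_fn first,
                                 example_gemm_fn second,
                                 int pairs)
{
    assert(first != NULL && second != NULL && pairs >= 0);

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < pairs; i++) {
        first();
        second();
    }
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec)
         + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
}
```

Running both configurations with and without the trailing `vzeroupper` in the AVX kernel gives the four timings being compared.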

jeffhammond commented 7 years ago

Intel compiler assumes SSE2 baseline. So "generic" code will use SSE instructions and xmm registers for floating point.

jeffhammond commented 7 years ago

It would be interesting to run the same test with the rest of the framework compiled for AVX.

jeffhammond commented 7 years ago

SDE has options specifically for measuring SSE/AVX transition, by the way:

     -ast                Run the Intel(R) AVX/SSE transition checker
     -oast               Set the output file name for the Intel AVX/SSE
                         transition checker. Implies -ast
                         Default is "avx-sse-transition.out"

devinamatthews commented 7 years ago

With the framework compiled for AVX (here and here), the speedup for the AVX-only DGEMM almost goes away as expected (still some repeatable effect for very small matrices). Speedup for mixed AVX+SSE is still significant though.

devinamatthews commented 7 years ago

I think a vzeroupper at the end of the sandybridge and haswell ukernels is probably the best move at this point. I can also test Skylake and maybe an AMD machine as well. @fgvanzee?
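
As an illustration of what "a vzeroupper at the end of the ukernel" means, here is a toy AVX routine written with intrinsics. It is only a sketch: the actual sandybridge and haswell ukernels are inline assembly, the `example_*` names are hypothetical, and a compiler may already insert `vzeroupper` on its own for intrinsic code like this. GCC or Clang on x86 is assumed for the AVX path, with a scalar fallback otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: y[0..3] += alpha * x[0..3]. */
static void example_axpy4_ref(double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < 4; i++)
        y[i] += alpha * x[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define EXAMPLE_HAVE_AVX_PATH 1

__attribute__((target("avx")))
static void example_axpy4_avx(double alpha, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(alpha);
    __m256d vx = _mm256_loadu_pd(x);
    __m256d vy = _mm256_loadu_pd(y);

    vy = _mm256_add_pd(vy, _mm256_mul_pd(va, vx));
    _mm256_storeu_pd(y, vy);

    /* Clear the upper halves of the ymm registers before returning, so
       SSE code in the caller does not pay the transition penalty. */
    _mm256_zeroupper();
}
#endif

/* Use the AVX path only when the running CPU supports it. */
static void example_axpy4(double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
#ifdef EXAMPLE_HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        example_axpy4_avx(alpha, x, y);
        return;
    }
#endif
    example_axpy4_ref(alpha, x, y);
}
```

In an inline-assembly kernel the equivalent is a single `vzeroupper` instruction placed after the final stores and before the function returns.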

jeffhammond commented 7 years ago

"AVX only no VZEROUPPER" seems to be a big upside. I will still be able to build that way, right?

devinamatthews commented 7 years ago

No point: "AVX only no VZEROUPPER" (when built with AVX in the framework) == "AVX only with VZEROUPPER" (with or without AVX in the framework). The second plot shows speedup due to compiling the framework with AVX and is not a global comparison.

In fact, adding vzeroupper still improves performance for m,n,k < 100.

devinamatthews commented 7 years ago

Interestingly there is no effect from either vzeroupper or compiling the framework with AVX on Skylake.

flame / blis

VZEROUPPER before return? #149