flame / blis

BLAS-like Library Instantiation Software Framework

slow generic implementation #259

Open loveshack opened 5 years ago

loveshack commented 5 years ago

I was assuming that BLIS is generally better than reference BLAS, so substituting the latter with BLIS in the OS packages I'm working on would always be sensible. However, I found BLIS more than two times slower for medium-sized dgemm on x86_64/RHEL7 with a "generic" build, compared with the system reference blas package (which should be built with -O2 -mtune=generic, not -O3). I can't usefully test an architecture without a tuned implementation, but I don't see any reason to think the result would be much different there, though I haven't looked into the gcc optimization.

Is that expected, or might it be something worth investigating?

devinamatthews commented 5 years ago

The generic implementation will have better cache behavior than netlib BLAS, but will also do packing, which will slow things down for small and medium-sized matrices. It's not totally clear from your comment whether or not this is the configuration that BLIS is using; please correct me if I am mistaken.
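
For concreteness, "packing" here means copying panels of A and B into contiguous, microkernel-friendly buffers before the inner loops run; that copy is pure overhead at small sizes. A minimal sketch of the idea, with hypothetical names rather than BLIS's actual packing routines:

#define MR 4  /* hypothetical register blocking for rows of A */

/* Copy an MR x k panel of A (element strides rs_a/cs_a) into a
   contiguous buffer, column by column, in the order a microkernel
   would read it. The extra O(MR*k) memory traffic is what hurts
   for small and medium matrices. */
static void pack_a_panel(int k, const double* a, int rs_a, int cs_a,
                         double* a_packed)
{
    for (int p = 0; p < k; p++)
        for (int i = 0; i < MR; i++)
            *a_packed++ = a[i*rs_a + p*cs_a];
}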

jeffhammond commented 5 years ago

@devinamatthews It may also be that Fortran is better than C :trollface:

loveshack commented 5 years ago

You wrote:

The generic implementation will have better cache behavior than netlib BLAS,

That's what I thought.

but will also do packing, which will slow things down for small and medium-sized matrices.

but I didn't know about that. At what sort of size does packing stop hurting (and I wonder if it could usefully be adaptive)? I tried 2000×2000 to get a few runs done in a reasonable time. I've just tried 4000 square, which looks about the same.

It's not totally clear from your comment whether or not this is the configuration that BLIS is using; please correct me if I am mistaken.

I built a BLIS dynamic library with default flags and the generic target. I took the openblas dgemm benchmark (which actually linked against openblas), and ran it with either BLIS or reference BLAS LD_PRELOADed. Is that clearer?

I could examine the compilation results and profiles at some stage when I have more time, but thought it was worth asking the experts first -- thanks.

loveshack commented 5 years ago

You wrote:

@devinamatthews It may also be that Fortran is better than C :trollface:

Of course, but a sometime GNU Fortran maintainer knows how :-/.

devinamatthews commented 5 years ago

OK, I guess I'm not really clear why you care about the performance of the BLIS generic configuration. Even with cache blocking it will never be "high performance".

cdluminate commented 5 years ago

At least it is true that the builds on non-x86_64 architectures are slow, due to the slow tests. https://launchpad.net/~lumin0/+archive/ubuntu/ppa/+sourcepub/9451410/+listing-archive-extra Click on a build and you can see the time elapsed for the whole compile+test process.

fgvanzee commented 5 years ago

@cdluminate I took a look at some of the build times, as you suggest. It is true that the build time is excessive for your s390x build, for example (50 minutes, if I'm reading the output correctly). Much of that can be attributed to the fact that we do not have optimized kernels for every architecture. s390x is one of those unoptimized architectures. Still, this does feel a bit slow.

(Digression: If you would like to reduce the total build time, I recommend running the "fast" version of the BLIS testsuite, which is almost surely where most of the time is being spent. Right now, make test triggers the BLAS test drivers + the full BLIS testsuite. You can instead use make check, which runs the BLAS test drivers + a shortened version of the BLIS testsuite.)

However, strangely, your amd64 build still requires almost 19 minutes. That is still quite long. I just did a quick test on my 3.6GHz Broadwell. Targeting x86_64 at configure-time, I found that:

Perhaps the hardware for your amd64 build is old? Or maybe oversubscribed?

An unrelated question: I assume that the name of your amd64 build refers generically to "the build for x86_64 microarchitectures," as it does in the Gentoo Linux world, and not AMD-specific hardware. Am I correct?

cdluminate commented 5 years ago

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution, it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their build farm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

fgvanzee commented 5 years ago

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution, it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

That's fine. I often prefer the full testsuite in my own development, too, but I thought I would offer the faster alternative, since many people in the past have been happy to avoid the many tests that are nearly identical to each other when it saves them 5-10x in time.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their build farm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

Good, that's what I thought/expected. Thanks.

cdluminate commented 5 years ago

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Just nitpicking: Launchpad, or PPA, is Ubuntu's infrastructure, supported by the company Canonical. Debian is supported by an independent community that theoretically doesn't rely on Ubuntu or Canonical.

The pages you see are not powered by Debian's build hardware. What I'm doing there is abusing Ubuntu's free build machines to build stuff on Ubuntu cosmic for testing Debian packages. (Ubuntu cosmic, i.e. Ubuntu 18.10, is very close to Debian unstable, so testing Debian packages on an Ubuntu machine sometimes makes sense.)

fgvanzee commented 5 years ago

Just nitpicking: ...

Unlike most people, I will almost never be bothered by nitpicking! I like and appreciate nuance. :) Thanks for those details.

fgvanzee commented 5 years ago

BTW, since I don't use Debian, I have to rely on people like you and @nschloe for your expertise on these topics (understanding how we fit into the Debian/Ubuntu universes). Thanks again.

jeffhammond commented 5 years ago

Field:

Next time a vendor offers to donate hardware, you might ask for a big SSD so you can set up a virtual machine for every Linux distro. Just a thought.

fgvanzee commented 5 years ago

@jeffhammond In principle, I agree with you. However, this is the sort of thing that is not as practical now that our group is so small. (It also doesn't help that maintaining machines in our department comes with a non-trivial amount of cost and of red tape.) Instead, I'm going to channel you circa 2010 and say, "we look forward to your patch." And by that I mean, "someone doing it for us."

devinamatthews commented 5 years ago

@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully-unrolled branch and temporary storage of C to the kernel, e.g.:

...
if (m == MR && n == NR)
{
    // unroll all MR*NR FMAs into temporaries
}
else
{
    // as usual
}
...
// accumulate at the end instead of along the way

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.

loveshack commented 5 years ago

You wrote:

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution, it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

For what it's worth, that's not what's normally done for Fedora. On the slower build platforms it would likely time out, and it can perturb mass rebuilds considerably. I consider the "check" step in rpm builds basically a sanity check, especially as in cases like this you can't test a relevant range of micro-architectures. [The make check target was added for that, but I also test the Fortran interface with gfortran, rather than relying on the f2c'd versions.]

For Fedora, I don't care about build times unless they're pathological, especially as they're very variable on the build VMs.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

[And for confusion, Fedora just uses x86_64 (which is probably less correct).]

loveshack commented 5 years ago

You wrote:

OK, I guess I'm not really clear why you care about the performance of the BLIS generic configuration. Even with cache blocking it will never be "high performance".

This is for OS packaging purposes. I assumed I could say that using BLIS would be strictly better than reference BLAS, i.e. that the reference blas package is redundant even on platforms not supported by the blis or openblas packages (apart from compatibility tests).

loveshack commented 5 years ago

You wrote:

Field:

Next time a vendor offers to donate hardware, you might ask for a big SSD so you can set up a virtual machine for every Linux distro. Just a thought.

For what it's worth, I frequently spin up VMs with vagrant, which is mostly practical at least up to a cluster of three or so, on an 8GB/HDD laptop.

However, it's reasonable to leave specific distribution work to packagers, as long as the basic build system doesn't put obstacles in the way, and I think we've already got the relevant hooks like xFLAGS. Thanks.

Also for what it's worth, I've tested rpm packaging for SuSE in the configurations supported by Fedora's copr as well as for the range of supported RHEL/Fedora targets, and my amd64 Debian desktop.

loveshack commented 5 years ago

You wrote:

@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully-unrolled branch and temporary storage of C to the kernel, e.g.:

...
if (m == MR && n == NR)
{
    // unroll all MR*NR FMAs into temporaries
}
else
{
    // as usual
}
...
// accumulate at the end instead of along the way

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.

I haven't had a chance to investigate further, but I did find that building generic with -march=native -Ofast -funroll-loops doesn't make a dramatic difference -- not that -march=native can be used for packaging anyhow. (Part of the reason I expected BLIS to do better is that the -O3 it uses enables vectorization -- though only sse2 with -march=generic -- cf. the -O2 used for the reference blas package.) Then again, I've never understood why compilers do so badly on, say, matmul.

devinamatthews commented 5 years ago

@loveshack What architectures in particular are you having a problem with?

fgvanzee commented 5 years ago

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.

@devinamatthews It's not clear from context if you were under the impression that reference kernels were not already compiled with architecture-specific flags, but indeed they are. (Or maybe you are referring to a different variety of flags than I am.) Either way, make V=1 would confirm.

Or did you mention architecture-specific flags because you knew that @loveshack could not use -march=native and the like for packaging purposes?

devinamatthews commented 5 years ago

@fgvanzee I was mostly talking about the actual generic configuration vs. the reference kernel being used in a particular configuration.

fgvanzee commented 5 years ago

@devinamatthews Ah, makes sense. Thanks for clarifying. Yeah, generic doesn't do jack except use -O3, which I'm guessing in our world doesn't do much either.

jeffhammond commented 5 years ago

It might be interesting to see if simd pragmas cause anything better to happen with the reference kernel. I’ve got a list of all of those, in addition to the obvious OpenMP one.
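
A sketch of what that might look like on a generic rank-1-update loop -- illustrative only, not the actual BLIS reference kernel, with MR/NR/ab as assumed names:

#include <stddef.h>

/* Build with -fopenmp-simd (gcc/clang) or -qopenmp-simd (icc) so the
   pragma takes effect. It asks the compiler to vectorize the NR-wide
   inner loop regardless of its usual cost model. */
void gemm_ref_simd(size_t k, size_t MR, size_t NR,
                   const double* restrict a, const double* restrict b,
                   double* restrict ab)
{
    for (size_t p = 0; p < k; p++)
    {
        for (size_t i = 0; i < MR; i++)
        {
            #pragma omp simd
            for (size_t j = 0; j < NR; j++)
                ab[i*NR + j] += a[i] * b[j];
        }
        a += MR;
        b += NR;
    }
}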

loveshack commented 5 years ago

You wrote:

@loveshack What architectures in particular are you having a problem with?

The Fedora architectures that BLIS doesn't support are, I think, i686, ppc64, ppc64le, and s390x; there will be more in Debian. (OpenBLAS does those apart from ppc64, so we can at least use a free BLAS on most Fedora architectures.)

loveshack commented 5 years ago

You wrote:

@devinamatthews Ah, makes sense. Thanks for clarifying. Yeah, generic doesn't do jack except use -O3, which I'm guessing in our world doesn't do much either.

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by including vectorization.

loveshack commented 5 years ago

You wrote:

It might be interesting to see if simd pragmas cause anything better to happen with the reference kernel. I’ve got a list of all of those, in addition to the obvious OpenMP one.

Yes, but I guess the first thing to do is to consult a detailed profile and gcc's optimization report. I'll have a look at it eventually, but I don't know whether results from x86_64 would be representative of other architectures I can't currently access. (I'll try to get on aarch64 and power8 at some stage.)

devinamatthews commented 5 years ago

i686, ppc64, ppc64le, and s390x

@loveshack For which of those architectures can we assume vectorization with the default flags?

fgvanzee commented 5 years ago

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by including vectorization.

I might be willing to add such a flag or flags if you can recommend some that are relatively portable. And ideally, you would tell me the analogues of such flags on clang and icc, if applicable.

devinamatthews commented 5 years ago

@fgvanzee I would suggest:

  1. Changing the default MR and NR to 4x16, 4x8, 4x8, 4x4 (sdcz).
  2. Rewriting the reference gemm kernel to: a. be row-major, b. be fully unrolled in the k loop (this means you wouldn't be able to change MR/NR without writing a custom kernel but that seems reasonable), c. use temporary variables for C, and d. use restrict.
  3. Adding configurations for whatever is missing for packaging (s390x, ppc64, etc.) to get at least baseline vectorization flags for the reference kernels.

Rationale: rewriting the reference kernel this way should allow for a reasonable degree of auto-vectorization given the right flags. The larger kernel sizes and row-major layout would allow for 128b and 256b vectorization, with higher bandwidth from L1 than L2. I measure up to a 6x increase in performance for AVX2 in a quick mock test.

devinamatthews commented 5 years ago

FYI, even with a super obvious kernel (to a human), clang and gcc both fail miserably at auto-vectorization. Still better than not unrolled though.

/* 4x8 dgemm microkernel: accumulate A*B over k, then store the block to C with row stride rs_c. */
void ukernel1(int k, const double* restrict a, const double* restrict b, double* restrict c, int rs_c)
{
    double c00 = 0, c01 = 0, c02 = 0, c03 = 0;
    double c04 = 0, c05 = 0, c06 = 0, c07 = 0;
    double c10 = 0, c11 = 0, c12 = 0, c13 = 0;
    double c14 = 0, c15 = 0, c16 = 0, c17 = 0;
    double c20 = 0, c21 = 0, c22 = 0, c23 = 0;
    double c24 = 0, c25 = 0, c26 = 0, c27 = 0;
    double c30 = 0, c31 = 0, c32 = 0, c33 = 0;
    double c34 = 0, c35 = 0, c36 = 0, c37 = 0;

    while (k --> 0)
    {
        double a0 = a[0];
        c00 += a0*b[0]; c01 += a0*b[1]; c02 += a0*b[2]; c03 += a0*b[3];
        c04 += a0*b[4]; c05 += a0*b[5]; c06 += a0*b[6]; c07 += a0*b[7];
        double a1 = a[1];
        c10 += a1*b[0]; c11 += a1*b[1]; c12 += a1*b[2]; c13 += a1*b[3];
        c14 += a1*b[4]; c15 += a1*b[5]; c16 += a1*b[6]; c17 += a1*b[7];
        double a2 = a[2];
        c20 += a2*b[0]; c21 += a2*b[1]; c22 += a2*b[2]; c23 += a2*b[3];
        c24 += a2*b[4]; c25 += a2*b[5]; c26 += a2*b[6]; c27 += a2*b[7];
        double a3 = a[3];
        c30 += a3*b[0]; c31 += a3*b[1]; c32 += a3*b[2]; c33 += a3*b[3];
        c34 += a3*b[4]; c35 += a3*b[5]; c36 += a3*b[6]; c37 += a3*b[7];
        a += 4;
        b += 8;
    }

    c[0*rs_c + 0] = c00; c[0*rs_c + 1] = c01;
    c[0*rs_c + 2] = c02; c[0*rs_c + 3] = c03;
    c[0*rs_c + 4] = c04; c[0*rs_c + 5] = c05;
    c[0*rs_c + 6] = c06; c[0*rs_c + 7] = c07;
    c[1*rs_c + 0] = c10; c[1*rs_c + 1] = c11;
    c[1*rs_c + 2] = c12; c[1*rs_c + 3] = c13;
    c[1*rs_c + 4] = c14; c[1*rs_c + 5] = c15;
    c[1*rs_c + 6] = c16; c[1*rs_c + 7] = c17;
    c[2*rs_c + 0] = c20; c[2*rs_c + 1] = c21;
    c[2*rs_c + 2] = c22; c[2*rs_c + 3] = c23;
    c[2*rs_c + 4] = c24; c[2*rs_c + 5] = c25;
    c[2*rs_c + 6] = c26; c[2*rs_c + 7] = c27;
    c[3*rs_c + 0] = c30; c[3*rs_c + 1] = c31;
    c[3*rs_c + 2] = c32; c[3*rs_c + 3] = c33;
    c[3*rs_c + 4] = c34; c[3*rs_c + 5] = c35;
    c[3*rs_c + 6] = c36; c[3*rs_c + 7] = c37;
}

fgvanzee commented 5 years ago

@devinamatthews Is this the code that gave you a 6x improvement over status quo reference microkernels?

devinamatthews commented 5 years ago

Basically, yes.

devinamatthews commented 5 years ago

gcc actually does a bit better with it not unrolled but with constant loop bounds. icc is the only one that can get halfway close to a nice vectorized ukernel.

fgvanzee commented 5 years ago

Basically, yes.

Was the 6x improvement observed from code compiled by gcc?

gcc actually does a bit better with it not unrolled but with constant loop bounds.

gcc does better with not-unrolled code with constant loop bounds, vs...

  1. the code you provided above?
  2. status quo reference microkernels?

hominhquan commented 5 years ago

@devinamatthews I just looked at your code, and it seems we could use the GNU vector extension to write things more compactly. In my experience, this even allows gcc to emit efficient SIMD instructions from its backend (with -ffast-math to get fma).

I have no idea whether icc natively supports this kind of code; if not, I think it is always manageable with an intrinsics header.

PS: doing loads/stores through v8df can also enable 512-bit load/store hardware instructions (if available), which is a little faster.

typedef double v8df __attribute__((__vector_size__ (64)));

void ukernel1(int k, const double* restrict a, const double* restrict b, 
    double* restrict c, int rs_c)
{
    v8df c07 = {0}; // c0-7
    v8df c17 = {0}; // c1-7
    v8df c27 = {0}; // c2-7
    v8df c37 = {0}; // c3-7

    while (k --> 0)
    {
        // expand each element of rows of a to vec-length
        v8df a07 = {a[0], a[0], a[0], a[0], a[0], a[0], a[0], a[0]};
        v8df a17 = {a[1], a[1], a[1], a[1], a[1], a[1], a[1], a[1]};
        v8df a27 = {a[2], a[2], a[2], a[2], a[2], a[2], a[2], a[2]};
        v8df a37 = {a[3], a[3], a[3], a[3], a[3], a[3], a[3], a[3]};

        // load b
        v8df b07 = *((v8df*)b);

        // -ffast-math
        c07 = (a07 * b07) + c07;
        c17 = (a17 * b07) + c17;
        c27 = (a27 * b07) + c27;
        c37 = (a37 * b07) + c37;

        a += 4;
        b += 8;
    }

    // store here: either contiguous if row-major or 
    // split v8df into smaller double if rs_c, cs_c
    // 
    // If contiguous row-major:
    // *((v8df*)&c[0*rs_c]) = c07; // EDIT: missing &
    // *((v8df*)&c[1*rs_c]) = c17;
    // *((v8df*)&c[2*rs_c]) = c27;
    // *((v8df*)&c[3*rs_c]) = c37;
    //
    // If split v8df:
    // c[0*rs_c + 0] = c07[0]; c[0*rs_c + 1] = c07[1];
    // c[0*rs_c + 2] = c07[2]; c[0*rs_c + 3] = c07[3];
    // c[0*rs_c + 4] = c07[4]; c[0*rs_c + 5] = c07[5];
    // c[0*rs_c + 6] = c07[6]; c[0*rs_c + 7] = c07[7];
    // and so on for c17, c27, c37
}

devinamatthews commented 5 years ago

@fgvanzee clang(unroll) > gcc(fixed-loop) > gcc(unroll) > clang(fixed-loop) >> ref for AVX2 at least

@hominhquan Is this extension portable to any architecture that gcc supports? Do you know if clang supports it?

devinamatthews commented 5 years ago

@hominhquan clang gets faster, but gcc totally tanks

loveshack commented 5 years ago

You wrote:

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by including vectorization.

I might be willing to add such a flag or flags if you can recommend some that are relatively portable. And ideally, you would tell me the analogues of such flags on clang and icc, if applicable.

I think the icc default optimization level includes unrolling and vectorization.

loveshack commented 5 years ago

You wrote:

i686, ppc64, ppc64le, and s390x

@loveshack For which of those architectures can we assume vectorization with the default flags?

GCC -O3 turns on vectorization independent of target. I don't know what SIMD looks like on them or what other compilers do. You may need (one of the components of) -ffast-math to make it effective, but I forget the details. Some other options affect it apart from the possibility of -fprofile-use; -fopt-info-... provides reports on what it did.

$ gcc --version|head -1
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
$ gcc -Q -O3 --help=optimizers|grep vectoriz.*enabled
  -ftree-loop-vectorize               [enabled]
  -ftree-slp-vectorize                [enabled]

However, the faster reference BLAS I used in comparison only uses -O2. I'd expected the BLIS generic target to use similar stuff to openblas' blocked generic routines, though.

[GCC has a polyhedral optimization facility which ought to do useful tiling automatically on naive matmul, but it hasn't been effective when I've tried it, and I haven't figured out why. Maybe the LLVM version works better, but that wasn't easy to install when I last looked.]

loveshack commented 5 years ago

Regarding a re-write, what was wrong with kernels/old/c99?

loveshack commented 5 years ago

You wrote:

gcc actually does a bit better with it not unrolled but with constant loop bounds. icc is the only one that can get halfway close to a nice vectorized ukernel.

Note that icc does incorrect/unsafe optimizations by default (at least something like -ffast-math). If gcc fails on this sort of thing, and it's not clear why from -fopt-info, it deserves a bug report; I guess I can look when I have time. (DGEMM drove at least one improvement in loop optimization long ago.)

loveshack commented 5 years ago

You wrote:

@hominhquan Is this extension portable to any architecture that gcc supports? Do you know if clang supports it?

__vector_size__ is a generic attribute in GCC, but presumably you need to know the width to use per micro-architecture. See https://gcc.gnu.org/onlinedocs/gcc-6.4.0/gcc/Vector-Extensions.html#Vector-Extensions.
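
For example, the width could be chosen per target at compile time -- a hypothetical sketch (the predefined macros are standard, but the scheme is just an illustration):

/* The typedef syntax is portable across GCC and clang; only the chosen
   width is architecture-specific. 'vdf' is a made-up name. */
#if defined(__AVX512F__)
typedef double vdf __attribute__((__vector_size__(64))); /* 8 doubles */
#elif defined(__AVX__)
typedef double vdf __attribute__((__vector_size__(32))); /* 4 doubles */
#else
typedef double vdf __attribute__((__vector_size__(16))); /* 2 doubles */
#endif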

devinamatthews commented 5 years ago

@loveshack The old/ref/c99 kernel is close to what I am suggesting modulo some minor alterations. I think we switched to the current generic kernel so that MR and NR can be set to whatever, but I don't really think that's necessary.

Regarding how gcc and clang are failing, it depends heavily on exactly how the code is written, but it varies from:

  1. Complete lack of vectorization (i.e. vfmadd231sd and friends).
  2. Vectorized (although maybe with xmm), but doing insane things with the loads of a. Even icc sort of does this with the fully-unrolled kernel until you bump it up to 6x8 so that it doesn't have extra registers to do "cute tricks".
  3. (gcc with v8df) Fully vectorized and great except that it spills basically everything to memory unnecessarily.

rvdg commented 5 years ago

It seems to me that we should get someone from Intel (and maybe gcc) interested in this thread so that the compiler folks can learn from your analysis of their mistakes.

hominhquan commented 5 years ago

clang gets faster, but gcc totally tanks

@devinamatthews I am not familiar with all the x86 SIMD versions and their gcc backends, but I'm quite surprised that gcc performs poorly on this kind of extension. Just wondering: did you pass -O3 -ffast-math to unlock gcc's optimizations?

clang should be better, since LLVM was designed with SIMD in mind.

(gcc with v8df) Fully vectorized and great except that it spills basically everything to memory unnecessarily.

Hmmm, how do you know that the kernel spills things to memory? A recent CPU's register file should be large enough to hold those 4*8 = 32 doubles plus the two matrix pointers, shouldn't it? Or do we even need to rewrite the loop body to recycle the temporary loaded from a? If that's the case I will be disappointed in compilers.

    while (k --> 0)
    {
        v8df a_tmp;

        v8df b07 = *((v8df*)b);

        // This could reduce footprint on a, but may cause pipeline bubble
        a_tmp = {a[0], a[0], a[0], a[0], a[0], a[0], a[0], a[0]};
        c07   = (a_tmp * b07) + c07;

        a_tmp = {a[1], a[1], a[1], a[1], a[1], a[1], a[1], a[1]};
        c17   = (a_tmp * b07) + c17;

        a_tmp = {a[2], a[2], a[2], a[2], a[2], a[2], a[2], a[2]};
        c27   = (a_tmp * b07) + c27;

        a_tmp = {a[3], a[3], a[3], a[3], a[3], a[3], a[3], a[3]};
        c37   = (a_tmp * b07) + c37;

        a += 4;
        b += 8;
    }

devinamatthews commented 5 years ago

@hominhquan Apparently that syntax isn't legal, but I wouldn't think it would make a difference given that the optimization is done in SSA form. Here is a snippet of the assembly generated by gcc (this part does a[0]*b[0..7]; -0x170..-0x80(%rbp) is the ab accumulator, %rsi is a, and %rdx is b):

      c0:       c5 f9 6f 0a                             vmovdqa (%rdx),%xmm1
      c4:       c4 e2 7d 19 1e                          vbroadcastsd (%rsi),%ymm3
      d1:       c5 f8 29 4d 90                          vmovaps %xmm1,-0x70(%rbp)
      d6:       c5 f9 6f 4a d0                          vmovdqa -0x30(%rdx),%xmm1
      db:       c5 fd 28 e3                             vmovapd %ymm3,%ymm4
      df:       c5 f8 29 4d a0                          vmovaps %xmm1,-0x60(%rbp)
      e4:       c5 f9 6f 4a e0                          vmovdqa -0x20(%rdx),%xmm1
      e9:       c5 fd 28 55 90                          vmovapd -0x70(%rbp),%ymm2
      ee:       c4 e2 ed a8 a5 50 ff ff ff              vfmadd213pd -0xb0(%rbp),%ymm2,%ymm4
      f7:       c5 f8 29 4d b0                          vmovaps %xmm1,-0x50(%rbp)
      fc:       c5 f9 6f 4a f0                          vmovdqa -0x10(%rdx),%xmm1
     101:       c5 f8 29 4d c0                          vmovaps %xmm1,-0x40(%rbp)
     106:       c5 fd 28 4d b0                          vmovapd -0x50(%rbp),%ymm1
     10b:       c4 e2 f5 a8 9d 70 ff ff ff              vfmadd213pd -0x90(%rbp),%ymm1,%ymm3
     114:       c5 fd 29 a5 50 fc ff ff                 vmovapd %ymm4,-0x3b0(%rbp)
     11c:       c5 f9 6f ec                             vmovdqa %xmm4,%xmm5
     120:       c5 f8 29 ad 50 ff ff ff                 vmovaps %xmm5,-0xb0(%rbp)
     128:       c5 fd 29 9d 70 fc ff ff                 vmovapd %ymm3,-0x390(%rbp)
     130:       c5 f9 6f e3                             vmovdqa %xmm3,%xmm4
     134:       c5 f9 6f 9d 60 fc ff ff                 vmovdqa -0x3a0(%rbp),%xmm3
     13c:       c5 f8 29 a5 70 ff ff ff                 vmovaps %xmm4,-0x90(%rbp)
     144:       c5 f8 29 9d 60 ff ff ff                 vmovaps %xmm3,-0xa0(%rbp)
     14c:       c5 f9 6f 9d 80 fc ff ff                 vmovdqa -0x380(%rbp),%xmm3
     154:       c5 f8 29 5d 80                          vmovaps %xmm3,-0x80(%rbp)

vs. clang (this is the full a[0..3]*b[0..7] product):

  30:   c5 7d 28 0c 42                          vmovapd (%rdx,%rax,2),%ymm9
  35:   c5 7d 28 54 42 20                       vmovapd 0x20(%rdx,%rax,2),%ymm10
  3b:   c4 62 7d 19 1c 06                       vbroadcastsd (%rsi,%rax,1),%ymm11
  41:   c4 c2 a5 b8 fa                          vfmadd231pd %ymm10,%ymm11,%ymm7
  46:   c4 42 b5 b8 c3                          vfmadd231pd %ymm11,%ymm9,%ymm8
  4b:   c4 62 7d 19 5c 06 08                    vbroadcastsd 0x8(%rsi,%rax,1),%ymm11
  52:   c4 c2 a5 b8 ea                          vfmadd231pd %ymm10,%ymm11,%ymm5
  57:   c4 c2 b5 b8 f3                          vfmadd231pd %ymm11,%ymm9,%ymm6
  5c:   c4 62 7d 19 5c 06 10                    vbroadcastsd 0x10(%rsi,%rax,1),%ymm11
  63:   c4 c2 a5 b8 da                          vfmadd231pd %ymm10,%ymm11,%ymm3
  68:   c4 c2 b5 b8 e3                          vfmadd231pd %ymm11,%ymm9,%ymm4
  6d:   c4 62 7d 19 5c 06 18                    vbroadcastsd 0x18(%rsi,%rax,1),%ymm11
  74:   c4 42 a5 b8 e2                          vfmadd231pd %ymm10,%ymm11,%ymm12
  79:   c4 c2 a5 b8 d1                          vfmadd231pd %ymm9,%ymm11,%ymm2
jeffhammond commented 5 years ago

@rvdg The Intel compiler has an option -qopt-matmul that recognizes the triple-loop matrix-matrix multiplication pattern and replaces it with a call to MKL GEMM. It doesn't get much better than that 😆

rvdg commented 5 years ago

That is a trick that Fred Gustavson introduced in the 1980s for IBM... The original LINPACK benchmark had to be the LINPACK code, compiled as-is. So he came up with the idea of a "code recognizer" that would replace the LINPACK code with an optimized assembly kernel.

loveshack commented 5 years ago

I realized that building generic gives these warnings (with gcc 8):

  ref_kernels/ind/bli_gemmtrsm4m1_ref.c:221:1: warning: passing argument 2 to restrict-qualified parameter aliases with argument 5 [-Wrestrict]
   INSERT_GENTFUNCCO_BASIC3( gemmtrsm4m1_l, BLIS_CNAME_INFIX, BLIS_REF_SUFFIX, BLIS_TRSM_L_UKR )
   ^~~~~~~~~~~~~~~~~~~~~~~~
  ref_kernels/ind/bli_gemmtrsm4m1_ref.c: In function ‘bli_zgemmtrsm4m1_l_generic_ref’:
  ref_kernels/ind/bli_gemmtrsm4m1_ref.c:221:1: warning: passing argument 2 to restrict-qualified parameter aliases with argument 5 [-Wrestrict]
  ref_kernels/ind/bli_gemmtrsm4m1_ref.c: In function ‘bli_cgemmtrsm4m1_u_generic_ref’:
  ref_kernels/ind/bli_gemmtrsm4m1_ref.c:222:1: warning: passing argument 2 to restrict-qualified parameter aliases with argument 5 [-Wrestrict]
   INSERT_GENTFUNCCO_BASIC3( gemmtrsm4m1_u, BLIS_CNAME_INFIX, BLIS_REF_SUFFIX, BLIS_TRSM_U_UKR )
   ^~~~~~~~~~~~~~~~~~~~~~~~
  ref_kernels/ind/bli_gemmtrsm4m1_ref.c: In function ‘bli_zgemmtrsm4m1_u_generic_ref’:
  ref_kernels/ind/bli_gemmtrsm4m1_ref.c:222:1: warning: passing argument 2 to restrict-qualified parameter aliases with argument 5 [-Wrestrict]

Also, for what it's worth, gcc 8 -O3 has additional optimizations, including loop-unroll-and-jam.