JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Performance issue (possible codegen bug?) #21966

Closed. dextorious closed this issue 7 years ago

dextorious commented 7 years ago

Consider the following benchmark code, taken from the collision kernel of a lattice Boltzmann simulation:

function collide_kernel!(n, ρ, ux, uy, ex, ey, w, Q, NX, NY, ω)
  for y in 1 : NY
    for x in 1 : NX
      ρ[x,y], ux[x,y], uy[x,y] = 0., 0., 0.
      for q in 1 : Q
        nq = n[q,x,y]
        ρ[x,y] += nq
        ux[x,y] = muladd(ex[q], nq, ux[x,y])
        uy[x,y] = muladd(ey[q], nq, uy[x,y])
      end
      ρ_inv = 1. / ρ[x,y]
      ux[x,y] *= ρ_inv
      uy[x,y] *= ρ_inv
      usqr = ux[x,y]*ux[x,y] + uy[x,y]*uy[x,y]
      for q in 1 : Q
        eu = 3 * (ex[q]*ux[x,y] + ey[q]*uy[x,y])
        neq = ρ[x,y] * w[q] * ( 1 + eu + 0.5*eu*eu - 1.5*usqr )
        n[q,x,y] = (1 - ω)*n[q,x,y] + ω*neq
      end
    end
  end
  nothing
end

function init_data(N)
  Q = 9
  n = rand(Q,N,N)
  ρ = zeros(N,N)
  ux, uy = zeros(N,N), zeros(N,N)
  ex = Vector{Float64}([0, 1, 0, -1, 0, 1, -1, -1, 1])
  ey = Vector{Float64}([0, 0, 1, 0, -1, 1, 1, -1, -1])
  w  = Vector{Float64}([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36])
  return n, ρ, ux, uy, ex, ey, w, Q, N, N, 0.6
end

Benchmarking the code with data = init_data(1000); @benchmark collide_kernel!(data...) gives a very consistent 67 ms when Julia is started with julia -O3 --check-bounds=no. The Julia installation in question has a system image built to take advantage of the target architecture (Haswell).
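For completeness, the benchmark harness is essentially just the following (assuming the BenchmarkTools.jl package):

using BenchmarkTools

# Julia started with: julia -O3 --check-bounds=no
data = init_data(1000)
@benchmark collide_kernel!(data...)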

To evaluate if this can be optimized further, I ran a C++ version of the same code through clang v4.0.0 (using --std=c++14 -O3 -march=native):

#include <chrono>
#include <iostream>
#include <memory>
#include <random>

static void collision_kernel(double* n, double* ux, double* uy, double* rho,
    const double* ex, const double* ey, const double* w,
    const int Q, const int NX, const int NY, const double omega) {
  for (int i = 0; i < NX; ++i) 
    for (int j = 0; j < NY; ++j) {
      const int idx = i*NY + j;
      ux[idx] = 0.0;
      uy[idx] = 0.0;
      rho[idx] = 0.0;
      for (int q = 0; q < Q; ++q) {
        const double nq = n[idx * Q + q];
        rho[idx] += nq;
        ux[idx] += ex[q] * nq;
        uy[idx] += ey[q] * nq;
      }
      const double rho_inv = 1. / rho[idx];
      ux[idx] *= rho_inv;
      uy[idx] *= rho_inv;
      const double usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
      for (int q = 0; q < Q; ++q) {
        const double eu = 3 * (ex[q] * ux[idx] + ey[q] * uy[idx]);
        const double neq = rho[idx] * w[q] * (1.0 + eu + 0.5*eu*eu - 1.5*usqr);
        n[idx*Q + q] = (1 - omega)*n[idx*Q + q] + omega*neq;
      }
    }  
}

int main() {
  const unsigned int NX = 1000;
  const unsigned int NY = 1000;
  const unsigned int N = NX * NY;
  const unsigned int Q = 9;
  const double ex[9] = { 0., 1., 0., -1., 0., 1., -1., -1., 1. };
  const double ey[9] = { 0., 0., 1., 0., -1., 1., 1., -1., -1. };
  const double w[9] = { 4. / 9., 1. / 9., 1. / 9., 1. / 9., 1. / 9., 1. / 36., 1. / 36., 1. / 36., 1. / 36. };
  double* n = new double[N*Q];
  double* ux = new double[N];
  double* uy = new double[N];
  double* rho = new double[N];
  std::random_device rd;
  std::mt19937 mt(rd());
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  for (auto i = 0; i < N; ++i) {
    ux[i] = 0.0;
    uy[i] = 0.0;
    rho[i] = 1.0;
    for (auto q = 0; q < 9; ++q) {
      n[9 * i + q] = dist(mt);
    }
  }
  const unsigned int benchmarkCount = 1000;
  auto start = std::chrono::steady_clock::now();
  for (unsigned int t = 0; t < benchmarkCount; ++t)
    collision_kernel(n, ux, uy, rho, ex, ey, w, Q, NX, NY, 0.6);
  auto end = std::chrono::steady_clock::now();
  auto diff = end - start;
  std::cout << "n[0] = " << n[0] << std::endl;
  std::cout << "Time = " << std::chrono::duration<double, std::milli>(diff).count() / benchmarkCount << " ms" << std::endl;
  return 0;
}

The C++ code gives a consistent 32 ms, a 2.1x advantage over the Julia code. Comparing the assembly produced by LLVM in both cases, the main differences that stand out involve the use of vector loads/stores and loop unrolling (which is far more extensive in the case of clang). I'm not sufficiently familiar with Julia internals to speculate as to whether this is the consequence of a known limitation or a new issue, so I'm reporting it here just in case.
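For reference, this is roughly how I looked at the generated code on the Julia side (a sketch; code_native takes the function and the argument types):

data = init_data(1000)
code_native(collide_kernel!, map(typeof, data))   # look for vector loads/stores and unrolled loop bodies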

EDIT: versioninfo() output for completeness:

julia> versioninfo()
Julia Version 0.5.2
Commit f4c6c9d4bb* (2017-05-06 16:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libblas
  LAPACK: liblapack
  LIBM: libm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

yuyichao commented 7 years ago

For me the timings become the same when I force the compiler not to inline collision_kernel with __attribute__((noinline)), so I think the deeper unrolling comes from the extra information the compiler gets from the caller, which we do not expose to LLVM.

As for whether the compiler (both gcc and clang) makes the right default choice about unrolling the loop, I think that's an LLVM/GCC issue.

dextorious commented 7 years ago

It's not immediately clear to me why inlining the collision_kernel function would motivate the loop unrolling, but it does seem to be the case. Granting that point, if the decision is made in LLVM (rather than clang), shouldn't we be able to reproduce the same effect by using @inline on the function declaration and calling the function in a way equivalent to the C++ code? I tried doing that, and there is no measurable impact on performance on my machine; it certainly doesn't lead to further loop unrolling.

I'd love to figure out how to reproduce the optimizations seen in the C++ code from Julia. Not to overstate the importance of a single benchmark, but I have seen a number of similar cases where a 2-3x performance difference in favor of C++ appears to result almost solely from decisions made about loop unrolling, so I don't think this is an isolated example. For me, it's currently a question of whether or not to rewrite parts of my codebase in C++ for performance, so if there's anything I can do to further investigate this issue, I'm quite motivated to do so.

Keno commented 7 years ago

C++ can see the constants (after inlining). I think you'll probably see similar performance if you put the constant initialization and the computation into the same function (or use @inline to have that done for you). There may also be some concerns around aliasing. C++ certainly knows that the results of separate allocations don't alias. I'm not sure whether we expose the same information to LLVM, but doing so would probably be pretty simple.
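As a rough sketch of what I mean (a hypothetical, untested wrapper), put the constants and the kernel call in one function body so LLVM can see them after inlining:

function run_benchmark(N)
    Q  = 9
    ex = [0., 1., 0., -1., 0., 1., -1., -1., 1.]
    ey = [0., 0., 1., 0., -1., 1., 1., -1., -1.]
    w  = [4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36]
    n  = rand(Q, N, N)
    ρ  = zeros(N, N)
    ux, uy = zeros(N, N), zeros(N, N)
    # with collide_kernel! marked @inline, the kernel body is compiled with Q and the constant tables in view
    collide_kernel!(n, ρ, ux, uy, ex, ey, w, Q, N, N, 0.6)
end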

dextorious commented 7 years ago

I tried using @inline, to no effect whatsoever. As for aliasing, my original C++ code had explicit __restrict__ annotations, but I found their effect to be very small in this case (within 5-10%). Obviously, I cannot exclude the possibility that the C++ code gets better aliasing analysis even without the annotations, but I'm not sure why that would lead to the loop unrolling seen in this case.

yuyichao commented 7 years ago

Adding @inline by itself won't have any effect, since the caller doesn't know the constant values here.

As long as the C compiler can see the caller, and given that it inlines the function, it can certainly know that the arrays don't alias. You can help the compiler here by writing slightly better code:

function collide_kernel!(n, ρ, ux, uy, ex, ey, w, Q, NX, NY, ω)
    @inbounds for y in 1 : NY
        for x in 1 : NX
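            # accumulate into locals so the stores to ρ/ux/uy cannot alias n inside the q loops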
            ρxy, uxxy, uyxy = 0.0, 0.0, 0.0
            for q in 1 : Q
                nq = n[q,x,y]
                ρxy += nq
                uxxy = muladd(ex[q], nq, uxxy)
                uyxy = muladd(ey[q], nq, uyxy)
            end
            ρ_inv = 1. / ρxy
            uxxy *= ρ_inv
            uyxy *= ρ_inv
            usqr = uxxy * uxxy + uyxy * uyxy
            for q in 1 : Q
                eu = 3 * (ex[q] * uxxy + ey[q] * uyxy)
                neq = ρxy * w[q] * ( 1 + eu + 0.5*eu*eu - 1.5*usqr )
                n[q,x,y] = (1 - ω)*n[q,x,y] + ω*neq
            end
            ρ[x,y], ux[x,y], uy[x,y] = ρxy, uxxy, uyxy
        end
    end
end

The function above alone runs in 21 ms on my laptop, compared to 49 ms for the original Julia version and 22 ms for the C++ code.

Keno commented 7 years ago

Obviously, I cannot exclude the possibility that the C++ code gets better aliasing analysis even without the annotations, but I'm not sure why that would lead to the loop unrolling seen in this case.

Aliasing is pretty much the primary obstacle to loop analyses these days. Without good alias information, the compiler can do very little in the way of LICM (as a result the loop body looks bigger, so the unrolling heuristics make different decisions). I have some changes in the pipeline that improve our alias information, at which point this should no longer be a problem.
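A toy illustration (not from this kernel, just a sketch): unless the compiler can prove that the two arrays below are distinct, it has to reload s[1] on every iteration instead of hoisting the load out of the loop.

function scale!(a, s)
    @inbounds for i in eachindex(a)
        a[i] *= s[1]   # if a and s might alias, s[1] must be re-read every iteration (no LICM)
    end
    return a
end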

yuyichao commented 7 years ago

(The C++ timing was a little wrong; it should be ~22 ms, so C++ and Julia run at the same speed.)

I'm not sure if the unrolling matters much here, but the main issue I see is that the first loop can only be vectorized if the compiler can prove that the stores to ρ[x,y], ux[x,y], uy[x,y] inside the loop don't alias anything else. With the loads/stores on them hoisted out manually, the loop vectorizes, and that's why it runs much faster.

yuyichao commented 7 years ago

So, close as a dup of https://github.com/JuliaLang/julia/issues/19658?

Keno commented 7 years ago

Yes

Keno commented 7 years ago

And a reminder to also teach the compiler that different allocations don't alias, so we'd see the same behavior as C++ when inlining is enabled.

dextorious commented 7 years ago

Thank you both for the help; explicitly avoiding the aliasing issue does make most of the performance difference go away. On my machine the two aren't quite equal yet, but they're within 30%, which is good enough for now.

And I'm eagerly looking forward to noalias hints, whichever form they may ultimately take.

Keno commented 7 years ago

Since you're on Haswell, I should also mention that FMA fusion is currently mostly disabled (though muladd should explicitly allow it). That's a bit of an oversight, but it might explain part of the perf difference.
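If you want to check whether a muladd actually compiles down to a fused instruction on your machine, a quick sketch (fmad is just a throwaway helper name here):

fmad(a, b, c) = muladd(a, b, c)
code_native(fmad, (Float64, Float64, Float64))   # look for a vfmadd* instruction on FMA-capable CPUs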

dextorious commented 7 years ago

I'm generally in the habit of writing explicit muladds in the obvious places for that reason, but it didn't seem to improve the performance here, and the C++ version doesn't seem to make much use of FMAs in this case either, so I didn't pursue the matter. Since I've only tried this on Julia v0.5 and @yuyichao mentioned he got to effective performance parity (which I haven't quite reached yet; I'm within 1.3-1.5x), I'll try the same code on the latest master next and see if that makes any difference (if only from the newer LLVM version).