gap-system / gap

Main development repository for GAP - Groups, Algorithms, Programming, a System for Computational Discrete Algebra
https://www.gap-system.org
GNU General Public License v2.0

Matrix multiplication is sometimes slow #5317

Open james-d-mitchell opened 1 year ago

james-d-mitchell commented 1 year ago

This isn't a bug report or a regression (as far as I'm aware), just something I noticed recently and mentioned this morning to @fingolfin, @ThomasBreuer, and @wucas. Here are the timings, from good to bad:

gap> x := RandomMat(2000, 2000, GF(2));;
gap> x * x;;
gap> time;
24
gap> x := RandomMat(2000, 2000, GF(3));;
gap> x * x;;
gap> time;
673

Very good.

gap> x := RandomMat(2000, 2000, Integers);;
gap> x * x;;
gap> time;
11957

This isn't so good. For comparison, with numpy I get:

In [1]: import numpy as np
In [2]: x = np.random.rand(2000,2000)
In [3]: %timeit np.matmul(x,x)
155 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is not a like-for-like comparison, because the entries of the numpy matrix x are floats, but I suppose it represents some sort of best (fastest) case.

A similar computation with eigen:

static std::random_device       rd;
static std::mt19937             g(rd());
std::uniform_int_distribution<> d(-12, 12); // to be somewhat similar to GAP

Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> y;
y.resize(2000, 2000);
for (size_t r = 0; r < y.rows(); ++r) {
  for (size_t c = 0; c < y.cols(); ++c) {
    y(r, c) = d(g);
  }
}
REQUIRE(y * y != y);

This takes about 500ms on my computer (somewhat slower than I might have thought). Finally, for reference, with libsemigroups:

static std::random_device       rd;
static std::mt19937             g(rd());
std::uniform_int_distribution<> d(-12, 12);

IntMat<> x(2000, 2000);
for (size_t r = 0; r < x.number_of_rows(); ++r) {
  for (size_t c = 0; c < x.number_of_cols(); ++c) {
    x(r, c) = d(g);
  }
}
REQUIRE(x * x != x);

This takes about 750ms. I include this last comparison because the matrix multiplication in libsemigroups is not particularly tuned; it's just:

void product_inplace(Subclass const& A, Subclass const& B) {
  size_t const             N = A.number_of_rows();
  std::vector<scalar_type> tmp(N, 0);

  for (size_t c = 0; c < N; c++) {
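    // copy column c of B into a contiguous buffer, for cache locality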
    for (size_t i = 0; i < N; i++) {
      tmp[i] = B(i, c);
    }
    for (size_t r = 0; r < N; r++) {
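      // entry (r, c) of the product: row r of A combined with column c of B
      // using the semiring's plus and prod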
      (*this)(r, c) = std::inner_product(
          A._container.begin() + r * N,
          A._container.begin() + (r + 1) * N,
          tmp.begin(),
          zero(),
          [this](scalar_type x, scalar_type y) {
            return this->plus(x, y);
          },
          [this](scalar_type x, scalar_type y) {
            return this->prod(x, y);
          });
    }
  }
}

Finally, for matrices over larger finite fields I get:

gap> x := RandomMat(2000, 2000, GF(2 ^ 9));;
gap> x * x;
<2000x2000-matrix over GF(2^9)>
gap> time;
128474

I haven't looked at the implementations in GAP, but will at some point if someone else doesn't beat me to it.

james-d-mitchell commented 1 year ago

@ThomasBreuer pointed out that RandomMatrix is from the Semigroups package, so I've just updated the examples to use RandomMat from the GAP library. Thanks @ThomasBreuer

fingolfin commented 1 year ago

GAP implements naive schoolbook matrix multiplication, i.e. with O(n^3) entry multiplications. The slower the multiplication of entries, the more this matters. That numpy multiplies floats is absolutely significant, because float multiplication is among the most heavily tuned operations on any modern CPU. In contrast, GAP is multiplying arbitrary-size integers, and even with our optimization for small integers, this is MUCH slower than float multiplication.
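
To put rough numbers on that: a schoolbook product of two 2000x2000 matrices needs 2000^3 = 8 * 10^9 entry multiplications (and about as many additions). The numpy timing above (155ms) works out to roughly 0.02ns per entry multiply-add, while the GAP timing (about 12s) works out to roughly 1.5ns, so the factor of ~75 lives entirely in the per-entry cost, not in the operation count.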

Likewise multiplying in a generic large finite field is slooow.

I don't know about your other examples, e.g. for the Eigen one, where do the coefficients live?

I'd also be surprised if numpy did not implement e.g. Strassen multiplication.
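
For illustration only (to show the idea, not as a claim about what numpy or its BLAS backend actually does), here is a sketch of one level of Strassen's recursion, written with Eigen and assuming square matrices of even dimension; the names are mine:

#include <Eigen/Dense>

using Mat = Eigen::MatrixXd;

// One level of Strassen: split each factor into 2x2 blocks and form the
// product from 7 block multiplications instead of the schoolbook 8.
Mat strassen_one_level(Mat const& a, Mat const& b) {
  Eigen::Index const h = a.rows() / 2;  // assumes even, square dimensions
  Mat a11 = a.topLeftCorner(h, h),    a12 = a.topRightCorner(h, h),
      a21 = a.bottomLeftCorner(h, h), a22 = a.bottomRightCorner(h, h);
  Mat b11 = b.topLeftCorner(h, h),    b12 = b.topRightCorner(h, h),
      b21 = b.bottomLeftCorner(h, h), b22 = b.bottomRightCorner(h, h);

  // the seven products
  Mat m1 = (a11 + a22) * (b11 + b22);
  Mat m2 = (a21 + a22) * b11;
  Mat m3 = a11 * (b12 - b22);
  Mat m4 = a22 * (b21 - b11);
  Mat m5 = (a11 + a12) * b22;
  Mat m6 = (a21 - a11) * (b11 + b12);
  Mat m7 = (a12 - a22) * (b21 + b22);

  Mat c(a.rows(), b.cols());
  c.topLeftCorner(h, h)     = m1 + m4 - m5 + m7;
  c.topRightCorner(h, h)    = m3 + m5;
  c.bottomLeftCorner(h, h)  = m2 + m4;
  c.bottomRightCorner(h, h) = m1 - m2 + m3 + m6;
  return c;
}

Applied recursively this gives the O(n^2.81) bound, and applied over rings where entry multiplication is expensive it trades some of those multiplications for cheaper additions.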

james-d-mitchell commented 1 year ago

Thanks for your comments @fingolfin; the entries in the Eigen example are doubles, as in:

Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> y;
              ^^^^^^

fingolfin commented 1 year ago

The relevant kernel function in GAP is ProdVectorMatrix, which multiplies a vector by a matrix and has special optimizations for integer entries. It has to jump through quite some hoops for this.
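
To see why that one function matters for matrix-matrix products: the library-level product of two matrices is (roughly) one vector-times-matrix product per row of the left factor, so essentially all of the work lands in ProdVectorMatrix. A minimal sketch of that shape in C++ (my own names, not GAP's kernel code):

#include <cstddef>
#include <vector>

using Row = std::vector<long>;
using Mat = std::vector<Row>;

// analogue of ProdVectorMatrix: v * m, accumulated as a linear combination
// of the rows of m
Row prod_vector_matrix(Row const& v, Mat const& m) {
  Row result(m[0].size(), 0);
  for (std::size_t i = 0; i < v.size(); ++i) {
    for (std::size_t j = 0; j < result.size(); ++j) {
      result[j] += v[i] * m[i][j];
    }
  }
  return result;
}

// analogue of ProdMatrixMatrix: one vector-times-matrix product per row of a
Mat prod_matrix_matrix(Mat const& a, Mat const& b) {
  Mat result;
  result.reserve(a.size());
  for (Row const& row : a) {
    result.push_back(prod_vector_matrix(row, b));
  }
  return result;
}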

fingolfin commented 1 year ago

For reference, OSCAR is fast(er) than GAP:

julia> n=2000 ; m = matrix(ZZ, [rand(-5:5) for i in 1:n, j in 1:n]); @time m^2;
  0.378739 seconds (2.96 k allocations: 122.243 MiB, 22.68% gc time, 0.94% compilation time)

ThomasBreuer commented 1 year ago

Concerning the comparison between GAP and OSCAR, things look different if one considers matrices over GF(2). Here GAP is a bit faster than OSCAR, and the OSCAR runtime for the integer matrix is only twice that for the GF(2) matrix.

Note also that creating the random matrices in GAP is slow. @james-d-mitchell had observed this in his experiments, and OSCAR is also much faster than GAP in creating random matrices.

james-d-mitchell commented 1 year ago

I investigated this a bit further yesterday, and it seems that essentially 100% of the time is spent in ProdVectorMatrix for integer matrices constructed as:

x := RandomMat(1000, 1000, Integers);;

Some time is also spent in ARE_INTOBJS, which can probably be avoided to some extent (since, if I understand correctly, once the accumulated values or the values in the vector/matrix no longer satisfy IS_INTOBJ they never satisfy IS_INTOBJ again, so the checks don't need to be repeated), but this only gives a very modest improvement in performance. The remainder of the time is spent in ProdVectorMatrix itself, sum_intobjs, and prod_intobjs.

I also tried implementing a version of ProdMatrixMatrix along the same lines as in libsemigroups (copying a column of the second argument into a temporary plist, for better cache locality or whatever the correct term is), but this didn't provide any improvement in performance (it was more or less equivalent to the current method). I also tried using std::inner_product rather than ProdVectorVector, and this too didn't provide any improvement (as might be expected).

I think the takeaway from this is that there isn't any "quick win" for improving the performance here. @fingolfin do you have any idea what OSCAR does, or where one might look to find out?

fingolfin commented 1 year ago

@james-d-mitchell how did you measure this? Note that ARE_INTOBJS should always be inlined and turned into 1-2 CPU instructions; if you see time spent in it, that suggests a potential problem with the measurement method.

EXPORT_INLINE Int ARE_INTOBJS(Obj o1, Obj o2)
{
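    // both operands are immediate (small) integers iff both have the low tag bit set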
    return (Int)o1 & (Int)o2 & 0x01;
}

james-d-mitchell commented 1 year ago

Here's the profile produced (just now) with Instruments on my Mac:

[Screenshot: Instruments time profile, 2023-01-19]