james-d-mitchell opened this issue 1 year ago
@ThomasBreuer pointed out that `RandomMatrix` is from the Semigroups package, so I've just updated to use `RandomMat` from the library. Thanks @ThomasBreuer
GAP implements naive schoolbook matrix multiplication, so with O(n^3) multiplications of entries. The slower the multiplication of entries, the more this matters. That numpy multiplies floats is absolutely significant, because float multiplication is among the most heavily tuned operations on any modern CPU. In contrast, GAP is multiplying arbitrary-size integers, and even with our optimization for small integers, this is MUCH slower than float multiplication.
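(For concreteness, a rough sketch of the schoolbook algorithm, not GAP's actual kernel code; the point is just that the innermost multiply-add runs n^3 times, so the cost of a single entry multiplication is scaled by n^3.)

```cpp
#include <cstddef>
#include <vector>

// Naive schoolbook product of two n x n matrices: the innermost multiply-add
// executes n*n*n times. With double entries each of those is roughly one
// fused multiply-add; with arbitrary-precision integers each one is a
// heap-touching bignum operation, which is orders of magnitude slower.
std::vector<std::vector<double>>
schoolbook(const std::vector<std::vector<double>>& A,
           const std::vector<std::vector<double>>& B) {
  size_t n = A.size();
  std::vector<std::vector<double>> C(n, std::vector<double>(n, 0.0));
  for (size_t i = 0; i < n; ++i) {
    for (size_t j = 0; j < n; ++j) {
      for (size_t k = 0; k < n; ++k) {
        C[i][j] += A[i][k] * B[k][j];  // executed n^3 times
      }
    }
  }
  return C;
}
```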
Likewise multiplying in a generic large finite field is slooow.
I don't know about your other examples, e.g. for the Eigen one, where do the coefficients live?
I'd also be surprised if numpy did not implement e.g. Strassen multiplication.
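(For reference, and independent of what numpy's BLAS backend actually does, Strassen's method computes a 2x2 block product with seven block multiplications instead of eight, which applied recursively gives O(n^(log2 7)) ≈ O(n^2.81) entry multiplications:)

$$
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & M_2 &= (A_{21}+A_{22})B_{11},\\
M_3 &= A_{11}(B_{12}-B_{22}), & M_4 &= A_{22}(B_{21}-B_{11}),\\
M_5 &= (A_{11}+A_{12})B_{22}, & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}),\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), && \\
C_{11} &= M_1+M_4-M_5+M_7, & C_{12} &= M_3+M_5,\\
C_{21} &= M_2+M_4, & C_{22} &= M_1-M_2+M_3+M_6.
\end{aligned}
$$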
Thanks for your comments @fingolfin, the entries in eigen are `double`s, as in:

Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> y;
              ^^^^^^
The relevant kernel function in GAP is `ProdVectorMatrix`, which multiplies a vector by a matrix, and which has special optimization for integer entries. It has to jump through quite some hoops for this.
For reference, OSCAR is fast(er) than GAP:
julia> n=2000 ; m = matrix(ZZ, [rand(-5:5) for i in 1:n, j in 1:n]); @time m^2;
0.378739 seconds (2.96 k allocations: 122.243 MiB, 22.68% gc time, 0.94% compilation time)
Concerning the comparison between GAP and OSCAR, things look different if one considers matrices over GF(2). Here GAP is a bit faster than OSCAR; and the OSCAR runtime for the integer matrix is only twice that for the GF(2) matrix.
Note also that creating the random matrices in GAP is slow. @james-d-mitchell had observed this in his experiments, and OSCAR is also much faster than GAP in creating random matrices.
I investigated this a bit further yesterday, and it seems that essentially 100% of the time is spent in `ProdVectorMatrix` for integer matrices constructed as:

x := RandomMat(1000, 1000, Integers);;

Some time is also spent in `ARE_INTOBJS`, which can probably be saved to some extent (since, if I understand correctly, once the accumulated values or the values in the vector/matrix no longer satisfy `IS_INTOBJ`, they never satisfy `IS_INTOBJ` again, and so the checks don't need to be performed), but this only gives a very modest improvement in performance. The remainder of the time is spent in `ProdVectorMatrix` itself, `sum_intobjs`, and `prod_intobjs`.
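(To illustrate the "the check can be hoisted" idea, here is a hypothetical sketch, using plain machine integers as a stand-in for GAP's immediate integers and `__int128` as a stand-in for the bignum path; it is not GAP kernel code. Once the accumulator leaves the fast representation, the loop switches to a second loop with no per-step checks:)

```cpp
#include <cstddef>
#include <cstdint>

// Dot product with the tag/overflow check hoisted out of the steady state:
// as soon as anything leaves 64-bit range we break out of the fast loop and
// finish the row in the slow loop, which never checks again.
__int128 dot(const int64_t* a, const int64_t* b, size_t n) {
  int64_t fast = 0;
  size_t i = 0;
  for (; i < n; ++i) {  // fast path: overflow-checked 64-bit arithmetic
    int64_t p, s;
    if (__builtin_mul_overflow(a[i], b[i], &p) ||
        __builtin_add_overflow(fast, p, &s)) {
      break;  // element i has not been accumulated yet; handled below
    }
    fast = s;
  }
  __int128 acc = fast;
  for (; i < n; ++i) {  // slow path: no more per-step checks
    acc += (__int128)a[i] * b[i];
  }
  return acc;
}
```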
I also tried implementing a version of `ProdMatrixMatrix` along the same lines as in libsemigroups (copying a column of the second argument into a temporary plist, for better cache locality or whatever the correct term is), but this didn't provide any improvement in performance (it was more or less equivalent to the current method). I also tried using `std::inner_product` rather than `ProdVectorVector`, and this too didn't provide any improvement (as might be expected).
I think the takeaway from this is that there isn't any "quick win" for improving the performance here. @fingolfin do you have any idea what OSCAR does, or where one might look to find out?
@james-d-mitchell how did you measure this? Note that `ARE_INTOBJS` should always be inlined and turned into 1-2 CPU instructions; if you see time spent in it, that suggests a potential problem with the measurement method.
EXPORT_INLINE Int ARE_INTOBJS(Obj o1, Obj o2)
{
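    // immediate integers are tagged by setting the least significant bit,
    // so the AND below is nonzero precisely when both operands are immediate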
return (Int)o1 & (Int)o2 & 0x01;
}
Here's the profile produced (just now) with Instruments on my Mac:
This isn't a bug report or a regression (as far as I'm aware), but just something I noticed recently, and mentioned this morning to @fingolfin, @ThomasBreuer, and @wucas. From good to bad:

Very good.

This isn't so good. For some comparison, with `numpy` I get:

This is not a like-for-like comparison because the entries in the numpy matrix `x` are floats, but I suppose this represents some sort of best (fastest) case.

A similar computation with `eigen`:

This takes about 500ms on my computer (somewhat slower than I might have thought), and finally, for reference, with `libsemigroups`:

takes about 750ms. I include the last comparison because the matrix multiplication in `libsemigroups` is not particularly tuned, it's just:

Finally, for matrices over larger finite fields I get:
I haven't looked at the implementations in GAP, but will at some point if someone else doesn't beat me to it.
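(The eigen comparison mentioned above would look roughly like the following; this is a reconstruction assuming the same 1000 x 1000 `double` case as the GAP example, not the exact code behind the quoted timings.)

```cpp
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

// Time a single dense matrix-matrix product with double entries.
int main() {
  Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> y
      = Eigen::MatrixXd::Random(1000, 1000);
  auto start = std::chrono::steady_clock::now();
  Eigen::MatrixXd z = y * y;  // assignment forces evaluation
  auto stop = std::chrono::steady_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                   stop - start).count()
            << "ms (" << z(0, 0) << ")\n";  // print an entry so the product isn't optimized away
  return 0;
}
```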