JuliaSparse / SuiteSparseGraphBLAS.jl

Sparse, General Linear Algebra for Graphs!
MIT License

preallocation for multiplication #108

Closed · mzy2240 closed 10 months ago

mzy2240 commented 1 year ago

For a multiplication of a dense GBMatrix A by a sparse GBMatrix B, is it possible to preallocate the output or operate in place to further minimize the time?

rayegun commented 1 year ago

If you have a GBMatrix C with the right internal storage type, then yes, you can do mul!(C, A, B). If C does not have the right sparsity (i.e., if it is dense but the result is actually sparse), it will reallocate. Otherwise it should be in place.
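
For example, a minimal sketch of that pattern (the sizes and densities here are made up for illustration):

using SuiteSparseGraphBLAS, SparseArrays

A = GBMatrix(rand(100, 50))          # dense operand
B = GBMatrix(sprand(50, 200, 0.05))  # sparse operand
C = A * B                            # first product fixes C's storage type
mul!(C, A, B)                        # later products reuse C while the sparsity matches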

rayegun commented 1 year ago

Note that almost every operation supports a bang (!) version that is semantically in place. Whether it actually reallocates depends on the internals of the matrix and whether the sparsity needs to change.

fcdimitr commented 1 year ago

I observed the following (counterintuitive) behavior:

julia> using SuiteSparseGraphBLAS, SparseArrays, BenchmarkTools

julia> X = GBMatrix( sprand(10, 1000, 0.9) );

julia> # without pre-allocation
       @btime $(X') * $X;
  34.615 ms (9 allocations: 8.58 MiB)

julia> # use the correct type for preallocation
       D = X' * X;

julia> # with preallocation
       @btime mul!( $D, $(X'), $X );
  31.978 ms (4 allocations: 8.58 MiB)

Although the number of allocations is halved, the amount of data allocated is practically the same; it is unclear why 8.58 MiB are still being allocated. The format of D is correct, and it is the same in both cases. Any ideas or workarounds, @Wimmerer?

rayegun commented 1 year ago

@fcdimitr this is something of a peculiarity of GraphBLAS. If you try the final invocation with accum = + as a kwarg, you'll see the allocations disappear, since the library is able to realize that it can work in place. The reason this doesn't work with the default second accumulator is that, in order to be in place, the current implementation requires accum to match the monoid of the semiring being used. I'm uncertain whether this can be relaxed.

@DrTimothyAldenDavis do you have any advice on the comment directly above this one? Masking with D doesn't seem to help. Is this just a missing kernel awaiting JIT?
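
Concretely, the suggested call is a one-liner (a sketch; D is the preallocated result from the benchmark above and must be full):

mul!(D, X', X; accum = +)  # allocations drop once the library can accumulate into full D in place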

DrTimothyAldenDavis commented 1 year ago

I would have to see the underlying calls to GraphBLAS.

I can only do things in place if the output is full and the accum operator is used (as in C += A*B where C is full). But D = X'*X where X is sparse should normally lead to a sparse matrix D, not a full one.

rayegun commented 1 year ago

Oh duh @fcdimitr, this is a full matrix on output; I thought it was sparse. @DrTimothyAldenDavis despite it being full, there are still allocations, as expected. I guess we'll have to wait on the JIT, though, to allow second as the accumulator? The invocations are all dot2, bitmap to full.

DrTimothyAldenDavis commented 1 year ago

It's full by chance. I have to figure out that it's full, so I have to compute the pattern first. Then I realize it's full and then I convert the result to full.

DrTimothyAldenDavis commented 1 year ago

I have to do this even with a JIT.

If instead you want to force the output to be full, you can do this:

C = full matrix of the right size, all zero. C += X'*X

then in principle, the 'dot2' method could exploit the accumulator. But I don't think I have a kernel that exploits this case yet.
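
In Julia terms, that suggestion is roughly the following sketch (using the X from the benchmark above):

C = GBMatrix(zeros(size(X, 2), size(X, 2)))  # full output of the right size, all zero
mul!(C, X', X; accum = +)                    # C += X'X; C stays full, so no reallocation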

fcdimitr commented 1 year ago

Thank you both, @DrTimothyAldenDavis and @Wimmerer! Your suggestions successfully address the issue of unnecessary space allocation in the dense output case.

However, when dealing with sparser matrices and expecting a sparse output, I noticed that a significant amount of allocation is still happening despite my efforts to enforce the desired output format, order, and sparsity pattern.

Below is a new benchmark script:

julia> using SuiteSparseGraphBLAS, SparseArrays, BenchmarkTools, Random

julia> Random.seed!(0); # reproducibility

julia> X = GBMatrix( sprand(10, 1000, 0.9) );

julia> # without pre-allocation
       @btime $(X') * $X;
  31.497 ms (9 allocations: 8.58 MiB)

julia> # use the correct type for preallocation
       D = X' * X;

julia> # with preallocation
       @btime mul!( $D, $(X'), $X );
  32.156 ms (4 allocations: 8.58 MiB)

julia> # enforce accum to be + to avoid allocations
       @btime mul!( $D, $(X'), $X; accum = + ) setup=( D=GBMatrix(zeros(size(X,2), size(X,2))) );
  29.042 ms (6 allocations: 272 bytes)

julia> ## Let's check the same with sparser matrices (and output)
       X = GBMatrix( sprand(10, 1000, 0.1) );

julia> # without pre-allocation
       @btime $(X') * $X;
  923.561 μs (17 allocations: 1.50 MiB)

julia> # use the correct type for preallocation
       D = X' * X;

julia> # with preallocation
       @btime mul!( $D, $(X'), $X );
  911.409 μs (12 allocations: 1.50 MiB)

julia> # empty out D, but keep the correct sparsity/format 
       # (maybe there is a better way to do this?)
       ptr,idx,nzval,repack! = SuiteSparseGraphBLAS.tempunpack!(D);

julia> nzval .= 0;

julia> repack!();

julia> # enforce accum to be + to avoid allocations
       @btime mul!( Dtmp, $(X'), $X; accum = + ) setup=( Dtmp=copy(D) );
  3.278 ms (25 allocations: 3.02 MiB)

julia> # enforce accum to be + to avoid allocations using mask
       M = SuiteSparseGraphBLAS.Structural(D);

julia> @btime mul!( Dtmp, $(X'), $X; accum = +, mask = $M ) setup=( Dtmp=copy(D) );
  5.781 ms (15 allocations: 1.50 MiB)

I suspect this additional allocation occurs because the library needs to determine the pattern of X'*X and verify whether it matches the preallocated output's pattern. Is there a way to inform the GraphBLAS library that the preallocated output's pattern matches the expected result, so that the multiplication can be performed in place even when the output is sparse? I tried using a mask with the same sparsity as the output, but it did not remove the allocations.

Thank you very much!

PS: Tested under Julia 1.9.0. The versions of the packages used:

  [6e4b80f9] BenchmarkTools v1.3.2
  [c2e53296] SuiteSparseGraphBLAS v0.9.0
  [9a3f8284] Random
  [2f01184e] SparseArrays
rayegun commented 1 year ago

That's a smart way to set everything to 0 without making them iso! Good find.

Unfortunately I don't think this kernel exists yet. With the new JIT it will be somewhat easier to add but it's a rare case.

I'll try to keep this use case in mind and look for solutions on both the Julia side and the SSGrB side (for instance, I'm working on a way to use Finch.jl to fill in missing kernels, so we get Tim's ultra-fast hand-tuned code alongside (still fast) autogenerated code for less common kernels).

Relatedly, MKLSparse has inspector-executor routines which may help (and I plan to support those too).

Tim's code is pretty much always faster (often by 5x or more), but if you're allocation-sensitive then the MKL routines might be useful. I can add a package extension to MKLSparse to make passing a GBMatrix easier!

DrTimothyAldenDavis commented 1 year ago

I've tried MKL's sparse routines, and GraphBLAS is often much faster. But perhaps the preallocation behavior would differ.

If there is an expected sparsity pattern of the result, it might be possible to use the pattern of the result itself as its own mask. This would be C<C,structural> += whatever. The accumulator also tells me that no entry in the output can be deleted.

I exploit this kind of thing for GrB_assign, but not for GrB_mxm where this use case would be rare. I would need to write a new kernel to exploit this particular case, and to reduce the number of reallocations. This is something I could consider in the future. We would need to categorize all the use cases in Julia where this sort of thing occurs, map them to GraphBLAS, and then make sure my kernels exploit those particular use cases.
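
In the Julia wrapper, that call shape corresponds to the masked invocation fcdimitr already tried above; a sketch (today this still allocates, since the kernel doesn't exist yet):

M = SuiteSparseGraphBLAS.Structural(C)  # C's own pattern as a structural mask
mul!(C, A, B; accum = +, mask = M)      # C<C,structural> += A*B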

rayegun commented 1 year ago

Yeah, I would only recommend mkl_sparse_sp2m in the case where you know exactly the memory required ahead of time and will reuse that memory often. Indeed, I would recommend unpacking the GBMatrix, running just that function with MKL, and then repacking into GraphBLAS for all the other fast operations.

Removing allocations by specifying mask = Structural(output) (possibly with accum = second) is the most common request I have seen, Tim.

rayegun commented 10 months ago

Preallocation isn't currently planned for this library. It may be possible in the future, but the spec doesn't really make room for preallocation.