Algebraic-Programming / ALP

Home of ALP/GraphBLAS and ALP/Pregel, featuring shared- and distributed-memory auto-parallelisation of linear algebraic and vertex-centric programs. Soon with more to come!
Apache License 2.0
25 stars 4 forks source link

333 current parallel spgemm cannot be merged #334

Closed anyzelman closed 1 week ago

anyzelman commented 6 months ago

Novel implementation of parallel SpGEMM. Running smoke and unit tests now, with LPF.

anyzelman commented 6 months ago

If tests OK, remaining TODOs:

anyzelman commented 3 months ago

Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.

Actually, two more TODOs:

anyzelman commented 3 months ago

Building merge notes in flight:

This MR provides a re-implemented shared-memory parallel SpGEMM. It uses a parallel Gustavson's approach. The number of threads is capped by the global buffer size; the SpGEMM does not perform dynamic memory allocations. A basic analytic model furthermore tunes the number of threads potentially further downwards based on (expected) work.

Additional new features contained within this MR:

This MR also fixes the following:

Debug tracing across the SPA code (coordinates.hpp) and the reference BLAS-3 implementation has improved. The unit test for grb::id has been shortened to speed up the CI for some backends. As always, furthermore, this MR includes code style fixes.

anyzelman commented 3 months ago

Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.

Actually, two more TODOs:

* [ ]  manually check performance for large mxms

* [ ]  enhance unit test to check for in-place semantics (I suspect this might not yet pass)

at this point (prior the TODOs), all 2110 unit tests pass

anyzelman commented 3 months ago

Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.

Actually, two more TODOs:

* [x]  manually check performance for large mxms

* [ ]  enhance unit test to check for in-place semantics (I suspect this might not yet pass)

at this point (the last above TODO remaining)-- all unit tests OK and manual performance tests OK)

anyzelman commented 1 week ago

One data race is confirmed fixed. Now another has appeared-- example 1:

$ tests/unit/mxm_ndebug_reference_omp 
This is functional test tests/unit/mxm_ndebug_reference_omp
Info: grb::init (reference_omp) called. OpenMP is set to utilise 88 threads.
Info: grb::init (reference) called.
    Verifying the semiring version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying the operator-monoid version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using semirings)
        in this test, the output nonzero structure is unchanged
        also in this test, we skip RESIZE as we know a priori the capacity is sufficient
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using monoid-op)
        in this test, the output nonzero structure changes
     mxm_generic will use 1 threads
     mxm_generic will use 3 threads
     expected no entry at position ( 79, 0 ), but got one with value 4
     expected no entry at position ( 80, 0 ), but got one with value 4
     expected no entry at position ( 81, 0 ), but got one with value 4
     expected no entry at position ( 82, 0 ), but got one with value 4
     expected no entry at position ( 83, 0 ), but got one with value 4
     expected no entry at position ( 84, 0 ), but got one with value 4
     expected no entry at position ( 85, 0 ), but got one with value 4
     expected no entry at position ( 86, 0 ), but got one with value 4
     expected no entry at position ( 87, 0 ), but got one with value 4
     expected no entry at position ( 88, 0 ), but got one with value 4
     expected no entry at position ( 89, 0 ), but got one with value 4
     expected no entry at position ( 90, 0 ), but got one with value 4
     expected no entry at position ( 91, 0 ), but got one with value 4
     expected no entry at position ( 92, 0 ), but got one with value 4
     expected no entry at position ( 93, 0 ), but got one with value 4
     expected no entry at position ( 94, 0 ), but got one with value 4
     expected no entry at position ( 95, 0 ), but got one with value 4
     expected no entry at position ( 96, 0 ), but got one with value 4
Test IV did not pass verification
Info: grb::finalize (reference_omp) called.
Info: grb::finalize (reference) called.
Test FAILED (A GraphBLAS algorithm has failed to achieve its intended result (e.g., has not converged))

Example 2:

This is functional test tests/unit/mxm_ndebug_reference_omp
Info: grb::init (reference_omp) called. OpenMP is set to utilise 88 threads.
Info: grb::init (reference) called.
    Verifying the semiring version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying the operator-monoid version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using semirings)
        in this test, the output nonzero structure is unchanged
        also in this test, we skip RESIZE as we know a priori the capacity is sufficient
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using monoid-op)
        in this test, the output nonzero structure changes
     mxm_generic will use 1 threads
     mxm_generic will use 3 threads
     expected no entry at position ( 77, 0 ), but got one with value 4
     expected no entry at position ( 78, 0 ), but got one with value 4
     expected no entry at position ( 79, 0 ), but got one with value 4
     expected no entry at position ( 80, 0 ), but got one with value 4
     expected no entry at position ( 81, 0 ), but got one with value 4
     expected no entry at position ( 82, 0 ), but got one with value 4
     expected no entry at position ( 83, 0 ), but got one with value 4
     expected no entry at position ( 84, 0 ), but got one with value 4
     expected no entry at position ( 85, 0 ), but got one with value 4
     expected no entry at position ( 86, 0 ), but got one with value 4
     expected no entry at position ( 87, 0 ), but got one with value 4
     expected no entry at position ( 88, 0 ), but got one with value 4
     expected no entry at position ( 89, 0 ), but got one with value 4
     expected no entry at position ( 90, 0 ), but got one with value 4
     expected no entry at position ( 91, 0 ), but got one with value 4
     expected no entry at position ( 92, 0 ), but got one with value 4
     expected no entry at position ( 93, 0 ), but got one with value 4
     expected no entry at position ( 94, 0 ), but got one with value 4
     expected no entry at position ( 95, 0 ), but got one with value 4
     expected no entry at position ( 96, 0 ), but got one with value 4
Test IV did not pass verification
Info: grb::finalize (reference_omp) called.
Info: grb::finalize (reference) called.
Test FAILED (A GraphBLAS algorithm has failed to achieve its intended result (e.g., has not converged))

Roughly half of the times the test passes OK. The difference of 2 (start row 77 vs. start row 79) and the fact that there should be nonzeroes on every row suggests some sort of racing condition writing out column indices.

TODOs:

After all that, there's one more issue:

anyzelman commented 1 week ago

One more optimisation idea and one more bug remaining... Putting the former one as a TODO so it's not forgotten:

anyzelman commented 1 week ago

Everything should be done now. Running all smoke and unit tests to confirm all OK, and if so, will start clean-up and merge (finally!)

anyzelman commented 1 week ago

Concept release notes:

This MR provides a novel implementation for the sparse matrix--sparse matrix multiplication, grb::mxm. It modifies the specifications in the following ways:

  1. the mxm is now in-place, meaning, it computes C += AB. This was intended a long time ago, but somehow the mxm was never adapted until now. If the old behaviour is desired, first clear C (grb::clear) before calling mxm.

  2. the grb::resize now retains old nonzeroes, and (therefore) calls to resize must request a capacity that is larger (or equal) to the current number of values in the container, or otherwise grb::ILLEGAL shall be returned.

The newly implemented mxm uses as many threads as internal buffers allow. It still follows Gustavson's approach, where now every thread employs its own private sparse accumulator (SPA)-- which is why the size of the global buffer determines the parallelism of the approach. At least one SPA is guaranteed as the first thread will use the output matrix buffers for its SPA-- threads beyond that first one will use any remaining space in the global buffer. Note that the global buffer size is Theta(m+n+nz) for any matrix in the ALP context, while the SPA memory requirements are proportional to n (or m, if transposed)-- therefore, if nz ~ cn, then this new implementation allows the mxm to proceed with Theta(c) threads.

Regarding the resize, recall that the old behaviour was to clear the given container. The rationale for change this behaviour is that it became apparent these semantics make it both awkward to use as well as implement the RESIZE phase of the mxm.

This awkwardness was never observed for level-{1,2} primitives because all current implementations use a fixed full capacity for vector data.

Some related specification clarifications were added to that of the resize: whether backends allow for shrinking memory usage after a call to resize is defined by its performance semantics. The functional semantics leave this behaviour as optional (and current implementations do not ever shrink capacities).

This MR also includes the following bugfixes:

This MR also includes the following new features:

Tests have been modified as follows:

As always, this MR also includes some code style fixes and provides some missing code documentations.

anyzelman commented 1 week ago

Review OK, waiting test results before merge

anyzelman commented 1 week ago

2400 tests OK, including some of the performance tests. All smoke and unit tests (with LPF) OK. Manually tested OK also the mxm-related perftests. Will merge.