333 current parallel spgemm cannot be merged

anyzelman commented 6 months ago

Novel implementation of parallel SpGEMM. Running smoke and unit tests now, with LPF.

anyzelman commented 6 months ago

If tests OK, remaining TODOs:

[x] factor out analytic model into its own function call (instead of repeating similar logic several times)

anyzelman commented 3 months ago

Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.

Actually, two more TODOs:

[x] manually check performance for large mxms
[x] enhance unit test to check for in-place semantics (I suspect this might not yet pass)

anyzelman commented 3 months ago

Building merge notes in flight:

This MR provides a re-implemented shared-memory parallel SpGEMM. It uses a parallel variant of Gustavson's approach. The number of threads is capped by the global buffer size; the SpGEMM does not perform dynamic memory allocations. A basic analytic model may furthermore reduce the number of threads based on the (expected) amount of work.

Additional new features contained within this MR:

grb::select now implements a basic analytic model that may reduce the number of threads below the maximum, if the amount of work is light.
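As a rough illustration of the kind of analytic model mentioned above, the sketch below first caps the thread count by the number of thread-private buffers that fit, and then lowers it further if the work is light. The function name, its parameters, and the per-thread work threshold are all hypothetical; none of them are taken from the ALP code base.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: none of these names are taken from ALP.
//   max_threads:         threads offered by the runtime (e.g., OpenMP)
//   buffers_available:   how many thread-private work buffers fit in memory
//   work:                expected amount of work (e.g., number of nonzeroes)
//   min_work_per_thread: assumed tuning constant below which an additional
//                        thread is not expected to pay off
inline std::size_t choose_num_threads(
	const std::size_t max_threads,
	const std::size_t buffers_available,
	const std::size_t work,
	const std::size_t min_work_per_thread
) {
	assert( max_threads > 0 );
	assert( min_work_per_thread > 0 );
	// never exceed the number of available threads or buffers
	std::size_t ret = std::min( max_threads, buffers_available );
	// reduce further if the amount of work is light
	ret = std::min( ret, work / min_work_per_thread );
	// always use at least one thread
	return std::max< std::size_t >( ret, 1 );
}
```

With 88 available threads, ample buffers, 300 units of work and an assumed threshold of 100 units per thread, this sketch would select 3 threads.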

This MR also fixes the following:

grb::set (matrix to matrix) could select more than the number of available threads for execution, herewith fixed;
the prefix-sum utility had the same issue, also herewith fixed;
the mxm unit test could detect some errors that did not result in an error code being returned, herewith fixed. (No known issues existed with the previous mxm implementation, hence this was a metabug.)

Debug tracing across the SPA code (coordinates.hpp) and the reference BLAS-3 implementation has improved. The unit test for grb::id has been shortened to speed up the CI for some backends. As always, furthermore, this MR includes code style fixes.

anyzelman commented 3 months ago

Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.

Actually, two more TODOs:
* [ ]  manually check performance for large mxms

* [ ]  enhance unit test to check for in-place semantics (I suspect this might not yet pass)

At this point (prior to the TODOs), all 2110 unit tests pass.

anyzelman commented 3 months ago

Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.

Actually, two more TODOs:
* [x]  manually check performance for large mxms

* [ ]  enhance unit test to check for in-place semantics (I suspect this might not yet pass)

At this point (with only the last TODO above remaining), all unit tests and manual performance tests are OK.

anyzelman commented 1 week ago

One data race is confirmed fixed. Now another has appeared-- example 1:

$ tests/unit/mxm_ndebug_reference_omp 
This is functional test tests/unit/mxm_ndebug_reference_omp
Info: grb::init (reference_omp) called. OpenMP is set to utilise 88 threads.
Info: grb::init (reference) called.
    Verifying the semiring version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying the operator-monoid version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using semirings)
        in this test, the output nonzero structure is unchanged
        also in this test, we skip RESIZE as we know a priori the capacity is sufficient
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using monoid-op)
        in this test, the output nonzero structure changes
     mxm_generic will use 1 threads
     mxm_generic will use 3 threads
     expected no entry at position ( 79, 0 ), but got one with value 4
     expected no entry at position ( 80, 0 ), but got one with value 4
     expected no entry at position ( 81, 0 ), but got one with value 4
     expected no entry at position ( 82, 0 ), but got one with value 4
     expected no entry at position ( 83, 0 ), but got one with value 4
     expected no entry at position ( 84, 0 ), but got one with value 4
     expected no entry at position ( 85, 0 ), but got one with value 4
     expected no entry at position ( 86, 0 ), but got one with value 4
     expected no entry at position ( 87, 0 ), but got one with value 4
     expected no entry at position ( 88, 0 ), but got one with value 4
     expected no entry at position ( 89, 0 ), but got one with value 4
     expected no entry at position ( 90, 0 ), but got one with value 4
     expected no entry at position ( 91, 0 ), but got one with value 4
     expected no entry at position ( 92, 0 ), but got one with value 4
     expected no entry at position ( 93, 0 ), but got one with value 4
     expected no entry at position ( 94, 0 ), but got one with value 4
     expected no entry at position ( 95, 0 ), but got one with value 4
     expected no entry at position ( 96, 0 ), but got one with value 4
Test IV did not pass verification
Info: grb::finalize (reference_omp) called.
Info: grb::finalize (reference) called.
Test FAILED (A GraphBLAS algorithm has failed to achieve its intended result (e.g., has not converged))

Example 2:

This is functional test tests/unit/mxm_ndebug_reference_omp
Info: grb::init (reference_omp) called. OpenMP is set to utilise 88 threads.
Info: grb::init (reference) called.
    Verifying the semiring version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying the operator-monoid version of mxm
     mxm_generic will use 1 threads
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using semirings)
        in this test, the output nonzero structure is unchanged
        also in this test, we skip RESIZE as we know a priori the capacity is sufficient
     mxm_generic will use 1 threads
    Verifying in-place behaviour of mxm (using monoid-op)
        in this test, the output nonzero structure changes
     mxm_generic will use 1 threads
     mxm_generic will use 3 threads
     expected no entry at position ( 77, 0 ), but got one with value 4
     expected no entry at position ( 78, 0 ), but got one with value 4
     expected no entry at position ( 79, 0 ), but got one with value 4
     expected no entry at position ( 80, 0 ), but got one with value 4
     expected no entry at position ( 81, 0 ), but got one with value 4
     expected no entry at position ( 82, 0 ), but got one with value 4
     expected no entry at position ( 83, 0 ), but got one with value 4
     expected no entry at position ( 84, 0 ), but got one with value 4
     expected no entry at position ( 85, 0 ), but got one with value 4
     expected no entry at position ( 86, 0 ), but got one with value 4
     expected no entry at position ( 87, 0 ), but got one with value 4
     expected no entry at position ( 88, 0 ), but got one with value 4
     expected no entry at position ( 89, 0 ), but got one with value 4
     expected no entry at position ( 90, 0 ), but got one with value 4
     expected no entry at position ( 91, 0 ), but got one with value 4
     expected no entry at position ( 92, 0 ), but got one with value 4
     expected no entry at position ( 93, 0 ), but got one with value 4
     expected no entry at position ( 94, 0 ), but got one with value 4
     expected no entry at position ( 95, 0 ), but got one with value 4
     expected no entry at position ( 96, 0 ), but got one with value 4
Test IV did not pass verification
Info: grb::finalize (reference_omp) called.
Info: grb::finalize (reference) called.
Test FAILED (A GraphBLAS algorithm has failed to achieve its intended result (e.g., has not converged))

Roughly half of the time the test passes OK. The difference of 2 (start row 77 vs. start row 79) and the fact that there should be nonzeroes on every row suggest some sort of race condition when writing out column indices.

TODOs:

[x] analyse the fix to the old bug, and ensure its correctness
[x] find the issue that causes this "new" data race
[x] clean up code (after the race condition is also fixed)

After all that, there's one more issue:

[x] in the unordered memmove case 2, implement a variant that uses any available buffers to expose more parallelism in the OP;
[x] include polyalgorithm dispatch logic within the unordered_memmove dispatcher via a simple analytic model that automatically chooses when to use the buffered variant for case 2, and when to switch to the current unbuffered variant;
[x] supply the SPA buffer(s) to the unordered_memmove within mxm_generic, thus enabling this optimisation.

anyzelman commented 1 week ago

One more optimisation idea and one more bug remaining... Putting the former one as a TODO so it's not forgotten:

[x] I suspect it is possible to merge the shift-to-right and the third phase of the prefix sum. If correct, this would save one barrier.
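One way such a merge could look is sketched below. This is an interpretation, not the ALP utility: it assumes a blocked three-phase prefix sum over an array with one spare slot at the end, and it writes the shifted result directly during the third phase. The blocks are processed by plain sequential loops here; in a parallel version each block would belong to one thread, with barriers where indicated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the ALP utility. On input, x[ 0 ] .. x[ n - 1 ]
// hold counts and x[ n ] is a spare slot. On output, x[ i ] holds the sum of
// the first i counts, i.e., the inclusive prefix sum shifted one position to
// the right, with x[ 0 ] = 0 and x[ n ] the grand total.
inline void prefix_sum_shift_right(
	std::vector< std::size_t > &x, const std::size_t num_blocks
) {
	assert( !x.empty() );
	assert( num_blocks > 0 );
	const std::size_t n = x.size() - 1;
	const std::size_t block_size = ( n + num_blocks - 1 ) / num_blocks;
	std::vector< std::size_t > offset( num_blocks + 1, 0 );

	// phase 1: every block sums its own counts (blocks are independent, so
	// each iteration of the outer loop could run on a different thread)
	for( std::size_t b = 0; b < num_blocks; ++b ) {
		const std::size_t start = std::min( b * block_size, n );
		const std::size_t end = std::min( start + block_size, n );
		for( std::size_t i = start; i < end; ++i ) {
			offset[ b + 1 ] += x[ i ];
		}
	}

	// (a parallel version needs a barrier here)

	// phase 2: sequential scan over the block totals
	for( std::size_t b = 0; b < num_blocks; ++b ) {
		offset[ b + 1 ] += offset[ b ];
	}
	x[ n ] = offset[ num_blocks ];

	// (a parallel version needs a barrier here)

	// phase 3, fused with the shift: each block writes the running sum
	// *before* adding its own count, so no separate shift pass (and hence no
	// additional barrier) is required
	for( std::size_t b = 0; b < num_blocks; ++b ) {
		const std::size_t start = std::min( b * block_size, n );
		const std::size_t end = std::min( start + block_size, n );
		std::size_t running = offset[ b ];
		for( std::size_t i = start; i < end; ++i ) {
			const std::size_t count = x[ i ];
			x[ i ] = running;
			running += count;
		}
	}
}
```

Applied to per-row nonzero counts, the output has the shape of a CRS row-offset array.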

anyzelman commented 1 week ago

Everything should be done now. Running all smoke and unit tests to confirm all OK, and if so, will start clean-up and merge (finally!)

anyzelman commented 1 week ago

Concept release notes:

This MR provides a novel implementation for the sparse matrix--sparse matrix multiplication, grb::mxm. It modifies the specifications in the following ways:

the mxm is now in-place, meaning, it computes C += AB. This was intended a long time ago, but somehow the mxm was never adapted until now. If the old behaviour is desired, first clear C (grb::clear) before calling mxm.
the grb::resize now retains old nonzeroes, and (therefore) calls to resize must request a capacity that is larger than (or equal to) the current number of values in the container, or otherwise grb::ILLEGAL shall be returned.

The newly implemented mxm uses as many threads as internal buffers allow. It still follows Gustavson's approach, where now every thread employs its own private sparse accumulator (SPA)-- which is why the size of the global buffer determines the parallelism of the approach. At least one SPA is guaranteed as the first thread will use the output matrix buffers for its SPA-- threads beyond that first one will use any remaining space in the global buffer. Note that the global buffer size is Theta(m+n+nz) for any matrix in the ALP context, while the SPA memory requirements are proportional to n (or m, if transposed)-- therefore, if nz ~ cn, then this new implementation allows the mxm to proceed with Theta(c) threads.
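The idea of row-wise Gustavson with one private SPA per thread, combined with the in-place C += AB semantics, can be sketched as follows. This is a simplified illustration and not the ALP implementation: the CRS and SPA types are hypothetical, the semiring is hard-coded to plus-times over doubles, threads are simulated by a sequential loop, and the sketch allocates memory freely whereas the MR states that the real mxm performs no dynamic allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are not ALP data structures.
struct CRS {
	std::size_t m, n;                 // number of rows and columns
	std::vector< std::size_t > start; // row offsets, size m + 1
	std::vector< std::size_t > col;   // column index of each nonzero
	std::vector< double > val;        // value of each nonzero
};

// Sparse accumulator: a dense value array plus a list of assigned positions.
struct SPA {
	std::vector< double > value;
	std::vector< bool > assigned;
	std::vector< std::size_t > stack;
	explicit SPA( const std::size_t n ) : value( n ), assigned( n, false ) {}
	void add( const std::size_t j, const double v ) {
		if( assigned[ j ] ) {
			value[ j ] += v;
		} else {
			assigned[ j ] = true;
			value[ j ] = v;
			stack.push_back( j );
		}
	}
};

// Computes C += AB, simulating num_threads threads sequentially.
inline void mxm_accumulate(
	CRS &C, const CRS &A, const CRS &B, const std::size_t num_threads
) {
	assert( A.n == B.m && C.m == A.m && C.n == B.n );
	assert( num_threads > 0 );
	// one private SPA of size n per thread: this memory limits the parallelism
	std::vector< SPA > spas( num_threads, SPA( C.n ) );
	std::vector< std::vector< std::pair< std::size_t, double > > > rows( C.m );
	for( std::size_t t = 0; t < num_threads; ++t ) {
		SPA &spa = spas[ t ];
		// thread t handles rows t, t + num_threads, ...
		for( std::size_t i = t; i < C.m; i += num_threads ) {
			// in-place semantics: seed the SPA with the existing row of C
			for( std::size_t k = C.start[ i ]; k < C.start[ i + 1 ]; ++k ) {
				spa.add( C.col[ k ], C.val[ k ] );
			}
			// Gustavson: scale rows of B by the nonzeroes of row i of A
			for( std::size_t k = A.start[ i ]; k < A.start[ i + 1 ]; ++k ) {
				const std::size_t r = A.col[ k ];
				for( std::size_t l = B.start[ r ]; l < B.start[ r + 1 ]; ++l ) {
					spa.add( B.col[ l ], A.val[ k ] * B.val[ l ] );
				}
			}
			// flush the SPA into the output row and reset it for reuse
			for( const std::size_t j : spa.stack ) {
				rows[ i ].emplace_back( j, spa.value[ j ] );
				spa.assigned[ j ] = false;
			}
			spa.stack.clear();
		}
	}
	// assemble the new C; column indices within a row are left unsorted
	C.start.assign( C.m + 1, 0 );
	C.col.clear();
	C.val.clear();
	for( std::size_t i = 0; i < C.m; ++i ) {
		for( const auto &entry : rows[ i ] ) {
			C.col.push_back( entry.first );
			C.val.push_back( entry.second );
		}
		C.start[ i + 1 ] = C.col.size();
	}
}

// Returns the value stored at ( i, j ), or zero if there is no such entry.
inline double value_at( const CRS &M, const std::size_t i, const std::size_t j ) {
	for( std::size_t k = M.start[ i ]; k < M.start[ i + 1 ]; ++k ) {
		if( M.col[ k ] == j ) { return M.val[ k ]; }
	}
	return 0.0;
}
```

Calling the sketch with an empty C yields plain C = AB, which mirrors the advice above to clear C first if the old behaviour is desired.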

Regarding the resize, recall that the old behaviour was to clear the given container. The rationale for changing this behaviour is that it became apparent that the old semantics make the RESIZE phase of the mxm awkward both to use and to implement.

This awkwardness was never observed for level-{1,2} primitives because all current implementations use a fixed full capacity for vector data.

Some related clarifications were added to the specification of the resize: whether a backend allows for shrinking memory usage after a call to resize is defined by its performance semantics. The functional semantics leave this behaviour optional (and current implementations never shrink capacities).
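The updated resize semantics can be modelled with a toy container, as sketched below. The class, its methods, and the RC enum are hypothetical stand-ins and not ALP types. The sketch only mirrors three properties from the notes above: existing values are retained, a request below the current number of values is refused with an ILLEGAL code, and the capacity never shrinks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ALP's return codes
enum class RC { SUCCESS, ILLEGAL };

// Hypothetical toy container, not an ALP type
class ToyContainer {
	std::vector< double > values_;
	std::size_t capacity_ = 0;

public:
	std::size_t nnz() const { return values_.size(); }
	std::size_t capacity() const { return capacity_; }

	// stores a value if the capacity allows it; returns whether it was stored
	bool push( const double v ) {
		if( values_.size() == capacity_ ) { return false; }
		values_.push_back( v );
		return true;
	}

	// removes all values but keeps the capacity
	void clear() { values_.clear(); }

	RC resize( const std::size_t new_capacity ) {
		// requests below the current number of values are refused, and the
		// container is left untouched
		if( new_capacity < values_.size() ) { return RC::ILLEGAL; }
		// existing values are retained; like the current implementations
		// described above, this toy never shrinks its capacity
		if( new_capacity > capacity_ ) {
			values_.reserve( new_capacity );
			capacity_ = new_capacity;
		}
		return RC::SUCCESS;
	}
};
```

In this model, a caller who wants the old clearing behaviour would call clear() before resize().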

This MR also includes the following bugfixes:

[functional bug] the RESIZE phase for the masked grb::set for matrices sometimes did not resize to a large-enough capacity
[functional bug] in the binsearch utility. Expected impact is nil as the function presently is not used from within ALP
[performance bug] the dense descriptor in grb::set(vector,vector,vector) could erroneously be ignored
[performance bug] the analytic model in the grb::set for matrices could erroneously select way too many threads, lowering performance significantly
[performance bug] same as the directly preceding item, but for the OpenMP prefix sum utility
[performance bug] same as the preceding items, but for grb::select
[performance bug] the fold-matrix-to-scalar unit test used copy-by-value instead of const-reference capture. This has no effect on user code.

This MR also includes the following new features:

[utils] a utility is provided that determines the maximum argument for which a monotonically decreasing function on a given range returns its maximum
[utils] an OpenMP prefix-sum variant that also shifts all entries one position to the right has been provided
[utils] an unordered-memmove has been implemented, with both sequential and OpenMP parallel variants (used by the new in-place mxm)
[debugging] more targeted debug flags (e.g., _DEBUG_UTILS_PREFIXSUM to debug only that particular utility, or _DEBUG_REFERENCE_BLAS3 to debug only level-3 primitives in the reference backends). Defining _DEBUG will still enable all debug outputs
[native] a non_owning_view descriptor has been introduced as a requirement for an ALP native interface. This descriptor indicates that containers are used whose associated memory is not managed by ALP.
[internal] The coordinates class now supports calling a parallel set or a sequential one (for use as thread-local SPAs)
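For the first utility in the list above, a minimal sketch of what such a search could look like is given below. The function name and signature are hypothetical and not the ALP interface. It assumes the function is non-increasing on the inclusive range [lo, hi], so that its maximum is attained at lo, and it uses a binary search to find the largest argument that still attains that value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch, not the ALP utility. Assumes f is non-increasing on
// the inclusive range [ lo, hi ], so that its maximum is f( lo ). Returns the
// largest x in [ lo, hi ] with f( x ) == f( lo ).
template< typename Func >
std::size_t max_arg_of_max( const Func &f, std::size_t lo, std::size_t hi ) {
	assert( lo <= hi );
	const auto target = f( lo );
	// invariant: f( lo ) == target, and every argument beyond hi is known to
	// yield a smaller value
	while( lo < hi ) {
		// round up so that mid > lo, which guarantees progress
		const std::size_t mid = lo + ( hi - lo + 1 ) / 2;
		if( f( mid ) == target ) {
			lo = mid;
		} else {
			hi = mid - 1;
		}
	}
	return lo;
}
```

The search evaluates f a logarithmic number of times in the size of the range.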

Tests have been modified as follows:

[unit] the SpMSpM multiplication unit test now also tests in-place behaviour
[unit] the SpMSpM multiplication unit test now also tests the force_row_major descriptor, which now appears as a separate test in the test suite
[unit] the capacity unit test has been brought in-line with the updated specifications of grb::resize
[unit] the id test now uses 100 tries instead of 1000 for CI performance reasons
[unit] the parallelRegularIterators test no longer implicitly and unintentionally casts a double to int
[performance] the SpMSpM multiplication driver can now also benchmark CRS only

As always, this MR also includes some code style fixes and provides some missing code documentation.

anyzelman commented 1 week ago

Review OK, waiting for test results before merge

anyzelman commented 1 week ago

2400 tests OK, including some of the performance tests. All smoke and unit tests (with LPF) OK. Manually tested OK also the mxm-related perftests. Will merge.

Algebraic-Programming / ALP

333 current parallel spgemm cannot be merged #334