Closed anyzelman closed 1 week ago
If tests OK, remaining TODOs:
Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.
Actually, two more TODOs:
Building merge notes in flight:
This MR provides a re-implemented shared-memory parallel SpGEMM. It uses a parallel Gustavson's approach. The number of threads is capped by the global buffer size; the SpGEMM does not perform dynamic memory allocations. A basic analytic model furthermore tunes the number of threads potentially further downwards based on (expected) work.
Additional new features contained within this MR:
This MR also fixes the following:
Debug tracing across the SPA code (coordinates.hpp) and the reference BLAS-3 implementation has improved. The unit test for grb::id
has been shortened to speed up the CI for some backends. As always, furthermore, this MR includes code style fixes.
Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.
Actually, two more TODOs:
* [ ] manually check performance for large mxms * [ ] enhance unit test to check for in-place semantics (I suspect this might not yet pass)
at this point (prior the TODOs), all 2110 unit tests pass
Sorry for the delay. This one is now done and at the head of the merge queue re internal demands. Running full unit test suite-- if pass, will start review phase.
Actually, two more TODOs:
* [x] manually check performance for large mxms * [ ] enhance unit test to check for in-place semantics (I suspect this might not yet pass)
at this point (the last above TODO remaining)-- all unit tests OK and manual performance tests OK)
One data race is confirmed fixed. Now another has appeared-- example 1:
$ tests/unit/mxm_ndebug_reference_omp
This is functional test tests/unit/mxm_ndebug_reference_omp
Info: grb::init (reference_omp) called. OpenMP is set to utilise 88 threads.
Info: grb::init (reference) called.
Verifying the semiring version of mxm
mxm_generic will use 1 threads
mxm_generic will use 1 threads
Verifying the operator-monoid version of mxm
mxm_generic will use 1 threads
mxm_generic will use 1 threads
Verifying in-place behaviour of mxm (using semirings)
in this test, the output nonzero structure is unchanged
also in this test, we skip RESIZE as we know a priori the capacity is sufficient
mxm_generic will use 1 threads
Verifying in-place behaviour of mxm (using monoid-op)
in this test, the output nonzero structure changes
mxm_generic will use 1 threads
mxm_generic will use 3 threads
expected no entry at position ( 79, 0 ), but got one with value 4
expected no entry at position ( 80, 0 ), but got one with value 4
expected no entry at position ( 81, 0 ), but got one with value 4
expected no entry at position ( 82, 0 ), but got one with value 4
expected no entry at position ( 83, 0 ), but got one with value 4
expected no entry at position ( 84, 0 ), but got one with value 4
expected no entry at position ( 85, 0 ), but got one with value 4
expected no entry at position ( 86, 0 ), but got one with value 4
expected no entry at position ( 87, 0 ), but got one with value 4
expected no entry at position ( 88, 0 ), but got one with value 4
expected no entry at position ( 89, 0 ), but got one with value 4
expected no entry at position ( 90, 0 ), but got one with value 4
expected no entry at position ( 91, 0 ), but got one with value 4
expected no entry at position ( 92, 0 ), but got one with value 4
expected no entry at position ( 93, 0 ), but got one with value 4
expected no entry at position ( 94, 0 ), but got one with value 4
expected no entry at position ( 95, 0 ), but got one with value 4
expected no entry at position ( 96, 0 ), but got one with value 4
Test IV did not pass verification
Info: grb::finalize (reference_omp) called.
Info: grb::finalize (reference) called.
Test FAILED (A GraphBLAS algorithm has failed to achieve its intended result (e.g., has not converged))
Example 2:
This is functional test tests/unit/mxm_ndebug_reference_omp
Info: grb::init (reference_omp) called. OpenMP is set to utilise 88 threads.
Info: grb::init (reference) called.
Verifying the semiring version of mxm
mxm_generic will use 1 threads
mxm_generic will use 1 threads
Verifying the operator-monoid version of mxm
mxm_generic will use 1 threads
mxm_generic will use 1 threads
Verifying in-place behaviour of mxm (using semirings)
in this test, the output nonzero structure is unchanged
also in this test, we skip RESIZE as we know a priori the capacity is sufficient
mxm_generic will use 1 threads
Verifying in-place behaviour of mxm (using monoid-op)
in this test, the output nonzero structure changes
mxm_generic will use 1 threads
mxm_generic will use 3 threads
expected no entry at position ( 77, 0 ), but got one with value 4
expected no entry at position ( 78, 0 ), but got one with value 4
expected no entry at position ( 79, 0 ), but got one with value 4
expected no entry at position ( 80, 0 ), but got one with value 4
expected no entry at position ( 81, 0 ), but got one with value 4
expected no entry at position ( 82, 0 ), but got one with value 4
expected no entry at position ( 83, 0 ), but got one with value 4
expected no entry at position ( 84, 0 ), but got one with value 4
expected no entry at position ( 85, 0 ), but got one with value 4
expected no entry at position ( 86, 0 ), but got one with value 4
expected no entry at position ( 87, 0 ), but got one with value 4
expected no entry at position ( 88, 0 ), but got one with value 4
expected no entry at position ( 89, 0 ), but got one with value 4
expected no entry at position ( 90, 0 ), but got one with value 4
expected no entry at position ( 91, 0 ), but got one with value 4
expected no entry at position ( 92, 0 ), but got one with value 4
expected no entry at position ( 93, 0 ), but got one with value 4
expected no entry at position ( 94, 0 ), but got one with value 4
expected no entry at position ( 95, 0 ), but got one with value 4
expected no entry at position ( 96, 0 ), but got one with value 4
Test IV did not pass verification
Info: grb::finalize (reference_omp) called.
Info: grb::finalize (reference) called.
Test FAILED (A GraphBLAS algorithm has failed to achieve its intended result (e.g., has not converged))
Roughly half of the times the test passes OK. The difference of 2 (start row 77 vs. start row 79) and the fact that there should be nonzeroes on every row suggests some sort of racing condition writing out column indices.
TODOs:
After all that, there's one more issue:
One more optimisation idea and one more bug remaining... Putting the former one as a TODO so it's not forgotten:
Everything should be done now. Running all smoke and unit tests to confirm all OK, and if so, will start clean-up and merge (finally!)
Concept release notes:
This MR provides a novel implementation for the sparse matrix--sparse matrix multiplication, grb::mxm
. It modifies the specifications in the following ways:
the mxm
is now in-place, meaning, it computes C += AB. This was intended a long time ago, but somehow the mxm
was never adapted until now. If the old behaviour is desired, first clear C (grb::clear
) before calling mxm
.
the grb::resize
now retains old nonzeroes, and (therefore) calls to resize must request a capacity that is larger (or equal) to the current number of values in the container, or otherwise grb::ILLEGAL
shall be returned.
The newly implemented mxm uses as many threads as internal buffers allow. It still follows Gustavson's approach, where now every thread employs its own private sparse accumulator (SPA)-- which is why the size of the global buffer determines the parallelism of the approach. At least one SPA is guaranteed as the first thread will use the output matrix buffers for its SPA-- threads beyond that first one will use any remaining space in the global buffer. Note that the global buffer size is Theta(m+n+nz) for any matrix in the ALP context, while the SPA memory requirements are proportional to n (or m, if transposed)-- therefore, if nz ~ cn, then this new implementation allows the mxm
to proceed with Theta(c) threads.
Regarding the resize
, recall that the old behaviour was to clear the given container. The rationale for change this behaviour is that it became apparent these semantics make it both awkward to use as well as implement the RESIZE
phase of the mxm
.
This awkwardness was never observed for level-{1,2} primitives because all current implementations use a fixed full capacity for vector data.
Some related specification clarifications were added to that of the resize
: whether backends allow for shrinking memory usage after a call to resize
is defined by its performance semantics. The functional semantics leave this behaviour as optional (and current implementations do not ever shrink capacities).
This MR also includes the following bugfixes:
RESIZE
phase for the masked grb::set
for matrices sometimes did not resize to a large-enough capacitygrb::set(vector,vector,vector)
could erroneously be ignoredgrb::set
for matrices could erroneously select way too many threads, lowering performance significantlygrb::select
This MR also includes the following new features:
mxm
)_DEBUG_UTILS_PREFIXSUM
to debug only that particular utility or _DEBUG_REFERENCE_BLAS3
to debug only level-3 primitives in the reference backends. Defining _DEBUG
will still enable all debug outputs)Tests have been modified as follows:
grb::resize
As always, this MR also includes some code style fixes and provides some missing code documentations.
Review OK, waiting test results before merge
2400 tests OK, including some of the performance tests. All smoke and unit tests (with LPF) OK. Manually tested OK also the mxm
-related perftests. Will merge.
Novel implementation of parallel SpGEMM. Running smoke and unit tests now, with LPF.