DrTimothyAldenDavis / GraphBLAS

SuiteSparse:GraphBLAS: graph algorithms in the language of linear algebra. For production: (default) STABLE branch. Code development: ask me for the right branch before submitting a PR. video intro: https://youtu.be/Tj5y6d7FegI .
http://faculty.cse.tamu.edu/davis/GraphBLAS.html
Other
345 stars 61 forks source link

[BUG]: Incorrect result(!) in vector-matrix multiply with accumulation in versions 8.0.2 and 8.2.0 #244

Closed eriknw closed 8 months ago

eriknw commented 8 months ago

I'm getting different results depending on the sparsity structure of the Matrix. When using saxpy4, the output doesn't change. Here is the specific GraphBLAS call:

GrB_vxm(v, NULL, GrB_PLUS_FP64, GrB_PLUS_TIMES_SEMIRING_FP64, w, A, NULL);

and the burble message for this call:

 [ GrB_vxm C=A*B, saxpy (saxpy4: F += S*F) (coarse, threads: 1) (C in place)
   0.000114 sec ]

Here are the objects when viewed in Python

v:

"v"        nvals  size  dtype  format
gb.Vector      4     4   FP64    full
-------------------------------------
index     0     1     2     3
value  0.25  0.25  0.25  0.25

w:

"w"        nvals  size  dtype  format
gb.Vector      4     4   FP64    full
-------------------------------------
index         0       1         2        3
value  0.070833  0.0425  0.053125  0.10625

A:

"A"        nvals  nrows  ncols  dtype  format
gb.Matrix      6      4      4  INT64     csr
---------------------------------------------
   0  1  2  3
0     3
1  3     2
2     2     2
3        2

Note carefully that A format is CSR, so the sparsity_control was set to "sparse". With this sparsity for A, the output v does not change!

However, if instead we set the sparsity_control to "bitmap", then we get the expected result.

So far I have only tested this with SuiteSparse:GraphBLAS versions 8.0.2 and 8.2.0.

eriknw commented 8 months ago

This problem persists in SuiteSparse:GraphBLAS 8.2.0

DrTimothyAldenDavis commented 8 months ago

That doesn't look good.

This is a case where the JIT should take over, and the burble should reflect that. Do you have COMPACT and NJIT enabled? In that case, however, the burble should report "(punt)" and you should see a switch to saxpy3.

I'm moving slowly this week, recovering from surgery, so I can only work at the keyboard laying flat. Any chance you could create a stand-alone C program that triggers this bug?

DrTimothyAldenDavis commented 8 months ago

Can you also check the return value of the call to GrB_vxm?

eriknw commented 8 months ago

Can you also check the return value of the call to GrB_vxm?

GrB_SUCCESS

This is a case where the JIT should take over, and the burble should reflect that. Do you have COMPACT and NJIT enabled? In that case, however, the burble should report "(punt)" and you should see a switch to saxpy3.

This is not built with COMPACT or NJIT, but the JIT is not fully enabled. GxB_JIT_C_CONTROL is set to GxB_JIT_RUN. The burble doesn't indicate "punt" to "saxpy3".

DrTimothyAldenDavis commented 8 months ago

Got it. With JIT set to "run", it's possible there is a precompiled and already-loaded JIT kernel for this case. If there were no such kernel, you'd see the "(punt)". Try clearing all the loaded JIT kernels first, by setting the JIT control to "off" then back to "run". If there error goes away, you might have a stale JIT kernel. That could be a bug in the JIT, or it could be a stale kernel from a beta release of GraphBLAS.

Try deleting your ~/.SuiteSparse folder. It's possible you have a stale kernel from a beta release, and thus a lingering bug that is now gone.

I don't add beta suffixes to the cache folder name, but perhaps I should.

If this issue doesn't go away after trying the above, then it looks like a JIT bug of some kind, where I'm not handling the "JIT run" case for saxpy4 correctly. If I handled it correctly, and if there were no JIT kernel already loaded, you'd see a "punt to saxpy3" in the burble.

DrTimothyAldenDavis commented 8 months ago

I've written my own main program that replicates this problem, so I can track it down now.

eriknw commented 8 months ago

Same behavior after removing ~/.SuiteSparse/. No additional messages in the burble.

DrTimothyAldenDavis commented 8 months ago

Turns out it's not an error in the JIT. The JIT is not being used at all (but it should be). There's a bug in GB_AxB_saxpy4.c, and the same bug in GB_AxB_saxpy5.c. I think this bug is new to v8.0, when I changed the logic in these 2 methods to use the JIT.

DrTimothyAldenDavis commented 8 months ago

Found the bug. GB_AxB_saxpy4 and GB_AxB_saxpy5 have slightly different logic than the other saxpy methods. In v7.4.4, I check for a built-in semiring, and then don't do anything if it isn't builtin. Typecasting causes this instance to be non-built-in. If these methods do nothing, GraphBLAS punts to saxpy3.

When I added the JIT, I wanted the JIT to handle the case for non-built-in, but I broke the logic. The "factory kernel" still must check for the built-in case, but that's what I broke in v8.x when adding the JIT.

It's an easy fix (and a silly mistake). My tests must not have checked this case.

I will post a v8.2.1.beta, after all my tests pass.

DrTimothyAldenDavis commented 8 months ago

Tests are still in progress, but I think I can post this as a beta release: https://github.com/DrTimothyAldenDavis/GraphBLAS/releases/tag/v8.2.1.beta2

This should fix this bug.

eriknw commented 8 months ago

8.2.1.beta3 fixes the problem for me (I didn't check beta2), thanks! 🚀

DrTimothyAldenDavis commented 8 months ago

Just posted the stable v8.2.1 (identical to the beta3 version). I will also include this in SuiteSparse v7.2.1 and post that as a stable release today.

I've made the same fix in GraphBLAS v9.0.0.beta6, with the GrB_get / GrB_set implementation, which I can release as soon as the v2.1 C API is stable.

DrTimothyAldenDavis commented 8 months ago

SuiteSparse v7.2.1 (stable) now released, with this bug fix.