Open DrTimothyAldenDavis opened 3 months ago
Correct. LLVM was updated from version 17 to version 18 recently in MSYS2.
If I understand correctly some of the tests for LAGraph are failing since that update. Is that correct? Or is there another issue? To be honest, I don't understand what is done in those tests. Is there a commonality to the failing tests?
Maybe the LLVM update broke their compiler. Ideally, we could report that upstream with some context on how to reproduce the error.
Some background: MSYS2 is in the process of dropping support for 32-bit platforms: https://www.msys2.org/news/#2023-12-13-starting-to-drop-some-32-bit-packages
But iiuc, they didn't plan on dropping support for the compiler already. Distributing a broken compiler is worse than distributing nothing though imho...
Yes, there are 4 tests that fail in LAGraph. I thought at first it was because of some of my changes in GraphBLAS (9.0.3 to 9.1.0). But then I tried the stable branch and it failed in the identical manner.
The errors are strange but are repeatable. One method fails with "OMP: out of heap memory" which makes no sense. I'm guessing it's a bug in the update to OpenMP. Perhaps CLANG32 with no OpenMP would work.
The code in the stable branch passed the CI about a week ago. It also passed here: https://github.com/DrTimothyAldenDavis/SuiteSparse/actions/runs/8124994169 which is the same code in the current stable branch ( https://github.com/DrTimothyAldenDavis/SuiteSparse/commit/d4dad6c1d0b5cb3e7c5d7d01ef55653713567662 ).
When the CLANG32 CI failed on dev2, I tried running it manually on the same d4dad6c version in the stable branch, but it failed: https://github.com/DrTimothyAldenDavis/SuiteSparse/actions/runs/8379970870 .
Between these 2 CI runs of the stable branch, on d4dad6c, none of my code changed. The only thing that changed was the github runner used. I diff'd the logs and saw that these 2 runs use different github runners. I had to process the logs to strip the leading text on each line first.
Here is the good output from 3 weeks ago, with the time stamp stripped from each line:
good.txt
Here is the bad output from just yesterday: bad.txt
and the diff: diff_bad_good.txt
Here is a trimmed diff with just the pertinent problems: summary.txt
In the summary.txt file, the 4 failed tests are the same ones that fail when using the latest GraphBLAS 9.1.0 with LAGraph 1.1.3, in the dev2 and now dev branches.
So it's not my code that's broken. Something broke in github.
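The log comparison described above (strip the leading text from each line, then diff) could be sketched roughly as follows. This is only an illustration in Python, not the script that was actually used. It assumes each raw log line starts with an ISO-8601 UTC timestamp followed by a space; the function names `strip_timestamps` and `diff_logs` are made up for this example.

```python
import difflib
import re

# Assumption: every raw log line begins with a timestamp such as
# "2024-03-21T14:02:11.1234567Z " that differs between runs.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z ?")


def strip_timestamps(lines):
    """Return the lines with any leading timestamp (and newline) removed."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(bad_lines, good_lines):
    """Return a unified diff (no context lines) of the two stripped logs."""
    return list(difflib.unified_diff(
        strip_timestamps(bad_lines),
        strip_timestamps(good_lines),
        fromfile="bad.txt", tofile="good.txt", lineterm="", n=0))


if __name__ == "__main__":
    # Tiny made-up logs; real input would be read from the downloaded files.
    bad = ["2024-03-21T10:00:00.0000000Z Current runner version: '2.314.1'",
           "2024-03-21T10:00:01.0000000Z Prepare workflow directory"]
    good = ["2024-03-02T09:00:00.0000000Z Current runner version: '2.313.0'",
            "2024-03-02T09:00:05.0000000Z Prepare workflow directory"]
    for line in diff_logs(bad, good):
        print(line)
```

Without the stripping step, every line would differ because of its timestamp, and the diff would be useless.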
I agree that it is pretty unlikely that this is an error in the SuiteSparse sources that only shows up in that build configuration. I don't think it is the GitHub runners that cause the issue here. The same runners still work correctly for the other build environments (e.g., MINGW32 which is GCC targeting Windows 32-bit). It's more likely that it is the update to a newer LLVM in MSYS2 that caused the issue. MSYS2 packages and distributes binaries for Windows (MinGW), similar to Homebrew for macOS. They also do rolling releases, with all their advantages and disadvantages.
Do the four failing tests have anything in common? Like do they use the same omp pragma or something similar? Or do they test the same functions in GraphBLAS or LAGraph that might get miscompiled?
LLVM 18.1.2 has been released recently. MSYS2 will probably update to that version soon. Maybe, they've already fixed it?
I haven't figured out why those 4 tests fail. They seem to have nothing in common. They likely do, I just don't know what it is. It's hard to track down since I'm not even sure which calls to GraphBLAS are failing, since each of these failed LAGraph methods makes lots of calls to GraphBLAS. It's probably a bug in OpenMP that is causing GraphBLAS to fail in some weird way.
Yes, when I say "the github runner is broken" I meant something in github or in the packages it uses is broken. I'm guessing it's either the 32-bit clang compiler or its openmp library that's broken.
The first few lines of summary.txt show the 2 github runner versions:
1c1
< Current runner version: '2.314.1'
---
> Current runner version: '2.313.0'
and later on, you see the llvm and openmp differences:
145,147c174,176
< mingw-w64-clang-i686-llvm-18.1.1-3-any downloading...
< mingw-w64-clang-i686-clang-18.1.1-3-any downloading...
< mingw-w64-clang-i686-llvm-libs-18.1.1-3-any downloading...
---
> mingw-w64-clang-i686-llvm-17.0.6-7-any downloading...
> mingw-w64-clang-i686-clang-17.0.6-7-any downloading...
> mingw-w64-clang-i686-llvm-libs-17.0.6-7-any downloading...
and this one:
183d211
< mingw-w64-clang-i686-openmp-18.1.1-1-any downloading...
184a213
> mingw-w64-clang-i686-openmp-17.0.6-1-any downloading...
Those packages are the only things that differ between the two runs. My code is the same. The one with clang-18.* fails while clang-17.* works.
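For reference, the package differences quoted from the diff above can be extracted mechanically. The following is a rough Python sketch, not part of the original investigation. It relies on the pacman package naming scheme `name-pkgver-pkgrel-arch`, and the helper names `parse_package` and `changed_packages` are hypothetical.

```python
def parse_package(line):
    """Split a '<name>-<pkgver>-<pkgrel>-<arch> downloading...' line.

    A leading diff marker ('< ' or '> ') is ignored. Returns (name, version),
    where version is 'pkgver-pkgrel'.
    """
    token = line.lstrip("<> ").split()[0]
    # Split from the right, since only the package name may contain hyphens.
    name, pkgver, pkgrel, _arch = token.rsplit("-", 3)
    return name, f"{pkgver}-{pkgrel}"


def changed_packages(bad_lines, good_lines):
    """Map each package present in both runs to (good_version, bad_version)."""
    bad = dict(parse_package(line) for line in bad_lines)
    good = dict(parse_package(line) for line in good_lines)
    return {name: (good[name], bad[name])
            for name in sorted(bad.keys() & good.keys())
            if bad[name] != good[name]}


if __name__ == "__main__":
    bad = ["< mingw-w64-clang-i686-llvm-18.1.1-3-any downloading...",
           "< mingw-w64-clang-i686-openmp-18.1.1-1-any downloading..."]
    good = ["> mingw-w64-clang-i686-llvm-17.0.6-7-any downloading...",
            "> mingw-w64-clang-i686-openmp-17.0.6-1-any downloading..."]
    for name, (old, new) in changed_packages(bad, good).items():
        print(f"{name}: {old} -> {new}")
```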
I haven't tried CLANG32 with OpenMP disabled. If that works then the bug is in mingw-w64-clang-i686-openmp-18.1.1-1-any.
The simplest thing for now is to just disable MINGW(CLANG32) entirely. I can re-enable it sometime in the future, once github switches to a fixed MSYS2 distribution for this case.
To preserve this error, I will make a copy of the stable branch in SuiteSparse, and archive it: https://github.com/DrTimothyAldenDavis/SuiteSparse/tree/github_CI_broke_this_branch
I'm able to reproduce the errors locally in a CLANG32 build environment. When I build without OpenMP, the number of failing tests reduces to 2:
$ ctest . --rerun-failed --output-on-failure
Test project D:/repo/SuiteSparse/SuiteSparse/build-clang32
Start 70: LAGraphX_BF
1/2 Test #70: LAGraphX_BF ......................***Failed 0.08 sec
Test test_BF... transpose time: 0.001
==========input graph: nodes: 34 edges: 156 source node: 0
BF_full1 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full): -nan(ind)
Matrix: karate.mtx
GrB_BOOL matrix: 34-by-34 entries: 156
(0, 1) 1
(0, 2) 1
(0, 3) 1
(0, 4) 1
(0, 5) 1
(0, 6) 1
(0, 7) 1
(0, 8) 1
(0, 10) 1
(0, 11) 1
(0, 12) 1
(0, 13) 1
(0, 17) 1
(0, 19) 1
(0, 21) 1
(0, 31) 1
(1, 0) 1
(1, 2) 1
(1, 3) 1
(1, 7) 1
(1, 13) 1
(1, 17) 1
(1, 19) 1
(1, 21) 1
(1, 30) 1
(2, 0) 1
(2, 1) 1
(2, 3) 1
(2, 7) 1
(2, 8) 1
(2, 9) 1
(2, 13) 1
...
nthreads 1
result: 0
nthreads 1
nthreads 1
nthreads 1
result 0
BF_basic time: 1.000000e-03 (sec), rate: 0.156 (1e6 edges/sec)
speedup of BF_basic: 0
BF_pure_c_double : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c: -nan(ind)
BF_full_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv: -nan(ind)
BF_basic_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv: -nan(ind)
transpose time: 0
==========input graph: nodes: 67 edges: 294 source node: 0
BF_full1 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a time: 1.000000e-03 (sec), rate: 0.294 (1e6 edges/sec)
BF_full2 time: 1.000000e-03 (sec), rate: 0.294 (1e6 edges/sec)
BF_full time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full): -nan(ind)
pure_c integer:
[ FAILED ]
Case karate.mtx:
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
Matrix: west0067.mtx
GrB_FP64 matrix: 67-by-67 entries: 294
(0, 7) -0.834182
(0, 12) 1.26582
(0, 17) -0.336156
(1, 8) -0.834182
(1, 13) 1.01266
(1, 17) -0.29392
(2, 9) -0.834182
(2, 14) 0.759494
(2, 17) -0.221481
(3, 10) -0.834182
(3, 15) 0.506329
(3, 17) -0.118986
(4, 0) -0.278842
(4, 1) -0.8
(4, 6) 0.134462
(4, 7) 0.4
(4, 12) 0.4
(5, 0) -0.268019
(5, 2) -0.8
(5, 6) 0.117568
(5, 8) 0.4
(5, 13) 0.4
(6, 0) -0.232372
(6, 3) -0.8
(6, 6) 0.0885926
(6, 9) 0.4
(6, 14) 0.4
(7, 0) -0.157508
(7, 4) -0.8
(7, 6) 0.0475944
(7, 10) 0.4
(7, 15) 0.4
...
nthreads 1
result: 0
Case west0067.mtx:
test_BF.c:187: Check result == valid... failed
nthreads 1
nthreads 1
nthreads 1
result 1
BF_basic time: 1.000000e-03 (sec), rate: 0.294 (1e6 edges/sec)
speedup of BF_basic: 0
BF_pure_c_double : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c: -nan(ind)
BF_full_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv: -nan(ind)
BF_basic_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv: -nan(ind)
BF_full1 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full): -nan(ind)
-------------------------- A = abs (A)
nthreads 1
result: 0
nthreads 1
nthreads 1
nthreads 1
result 0
BF_basic time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic: -nan(ind)
BF_pure_c_double : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c: -nan(ind)
BF_full_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv: -nan(ind)
BF_basic_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv: -nan(ind)
transpose time: 0
==========input graph: nodes: 7 edges: 12 source node: 0
BF_full1 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full): -nan(ind)
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
test_BF.c:399: Check di == d[i]... failed
Matrix: matrix_int8.mtx
GrB_INT8 matrix: 7-by-7 entries: 12
(0, 1) 127
(0, 3) 7
(1, 4) 5
(1, 6) 8
(2, 5) 1
(3, 0) -128
(3, 2) 0
(4, 5) 7
(5, 2) 5
(6, 2) 9
(6, 3) 1
(6, 4) 1
nthreads 1
result: 0
Case matrix_int8.mtx:
test_BF.c:187: Check result == valid... failed
nthreads 1
nthreads 1
nthreads 1
result 1
BF_basic time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic: -nan(ind)
BF_pure_c_double : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c: -nan(ind)
BF_full_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv: -nan(ind)
BF_basic_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv: -nan(ind)
BF_full1 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2 time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full): -nan(ind)
pure_c integer:
-------------------------- A = abs (A)
nthreads 1
result: 0
nthreads 1
nthreads 1
nthreads 1
result 0
BF_basic time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic: -nan(ind)
BF_pure_c_double : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c: -nan(ind)
BF_full_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv: -nan(ind)
BF_basic_mxv time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv: -nan(ind)
pure_c integer:
test_BF.c:399: Check di == d[i]... failed
FAILED: 1 of 1 unit tests has failed.
Start 86: LAGraphX_msf
2/2 Test #86: LAGraphX_msf .....................***Failed 0.19 sec
Test msf...
================================== A.mtx:
result: 0
msf (known result):
GrB_UINT64 matrix: 7-by-7 entries: 6
(1, 0) 1
(2, 0) 1
(3, 1) 1
(4, 1) 1
(5, 1) 1
(6, 0) 1
[ FAILED ]
Case A.mtx:
test_msf.c:115: Check ok... failed
msf:
GrB_UINT64 matrix: 7-by-7 entries: 0
================================== jagmesh7.mtx:
result: 0
msf:
GrB_UINT64 matrix: 1138-by-1138 entries: 5
(551, 552) 1
(670, 671) 1
(712, 722) 1
(733, 743) 1
(817, 816) 1
================================== west0067.mtx:
result: 0
msf:
GrB_UINT64 matrix: 67-by-67 entries: 0
================================== bcsstk13.mtx:
result: 0
msf:
GrB_UINT64 matrix: 2003-by-2003 entries: 6
(1554, 1559) 0
(1556, 1561) 0
(1742, 1747) 0
(1744, 1748) 0
(1831, 1833) 0
(1932, 1934) 0
================================== karate.mtx:
result: 0
msf:
GrB_UINT64 matrix: 34-by-34 entries: 1
(23, 27) 1
================================== ldbc-cdlp-undirected-example.mtx:
result: 0
msf:
GrB_UINT64 matrix: 8-by-8 entries: 0
================================== ldbc-undirected-example-bool.mtx:
result: 0
msf:
GrB_UINT64 matrix: 9-by-9 entries: 0
================================== ldbc-undirected-example-unweighted.mtx:
result: 0
msf:
GrB_UINT64 matrix: 9-by-9 entries: 0
================================== ldbc-undirected-example.mtx:
result: 0
msf:
GrB_UINT64 matrix: 9-by-9 entries: 0
================================== ldbc-wcc-example.mtx:
result: 0
msf:
GrB_UINT64 matrix: 10-by-10 entries: 0
Test msf_errors... [ OK ]
FAILED: 1 of 2 unit tests has failed.
0% tests passed, 2 tests failed out of 2
Total Test time (real) = 0.29 sec
The following tests FAILED:
70 - LAGraphX_BF (Failed)
86 - LAGraphX_msf (Failed)
Errors while running CTest
I'm struggling to read the output of the failing tests. Do they show what the expected result is and what the actual result is instead?
No, they just show that the test failed. I would need to add more printf's to do that.
I did add some to the test_BF. It showed that the expected values for some d were finite, like 1 or 2, while the computed result was +infinity, which in this case means it was missing in the result (the result vector d was supposed to be full but it was returned sparse). That's very strange, and I didn't dig any deeper once I saw that the stable code also failed in the same way.
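To make the symptom concrete: the check compares each expected distance with the computed one, and an entry that is missing from the result vector shows up as +infinity. Below is a small Python sketch of that kind of comparison. It is not the code in test_BF.c; the function name and the dict-based representation of a sparse vector are assumptions made for illustration only.

```python
import math


def report_mismatches(expected, computed_entries):
    """Compare a full vector of expected distances with a sparse result.

    expected: list of distances, one per node.
    computed_entries: dict mapping node index -> distance; an index that is
    absent stands for an entry missing from the result vector and is treated
    as +infinity.
    Returns a list of (index, expected, computed) for every mismatch.
    """
    mismatches = []
    for i, want in enumerate(expected):
        got = computed_entries.get(i, math.inf)
        if got != want:
            mismatches.append((i, want, got))
    return mismatches


if __name__ == "__main__":
    # Made-up data: node 2 should have distance 2 but is missing.
    expected = [0, 1, 2, 1]
    computed = {0: 0, 1: 1, 3: 1}
    for i, want, got in report_mismatches(expected, computed):
        print(f"d[{i}]: expected {want}, computed {got}")
```

Printing both values for each failing index, rather than only "failed", would show directly whether the result is wrong or simply missing.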
On the off-chance that this would make a difference, I tried again after MSYS2 updated to LLVM 18.1.2: Still the same failing tests in the CLANG32 environment with that version.
Thanks for checking it.
It would be a difficult issue for me to track down to find the specific place in GraphBLAS where the compiler is failing, since I don't have a simple way to replicate this problem on my side. Even if I did, it would be very slow for me since I don't use Windows at all.
Hopefully a future version of LLVM will not have this problem.
Github broke its mingw(clang32) runner.
The stable branch CI worked fine on March 2, 2024. The same CI fails on March 21, 2024. Nothing changed in the meantime: SuiteSparse and its .github/workflow files were unchanged. What did change was the github runner. Github switched the mingw(clang32) runner, and changed clang and openmp from 17.* to 18.*. Something broke, and it's not SuiteSparse.
See this update, which disables the mingw(clang32) tests: https://github.com/DrTimothyAldenDavis/SuiteSparse/pull/778/commits/b1bd9ccc5838178698daaf7c082371c27f86f3e2
The latest dev2 code breaks in the same way as the stable branch now breaks, with 4 test failures in LAGraph. One is an "OMP: out of heap memory" error, which is very strange since the problems being solved are very small.
Once the github runner is fixed, the above change to the SuiteSparse/.github/workflow/build.yaml file can be restored to its original state.