Closed jamesETsmith closed 1 year ago
@skuntz forgot to tag you on this issue yesterday. These are the problems we discussed offline yesterday.
I'll take a look at this next.
:rage: the fork errors were caused by the test framework (acutest) used by LAGraph trying to create child processes to run different subtests. You can turn off this child process creation with the --no-exec flag for the test executables. Since it's a pain to remember, I've added the flag when we run the tests with cmake, but you need to add it manually if you are running the tests outside of ctest. See eff66ca for details.
After adding the appropriate flags, here are the remaining test failures:
The following tests FAILED:
1 - ctest_Betweenness (Failed)
2 - ctest_BreadthFirstSearch (Failed)
3 - ctest_Cached_AT (Failed)
7 - ctest_CheckGraph (Failed)
8 - ctest_ConnectedComponents (Failed)
9 - ctest_DeleteCached (Failed)
10 - ctest_DisplayGraph (Failed)
12 - ctest_Init_errors (Failed)
13 - ctest_IsEqual (Failed)
15 - ctest_MMRead (Failed)
17 - ctest_Matrix_Structure (Failed)
19 - ctest_New (Failed)
21 - ctest_PageRank (Failed)
23 - ctest_SingleSourceShortestPath (Failed)
24 - ctest_Sort (Failed)
25 - ctest_SortByDegree (Failed)
26 - ctest_TriangleCount (Failed)
27 - ctest_Type (Failed)
29 - ctest_Vector_Structure (Failed)
35 - ctest_minmax (Failed)
Here is some selected info from the test log (mostly for me):
Test Init_errors... RUNTIME FREE ERROR!
Test test_AT... [ERROR]: Failure in address translation: shared bit wasn't set.
[ERROR]: Failure in address translation: shared bit wasn't set.
Test BreadthFirstSearch_invalid_src... RUNTIME FREE ERROR!
Test test_bc... [ERROR]: Failure in address translation: shared bit wasn't set.
Test cc... [ERROR]: Failure in address translation: shared bit wasn't set.
Test MMRead... [ERROR]: Failure in address translation: shared bit wasn't set.
Test Matrix_Structure... [ERROR]: Failure in address translation: shared bit wasn't set.
Test IsEqual... [ERROR]: Failure in address translation: shared bit wasn't set.
Test SSSP... [ERROR]: Failure in address translation: shared bit wasn't set.
Test test_ranker... [ERROR]: Failure in address translation: shared bit wasn't set.
msg: LAGraph failure (file /net/hyper120h-c/data/jsmith/apps/LAGraph/src/utility/LAGraph_CheckGraph.c, line 113): in_degree has wrong type; must be GrB_INT64RUNTIME FREE ERROR!
Test TypeSize... RUNTIME FREE ERROR!
Test New_failures... RUNTIME FREE ERROR!
Test Vector_Structure_failures... RUNTIME FREE ERROR!
Test TriangleCount_Methods1... [ERROR]: Failure in address translation: shared bit wasn't set.
([ERROR]: Failure in address translation: shared bit wasn't set.
result: -1000, msg: LAGraph failure (file /net/hyper120h-c/data/jsmith/apps/LAGraph/src/utility/LAGraph_CheckGraph.c, line 85): A and AT must have the same typeRUNTIME FREE ERROR!
Test test_sort2... RUNTIME FREE ERROR!
(RUNTIME FREE ERROR!
The address failures appear to be in @void LGB::builtin_cast<float, float>(std::byte*, std::byte const*) called from matrix reduce. I'm adding my debugging path for test_PageRank for reference.
TEST OUTPUT
[skuntz@hyper120h-a test]$ /tools/lucata-23.R1.NoCxxUtils/bin/emusim.x -- test_PageRank
SystemC 2.3.3-Accellera --- Feb 1 2023 10:15:06
Copyright (c) 1996-2018 by all Contributors,
ALL RIGHTS RESERVED
Start untimed simulation with local date and time= Fri Mar 17 14:49:45 2023
Test test_ranker... [ERROR]: Failure in address translation: shared bit wasn't set.
addr_in=0x22, addr=0x22
EXCEPTION!
ThreadID=2760
HW ThreadID=0xdf51097e95a
Thread using HW ThreadID
ThreadletState=Service request
ThreadletException=5=Address
Exception cause string: Translation failure
ExecutionType=7
Current Instruction:
808c5a46 LD32A: iToken=178 iLength=3 nibbles=b5e000
Threadlet TCB Data:
TCB.(TPC)=(0x808c5a46) (32 bits each)
TCB.(D,D2)=(1,1) (one bit each)
TCB.A2=1
TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
TCB.AID=0x1 (8 bits)
TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
TCB.M=0 (one bit)
Threadlet State Registers
TCB0: 0x000cffff74000200
TCB1: 0x00000000808c5a46
Threadlet Data Registers
A: 0x22=34
A2: 0x1800000006b2410=108086391063913488
Format: signed decimal, unsigned decimal, hex
D: 2156681796, 2156681796, 0x808c5a44
D2: 36028797025750032, 36028797025750032, 0x80000000678c10
E[0] (Live): 108086399646875120, 108086399646875120, 0x18000020000bdf0
E[1] (Live): 2160330053, 2160330053, 0x80c40545
E[2] (Live): 108086399646875136, 108086399646875136, 0x18000020000be00
E[3] (Live): 34, 34, 0x22
E[4] (Live): 108086399646875136, 108086399646875136, 0x18000020000be00
E[5] (Live): 1, 1, 0x1
E[6] (Live): 0, 0, 0x0
E[7] (Live): 1024, 1024, 0x400
E[8] (Live): 108086393204384256, 108086393204384256, 0x180000080002200
E[9] (Live): 108086393204384264, 108086393204384264, 0x180000080002208
E[10] (Live): 108086393204384240, 108086393204384240, 0x1800000800021f0
E[11] (Live): 108086393204385728, 108086393204385728, 0x1800000800027c0
E[12] (Live): 2, 2, 0x2
E[13] (Live): 1, 1, 0x1
E[14] (Live): 1024, 1024, 0x400
E[15] (Live): 108086393204384208, 108086393204384208, 0x1800000800021d0
Other Useful Data
Fence Counter=0
Source Node=0
Dest Node=-1
End untimed simulation with local date and time= Fri Mar 17 14:49:47 2023
GENERATE OBJECT DUMP
/tools/lucata-23.R1.NoCxxUtils/bin/gossamer64-objdump -xD test_PageRank > test_PageRank.od
SEARCH FOR TPC = 0x808c5a46
0000000040462d22 <@_ZN3LGB12builtin_castIffEEvPSt4bytePKS1_>:
40462d22: 808c5a44: ETA 3
40462d23: 808c5a46: LD32A
40462d25: 808c5a49: ETA 2
40462d26: 808c5a4b: ST32
40462d27: 808c5a4d: JMPE 1
Use C++ name demangler to get function name: http://demangler.com/
@void LGB::builtin_cast<float, float>(std::byte*, std::byte const*)
FIND CALLING FUNCTION
The E[1] register from the error information holds the function return TPC = 0x80c40545. Search for that in the object dump, then work your way backwards to the @ symbol to find the function name:
@_ZZN3LGB13matrix_reduceINS_5bytesILi4EEENS_25generic_binary_op_functorES2_EE8GrB_InfoRT1_RKNS_6MatrixERKS5_T0_ENUlRT_E_clIN3emu13striped_arrayIPS2_EEEEDaSE_.cilkhelper
after demangling
@auto LGB::matrix_reduce<LGB::bytes<4>, LGB::generic_binary_op_functor, LGB::bytes<4> >(LGB::bytes<4>&, LGB::Matrix const&, LGB::bytes<4> const&, LGB::generic_binary_op_functor)::{lambda(auto:1&)#1}::operator()<emu::striped_array<LGB::bytes<4>*> >(emu::striped_array<LGB::bytes<4>*>&).cilkhelper
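The manual search above (find the TPC in the object dump, then walk back to the enclosing @ symbol) can be scripted. This is only a rough sketch: it assumes the dump lines look exactly like the excerpt in this comment (a `<@symbol>:` header followed by `address: TPC: mnemonic` lines), and the helper name `find_symbol_for_tpc` is made up for illustration. The real gossamer64-objdump output may need a different pattern.

```python
import re

# Symbol header as it appears in the excerpt above, e.g.
# "0000000040462d22 <@_ZN3LGB12builtin_castIffEEvPSt4bytePKS1_>:"
SYMBOL_RE = re.compile(r"^[0-9a-fA-F]+\s+<(?P<name>[^>]+)>:\s*$")
# Instruction line as it appears above, e.g. "40462d23: 808c5a46: LD32A"
# (assumed layout: address, then TPC, then mnemonic).
INSN_RE = re.compile(r"^\s*[0-9a-fA-F]+:\s+(?P<tpc>[0-9a-fA-F]+):")


def find_symbol_for_tpc(dump_lines, tpc):
    """Return the symbol whose body contains the instruction at `tpc`, or None.

    Scanning forward while remembering the last symbol header gives the same
    answer as finding the TPC and walking backwards to the header.
    """
    target = int(tpc, 16)  # accepts "808c5a46" and "0x808c5a46"
    current_symbol = None
    for line in dump_lines:
        header = SYMBOL_RE.match(line)
        if header:
            current_symbol = header.group("name")
            continue
        insn = INSN_RE.match(line)
        if insn and int(insn.group("tpc"), 16) == target:
            return current_symbol
    return None


if __name__ == "__main__":
    sample = [
        "0000000040462d22 <@_ZN3LGB12builtin_castIffEEvPSt4bytePKS1_>:",
        "40462d22: 808c5a44: ETA 3",
        "40462d23: 808c5a46: LD32A",
        "40462d25: 808c5a49: ETA 2",
    ]
    print(find_symbol_for_tpc(sample, "0x808c5a46"))
```

For a real run, pass the lines of test_PageRank.od instead of `sample`, once with the TPC from the exception and once with the return TPC from E[1]. The returned name still has to go through a demangler (without the leading @).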
I also looked at test_Betweenness and the error is also in
@void LGB::builtin_cast<float, float>(std::byte*, std::byte const*)
called from
@auto LGB::matrix_reduce<LGB::bytes<4>, LGB::generic_binary_op_functor, LGB::bytes<8> >(LGB::bytes<8>&, LGB::Matrix const&, LGB::bytes<8> const&, LGB::generic_binary_op_functor)::{lambda(auto:1&)#1}::operator()<emu::striped_array<LGB::bytes<8>*> >(emu::striped_array<LGB::bytes<8>*>&).cilkhelper
Note, though, that the sizes for matrix reduce in PageRank were all 4 bytes, whereas this one combines 4 bytes and 8 bytes. It appears to be related to the conversion between float and byte, but I haven't dug in any further than that.
Note I am getting the same error in the simulator for LGB testVSel.mwx (in builtin cast). Makes me wonder if something changed with a recent commit.
[skuntz@hyper120h-a CTest]$ /tools/lucata/bin/emusim.x -- testVSel.mwx
SystemC 2.3.3-Accellera --- Nov 15 2022 13:46:43
Copyright (c) 1996-2018 by all Contributors,
ALL RIGHTS RESERVED
Start untimed simulation with local date and time= Sun Mar 19 20:47:53 2023
Suite base path is /home/skuntz/LGBdev/CTest/suite/regression
Running testVSel:
testVSel: STARTED: CD_L0_T0_D1
[ERROR]: Failure in address translation: shared bit wasn't set.
addr_in=0x6, addr=0x6
EXCEPTION!
ThreadID=4456
HW ThreadID=0x9205a6d7865
Thread using HW ThreadID
ThreadletState=Service request
ThreadletException=5=Address
Exception cause string: Translation failure
ExecutionType=7
Current Instruction:
8087ae74 LD8A: iToken=178 iLength=3 nibbles=b3e000
Threadlet TCB Data:
TCB.(TPC)=(0x8087ae74) (32 bits each)
TCB.(D,D2)=(1,1) (one bit each)
TCB.A2=1
TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
TCB.AID=0x1 (8 bits)
TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
TCB.M=0 (one bit)
Threadlet State Registers
TCB0: 0x000cffff74000200
TCB1: 0x000000008087ae74
Threadlet Data Registers
A: 0x6=6
A2: 0x800000005b48a0=36028797024946336
Format: signed decimal, unsigned decimal, hex
D: 2156375666, 2156375666, 0x8087ae72
D2: 36028797024946336, 36028797024946336, 0x800000005b48a0
E[0] (Live): 108086403941890016, 108086403941890016, 0x1800003000177e0
E[1] (Live): 2151321889, 2151321889, 0x803a9121
E[2] (Live): 108086403941890032, 108086403941890032, 0x1800003000177f0
E[3] (Live): 6, 6, 0x6
E[4] (Live): 108086403941890032, 108086403941890032, 0x1800003000177f0
E[5] (Live): 1, 1, 0x1
E[6] (Live): 0, 0, 0x0
E[7] (Live): 1024, 1024, 0x400
E[8] (Live): 108086393204552032, 108086393204552032, 0x18000008002b160
E[9] (Live): 108086393204552040, 108086393204552040, 0x18000008002b168
E[10] (Live): 108086393204552016, 108086393204552016, 0x18000008002b150
E[11] (Live): 108086399646832752, 108086399646832752, 0x180000200001870
E[12] (Live): 2, 2, 0x2
E[13] (Live): 1, 1, 0x1
E[14] (Live): 1024, 1024, 0x400
E[15] (Live): 108086393204551984, 108086393204551984, 0x18000008002b130
Other Useful Data
Fence Counter=0
Source Node=0
Dest Node=-1
End untimed simulation with local date and time= Sun Mar 19 20:47:57 2023
Great find @skuntz! I can look into the LGB side of things
Just for reference here, I see 55 tests failed out of 67, and 50 of those failures show the "[ERROR]: Failure in address translation: shared bit wasn't set." error.
Many of these issues were resolved by emusolutions/LucataGraphBLAS#264. We now have "only" 17 failing tests:
53% tests passed, 17 tests failed out of 36
Total Test time (real) = 2764.05 sec
The following tests FAILED:
2 - ctest_BreadthFirstSearch (Failed)
7 - ctest_CheckGraph (Failed)
8 - ctest_ConnectedComponents (Failed)
9 - ctest_DeleteCached (Failed)
10 - ctest_DisplayGraph (Failed)
12 - ctest_Init_errors (Failed)
13 - ctest_IsEqual (Failed)
15 - ctest_MMRead (Failed)
17 - ctest_Matrix_Structure (Failed)
19 - ctest_New (Failed)
23 - ctest_SingleSourceShortestPath (Failed)
24 - ctest_Sort (Failed)
25 - ctest_SortByDegree (Failed)
26 - ctest_TriangleCount (Failed)
27 - ctest_Type (Failed)
29 - ctest_Vector_Structure (Failed)
35 - ctest_minmax (Failed)
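To see which tests the fix resolved, the two "The following tests FAILED" lists in this thread can be diffed. Here is a minimal sketch, assuming the `N - name (Failed)` line layout shown above; the function names are made up for illustration.

```python
import re

# One entry of a ctest failure summary, e.g. "21 - ctest_PageRank (Failed)"
FAILED_RE = re.compile(r"^\s*\d+\s+-\s+(?P<name>\S+)\s+\(Failed\)")


def failed_tests(summary_lines):
    """Collect the test names from a 'The following tests FAILED' listing."""
    names = set()
    for line in summary_lines:
        entry = FAILED_RE.match(line)
        if entry:
            names.add(entry.group("name"))
    return names


def newly_passing(before_lines, after_lines):
    """Tests listed as failed before but no longer listed afterwards."""
    return sorted(failed_tests(before_lines) - failed_tests(after_lines))


if __name__ == "__main__":
    before = [
        "The following tests FAILED:",
        "1 - ctest_Betweenness (Failed)",
        "2 - ctest_BreadthFirstSearch (Failed)",
    ]
    after = [
        "The following tests FAILED:",
        "2 - ctest_BreadthFirstSearch (Failed)",
    ]
    print(newly_passing(before, after))
```

Applied to the first failure list in this thread and the one above, the difference is ctest_Betweenness, ctest_Cached_AT, and ctest_PageRank.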
I'm going to investigate which of these are coming from the SelectOp LC problems.
Several of the failing tests were caused by a problem in LAGraph (see GraphBLAS/LAGraph#184).
So far the only failing tests are shown below (test_BreadthFirstSearch and test_MMRead did not finish in time).
14/36 Test #12: ctest_Init_errors .................***Failed 0.33 sec
20/36 Test #23: ctest_SingleSourceShortestPath ....***Failed 1.26 sec
21/36 Test #8: ctest_ConnectedComponents .........***Failed 20.68 sec
25/36 Test #25: ctest_SortByDegree ................***Failed 65.25 sec
26/36 Test #26: ctest_TriangleCount ...............***Failed 97.72 sec
29/36 Test #9: ctest_DeleteCached ................***Failed 214.66 sec
EDIT: test_MMRead passes when omitting the larger matrices: olm1000.mtx, bcsstk13.mtx, cryg2500.mtx, tree-example.mtx, and west0067.mtx.
Since I broke this issue into smaller ones, I'm closing it.
General
I set out to check which LAGraph tests succeed/fail for LC builds and found that many of them (~24 out of 36) crash. I get several errors, some of which I show below. A third of the tests (~12 out of 36) genuinely pass and several of the demos run, so it's not as if everything LC is failing.
Details
Here are some of the errors I'm seeing:
Type 1: Failure in address translation
Type 2: Trying to fork
Type 3: LAGraph Assertion Failures
Type 4: