emusolutions / LAGraph

This is a library plus a test harness for collecting algorithms that use the GraphBLAS

Can't run tests for LC build #4

Closed jamesETsmith closed 1 year ago

jamesETsmith commented 1 year ago

General

I set out to check which LAGraph tests pass or fail for LC builds and found that many of them (~24 out of 36) crash. I see several kinds of errors, some of which are shown below. About a third of the tests (~12 out of 36) genuinely pass and several of the demos run, so not everything is failing on LC.

Details

Here are some of the errors I'm seeing:

Type 1: Failure in address translation

1: Test test_bc...                                 [ERROR]: Failure in address translation: shared bit wasn't set.
1:  addr_in=0x22, addr=0x22
1: EXCEPTION!

Type 2: Trying to fork

2:   Cannot fork. Invalid argument [22]

Type 3: LAGraph Assertion Failures

4: Test test_Degree...                             Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.
4: status: -2, msg: LAGraph assertion "G != ((void*)0)" failed (file /net/hyper120h-c/data/jsmith/apps/LAGraph/src/utility/LAGraph_Cached_OutDegree.c, line 49): status: -2

Type 4: Runtime free error

    (RUNTIME FREE ERROR!
**Check Address (A) Register**
THROWING ILLEGAL EXCEPTION!

jamesETsmith commented 1 year ago

@skuntz, I forgot to tag you on this issue yesterday. These are the problems we discussed offline.

skuntz commented 1 year ago

I'll take a look at this next.

jamesETsmith commented 1 year ago

:rage: The fork errors were caused by acutest, the test framework used by LAGraph, trying to create child processes to run the different subtests. You can turn off child process creation by passing the --no-exec flag to the test executables. Since it's a pain to remember, I've added the flag to the tests that run through cmake, but you need to add it manually if you are running the tests outside of ctest. See eff66ca for details.
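
For anyone running a test executable by hand, here is a minimal sketch of building the command line with the flag appended. `build_test_command` is a hypothetical helper name, and placing `--no-exec` after the test name on an `emusim.x -- <test>` command line is an assumption about how the simulator forwards arguments to the test binary.

```python
from typing import List, Optional


def build_test_command(test_binary: str, simulator: Optional[str] = None) -> List[str]:
    """Return argv for one acutest binary with child-process creation disabled."""
    test_args = [test_binary, "--no-exec"]
    if simulator is None:
        return test_args
    # Assumption: everything after "--" is forwarded to the simulated program.
    return [simulator, "--"] + test_args


if __name__ == "__main__":
    print(" ".join(build_test_command("test_PageRank", "emusim.x")))
```

The list form can be handed straight to `subprocess.run` without shell quoting.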

After adding the appropriate flags, here are the remaining test failures:

The following tests FAILED:
      1 - ctest_Betweenness (Failed)
      2 - ctest_BreadthFirstSearch (Failed)
      3 - ctest_Cached_AT (Failed)
      7 - ctest_CheckGraph (Failed)
      8 - ctest_ConnectedComponents (Failed)
      9 - ctest_DeleteCached (Failed)
     10 - ctest_DisplayGraph (Failed)
     12 - ctest_Init_errors (Failed)
     13 - ctest_IsEqual (Failed)
     15 - ctest_MMRead (Failed)
     17 - ctest_Matrix_Structure (Failed)
     19 - ctest_New (Failed)
     21 - ctest_PageRank (Failed)
     23 - ctest_SingleSourceShortestPath (Failed)
     24 - ctest_Sort (Failed)
     25 - ctest_SortByDegree (Failed)
     26 - ctest_TriangleCount (Failed)
     27 - ctest_Type (Failed)
     29 - ctest_Vector_Structure (Failed)
     35 - ctest_minmax (Failed)

Here is some selected info from the test log (mostly for me):

Test Init_errors...                             RUNTIME FREE ERROR!
Test test_AT...                                 [ERROR]: Failure in address translation: shared bit wasn't set.
[ERROR]: Failure in address translation: shared bit wasn't set.
Test BreadthFirstSearch_invalid_src...          RUNTIME FREE ERROR!
Test test_bc...                                 [ERROR]: Failure in address translation: shared bit wasn't set.
Test cc...                                      [ERROR]: Failure in address translation: shared bit wasn't set.
Test MMRead...                                  [ERROR]: Failure in address translation: shared bit wasn't set.
Test Matrix_Structure...                        [ERROR]: Failure in address translation: shared bit wasn't set.
Test IsEqual...                                 [ERROR]: Failure in address translation: shared bit wasn't set.
Test SSSP...                                    [ERROR]: Failure in address translation: shared bit wasn't set.
Test test_ranker...                             [ERROR]: Failure in address translation: shared bit wasn't set.
msg: LAGraph failure (file /net/hyper120h-c/data/jsmith/apps/LAGraph/src/utility/LAGraph_CheckGraph.c, line 113): in_degree has wrong type; must be GrB_INT64RUNTIME FREE ERROR!
Test TypeSize...                                RUNTIME FREE ERROR!
Test New_failures...                            RUNTIME FREE ERROR!
Test Vector_Structure_failures...               RUNTIME FREE ERROR!
Test TriangleCount_Methods1...                  [ERROR]: Failure in address translation: shared bit wasn't set.
    ([ERROR]: Failure in address translation: shared bit wasn't set.
result: -1000, msg: LAGraph failure (file /net/hyper120h-c/data/jsmith/apps/LAGraph/src/utility/LAGraph_CheckGraph.c, line 85): A and AT must have the same typeRUNTIME FREE ERROR!
Test test_sort2...                              RUNTIME FREE ERROR!
    (RUNTIME FREE ERROR!

skuntz commented 1 year ago

The address failures appear to be in @void LGB::builtin_cast<float, float>(std::byte*, std::byte const*), called from matrix reduce. I'm adding my debugging path for test_PageRank for reference.

TEST OUTPUT

[skuntz@hyper120h-a test]$ /tools/lucata-23.R1.NoCxxUtils/bin/emusim.x -- test_PageRank 

        SystemC 2.3.3-Accellera --- Feb  1 2023 10:15:06
        Copyright (c) 1996-2018 by all Contributors,
        ALL RIGHTS RESERVED
Start untimed simulation with local date and time= Fri Mar 17 14:49:45 2023

Test test_ranker...                             [ERROR]: Failure in address translation: shared bit wasn't set.
    addr_in=0x22, addr=0x22
EXCEPTION!
ThreadID=2760
HW ThreadID=0xdf51097e95a
Thread using HW ThreadID
ThreadletState=Service request
ThreadletException=5=Address
     Exception cause string: Translation failure
ExecutionType=7
Current Instruction:
808c5a46    LD32A:  iToken=178  iLength=3   nibbles=b5e000
Threadlet TCB Data:
TCB.(TPC)=(0x808c5a46) (32 bits each)
TCB.(D,D2)=(1,1) (one bit each)
TCB.A2=1 
TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
TCB.AID=0x1 (8 bits)
TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
TCB.M=0 (one bit)

Threadlet State Registers
TCB0: 0x000cffff74000200
TCB1: 0x00000000808c5a46

Threadlet Data Registers
A: 0x22=34
A2: 0x1800000006b2410=108086391063913488
Format: signed decimal, unsigned decimal, hex
D: 2156681796,  2156681796, 0x808c5a44
D2: 36028797025750032, 36028797025750032, 0x80000000678c10
E[0] (Live): 108086399646875120, 108086399646875120, 0x18000020000bdf0
E[1] (Live): 2160330053, 2160330053, 0x80c40545
E[2] (Live): 108086399646875136, 108086399646875136, 0x18000020000be00
E[3] (Live): 34, 34, 0x22
E[4] (Live): 108086399646875136, 108086399646875136, 0x18000020000be00
E[5] (Live): 1, 1, 0x1
E[6] (Live): 0, 0, 0x0
E[7] (Live): 1024, 1024, 0x400
E[8] (Live): 108086393204384256, 108086393204384256, 0x180000080002200
E[9] (Live): 108086393204384264, 108086393204384264, 0x180000080002208
E[10] (Live): 108086393204384240, 108086393204384240, 0x1800000800021f0
E[11] (Live): 108086393204385728, 108086393204385728, 0x1800000800027c0
E[12] (Live): 2, 2, 0x2
E[13] (Live): 1, 1, 0x1
E[14] (Live): 1024, 1024, 0x400
E[15] (Live): 108086393204384208, 108086393204384208, 0x1800000800021d0

Other Useful Data
Fence Counter=0
Source Node=0
Dest Node=-1
End untimed simulation with local date and time= Fri Mar 17 14:49:47 2023

GENERATE OBJECT DUMP

/tools/lucata-23.R1.NoCxxUtils/bin/gossamer64-objdump -xD test_PageRank > test_PageRank.od

SEARCH FOR TPC = 0x808c5a46

0000000040462d22 <@_ZN3LGB12builtin_castIffEEvPSt4bytePKS1_>:
    40462d22:   808c5a44:   ETA 3
    40462d23:   808c5a46:   LD32A
    40462d25:   808c5a49:   ETA 2
    40462d26:   808c5a4b:   ST32
    40462d27:   808c5a4d:   JMPE    1

Use C++ name demangler to get function name: http://demangler.com/

@void LGB::builtin_cast<float, float>(std::byte*, std::byte const*)
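
The demangling step can also be done locally instead of through the website. This is a minimal sketch that assumes GNU binutils' `c++filt` is on the PATH and that the leading `@` on symbols in this dump has to be stripped before demangling; `demangle` is a hypothetical helper name.

```python
import shutil
import subprocess


def demangle(symbol: str) -> str:
    """Demangle an Itanium C++ symbol, dropping the leading '@' seen in the dump."""
    mangled = symbol.lstrip("@")
    tool = shutil.which("c++filt")
    if tool is None:
        # No demangler on this machine: hand back the mangled name unchanged.
        return mangled
    result = subprocess.run([tool, mangled], capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(demangle("@_ZN3LGB12builtin_castIffEEvPSt4bytePKS1_"))
```

If `c++filt` does not recognize a name it echoes the input, so the function never raises just because a symbol is not mangled C++.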

FIND CALLING FUNCTION

The E[1] register from the error information holds the function return TPC = 0x80c40545. Search for that in the object dump, then work your way backwards to the @ symbol to find the function name.

@_ZZN3LGB13matrix_reduceINS_5bytesILi4EEENS_25generic_binary_op_functorES2_EE8GrB_InfoRT1_RKNS_6MatrixERKS5_T0_ENUlRT_E_clIN3emu13striped_arrayIPS2_EEEEDaSE_.cilkhelper

after demangling

@auto LGB::matrix_reduce<LGB::bytes<4>, LGB::generic_binary_op_functor, LGB::bytes<4> >(LGB::bytes<4>&, LGB::Matrix const&, LGB::bytes<4> const&, LGB::generic_binary_op_functor)::{lambda(auto:1&)#1}::operator()<emu::striped_array<LGB::bytes<4>*> >(emu::striped_array<LGB::bytes<4>*>&).cilkhelper
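
The "search for the TPC, then walk back to the symbol" procedure can be sketched as a small script over the object dump text. The line layout (a `<symbol>:` header followed by `address: TPC: mnemonic` rows) is inferred from the excerpt in this comment and may not hold for every section of a real dump. `function_containing` is a hypothetical helper, and it only finds TPCs that appear exactly in the second column.

```python
import re
from typing import Optional

# "0000000040462d22 <@_ZN3LGB...>:" style symbol header.
SYMBOL_RE = re.compile(r"^[0-9a-f]+ <(?P<name>.+)>:\s*$")
# "    40462d23:   808c5a46:   LD32A" style instruction row; second column is the TPC.
INSN_RE = re.compile(r"^\s*[0-9a-f]+:\s+(?P<tpc>[0-9a-f]+):")


def function_containing(dump_text: str, tpc: int) -> Optional[str]:
    """Return the most recent symbol header seen before the row with this TPC."""
    current = None
    for line in dump_text.splitlines():
        header = SYMBOL_RE.match(line)
        if header:
            current = header.group("name")
            continue
        insn = INSN_RE.match(line)
        if insn and int(insn.group("tpc"), 16) == tpc:
            return current
    return None


SAMPLE = """\
0000000040462d22 <@_ZN3LGB12builtin_castIffEEvPSt4bytePKS1_>:
    40462d22:   808c5a44:   ETA 3
    40462d23:   808c5a46:   LD32A
    40462d25:   808c5a49:   ETA 2
    40462d26:   808c5a4b:   ST32
    40462d27:   808c5a4d:   JMPE    1
"""

if __name__ == "__main__":
    print(function_containing(SAMPLE, 0x808C5A46))
```

For a real run, read `test_PageRank.od` into a string and pass the TPC from the exception report or the return TPC from E[1].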

I also looked at test_Betweenness and the error is also in

@void LGB::builtin_cast<float, float>(std::byte*, std::byte const*)

called from

@auto LGB::matrix_reduce<LGB::bytes<4>, LGB::generic_binary_op_functor, LGB::bytes<8> >(LGB::bytes<8>&, LGB::Matrix const&, LGB::bytes<8> const&, LGB::generic_binary_op_functor)::{lambda(auto:1&)#1}::operator()<emu::striped_array<LGB::bytes<8>*> >(emu::striped_array<LGB::bytes<8>*>&).cilkhelper

Note, though, that the sizes for matrix reduce in PageRank were all 4 bytes, whereas this one combines 4 bytes and 8 bytes. The failure appears to be related to the conversion between float and byte, but I haven't dug in any further than that.

skuntz commented 1 year ago

Note I am getting the same error in the simulator for LGB testVSel.mwx (in builtin cast). Makes me wonder if something changed with a recent commit.

[skuntz@hyper120h-a CTest]$ /tools/lucata/bin/emusim.x -- testVSel.mwx

        SystemC 2.3.3-Accellera --- Nov 15 2022 13:46:43
        Copyright (c) 1996-2018 by all Contributors,
        ALL RIGHTS RESERVED
Start untimed simulation with local date and time= Sun Mar 19 20:47:53 2023

Suite base path is /home/skuntz/LGBdev/CTest/suite/regression
Running testVSel:
testVSel: STARTED: CD_L0_T0_D1
[ERROR]: Failure in address translation: shared bit wasn't set.
        addr_in=0x6, addr=0x6
EXCEPTION!
ThreadID=4456
HW ThreadID=0x9205a6d7865
Thread using HW ThreadID
ThreadletState=Service request
ThreadletException=5=Address
         Exception cause string: Translation failure
ExecutionType=7
Current Instruction:
8087ae74        LD8A:   iToken=178      iLength=3       nibbles=b3e000
Threadlet TCB Data:
TCB.(TPC)=(0x8087ae74) (32 bits each)
TCB.(D,D2)=(1,1) (one bit each)
TCB.A2=1
TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
TCB.AID=0x1 (8 bits)
TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
TCB.M=0 (one bit)

Threadlet State Registers
TCB0: 0x000cffff74000200
TCB1: 0x000000008087ae74

Threadlet Data Registers
A: 0x6=6
A2: 0x800000005b48a0=36028797024946336
Format: signed decimal, unsigned decimal, hex
D: 2156375666,  2156375666, 0x8087ae72
D2: 36028797024946336, 36028797024946336, 0x800000005b48a0
E[0] (Live): 108086403941890016, 108086403941890016, 0x1800003000177e0
E[1] (Live): 2151321889, 2151321889, 0x803a9121
E[2] (Live): 108086403941890032, 108086403941890032, 0x1800003000177f0
E[3] (Live): 6, 6, 0x6
E[4] (Live): 108086403941890032, 108086403941890032, 0x1800003000177f0
E[5] (Live): 1, 1, 0x1
E[6] (Live): 0, 0, 0x0
E[7] (Live): 1024, 1024, 0x400
E[8] (Live): 108086393204552032, 108086393204552032, 0x18000008002b160
E[9] (Live): 108086393204552040, 108086393204552040, 0x18000008002b168
E[10] (Live): 108086393204552016, 108086393204552016, 0x18000008002b150
E[11] (Live): 108086399646832752, 108086399646832752, 0x180000200001870
E[12] (Live): 2, 2, 0x2
E[13] (Live): 1, 1, 0x1
E[14] (Live): 1024, 1024, 0x400
E[15] (Live): 108086393204551984, 108086393204551984, 0x18000008002b130

Other Useful Data
Fence Counter=0
Source Node=0
Dest Node=-1
End untimed simulation with local date and time= Sun Mar 19 20:47:57 2023

jamesETsmith commented 1 year ago

Great find @skuntz! I can look into the LGB side of things

jamesETsmith commented 1 year ago

Just for reference here, I see 55 tests failed out of 67 and 50 of those failures show the [ERROR]: Failure in address translation: shared bit wasn't set. error.
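
A tally like this can be produced from a saved test log with a few lines of scripting. The sketch below counts log lines per error signature, using the message strings quoted earlier in this thread. It counts lines rather than tests, so a test that prints the same error twice is counted twice. `tally_errors` and the label names are hypothetical.

```python
from collections import Counter

# Substrings taken from the error messages quoted in this issue.
SIGNATURES = {
    "address_translation": "Failure in address translation",
    "runtime_free": "RUNTIME FREE ERROR",
    "cannot_fork": "Cannot fork",
}


def tally_errors(log_text: str) -> Counter:
    """Count how many log lines contain each known error signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


SAMPLE_LOG = """\
Test Init_errors...                             RUNTIME FREE ERROR!
Test test_AT...                                 [ERROR]: Failure in address translation: shared bit wasn't set.
[ERROR]: Failure in address translation: shared bit wasn't set.
Test BreadthFirstSearch_invalid_src...          RUNTIME FREE ERROR!
"""

if __name__ == "__main__":
    for label, count in tally_errors(SAMPLE_LOG).most_common():
        print(label, count)
```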

jamesETsmith commented 1 year ago

Many of these issues were resolved by emusolutions/LucataGraphBLAS#264. We now have "only" 17 failing tests:

53% tests passed, 17 tests failed out of 36

Total Test time (real) = 2764.05 sec

The following tests FAILED:
      2 - ctest_BreadthFirstSearch (Failed)
      7 - ctest_CheckGraph (Failed)
      8 - ctest_ConnectedComponents (Failed)
      9 - ctest_DeleteCached (Failed)
     10 - ctest_DisplayGraph (Failed)
     12 - ctest_Init_errors (Failed)
     13 - ctest_IsEqual (Failed)
     15 - ctest_MMRead (Failed)
     17 - ctest_Matrix_Structure (Failed)
     19 - ctest_New (Failed)
     23 - ctest_SingleSourceShortestPath (Failed)
     24 - ctest_Sort (Failed)
     25 - ctest_SortByDegree (Failed)
     26 - ctest_TriangleCount (Failed)
     27 - ctest_Type (Failed)
     29 - ctest_Vector_Structure (Failed)
     35 - ctest_minmax (Failed)

I'm going to investigate which of these are coming from the SelectOp LC problems.
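
To see which tests the fix actually cleared, the two "The following tests FAILED" lists in this thread can be compared mechanically. This is a minimal sketch: the regular expression is written against the ctest summary lines as pasted here, and `failed_tests` is a hypothetical helper name.

```python
import re

# Matches summary rows such as "      1 - ctest_Betweenness (Failed)".
FAILED_RE = re.compile(r"^\s*(\d+) - (\S+) \(Failed\)\s*$")


def failed_tests(ctest_summary: str) -> set:
    """Extract the names of failed tests from a ctest summary block."""
    names = set()
    for line in ctest_summary.splitlines():
        match = FAILED_RE.match(line)
        if match:
            names.add(match.group(2))
    return names


BEFORE = """\
      1 - ctest_Betweenness (Failed)
      2 - ctest_BreadthFirstSearch (Failed)
      3 - ctest_Cached_AT (Failed)
      7 - ctest_CheckGraph (Failed)
      8 - ctest_ConnectedComponents (Failed)
      9 - ctest_DeleteCached (Failed)
     10 - ctest_DisplayGraph (Failed)
     12 - ctest_Init_errors (Failed)
     13 - ctest_IsEqual (Failed)
     15 - ctest_MMRead (Failed)
     17 - ctest_Matrix_Structure (Failed)
     19 - ctest_New (Failed)
     21 - ctest_PageRank (Failed)
     23 - ctest_SingleSourceShortestPath (Failed)
     24 - ctest_Sort (Failed)
     25 - ctest_SortByDegree (Failed)
     26 - ctest_TriangleCount (Failed)
     27 - ctest_Type (Failed)
     29 - ctest_Vector_Structure (Failed)
     35 - ctest_minmax (Failed)
"""

AFTER = """\
      2 - ctest_BreadthFirstSearch (Failed)
      7 - ctest_CheckGraph (Failed)
      8 - ctest_ConnectedComponents (Failed)
      9 - ctest_DeleteCached (Failed)
     10 - ctest_DisplayGraph (Failed)
     12 - ctest_Init_errors (Failed)
     13 - ctest_IsEqual (Failed)
     15 - ctest_MMRead (Failed)
     17 - ctest_Matrix_Structure (Failed)
     19 - ctest_New (Failed)
     23 - ctest_SingleSourceShortestPath (Failed)
     24 - ctest_Sort (Failed)
     25 - ctest_SortByDegree (Failed)
     26 - ctest_TriangleCount (Failed)
     27 - ctest_Type (Failed)
     29 - ctest_Vector_Structure (Failed)
     35 - ctest_minmax (Failed)
"""

fixed = sorted(failed_tests(BEFORE) - failed_tests(AFTER))
newly_failing = sorted(failed_tests(AFTER) - failed_tests(BEFORE))

if __name__ == "__main__":
    print("fixed:", fixed)
    print("newly failing:", newly_failing)
```

On these two lists the difference is ctest_Betweenness, ctest_Cached_AT, and ctest_PageRank, and no test started failing between the runs.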

jamesETsmith commented 1 year ago

Several of the failing tests were caused by a problem in LAGraph (see GraphBLAS/LAGraph#184).

So far the only failing tests are shown below (test_BreadthFirstSearch and test_MMRead did not finish in time).

14/36 Test #12: ctest_Init_errors .................***Failed    0.33 sec
20/36 Test #23: ctest_SingleSourceShortestPath ....***Failed    1.26 sec
21/36 Test  #8: ctest_ConnectedComponents .........***Failed   20.68 sec
25/36 Test #25: ctest_SortByDegree ................***Failed   65.25 sec
26/36 Test #26: ctest_TriangleCount ...............***Failed   97.72 sec
29/36 Test  #9: ctest_DeleteCached ................***Failed  214.66 sec

EDIT: test_MMRead passes when omitting the larger matrices: olm1000.mtx, bcsstk13.mtx, cryg2500.mtx, tree-example.mtx, and west0067.mtx

jamesETsmith commented 1 year ago

Since I broke this issue into smaller ones, I'm closing it.