emusolutions / LAGraph

This is a library plus a test harness for collecting algorithms that use the GraphBLAS

`ctest_ConnectedComponents` LC Crash #9

Open jamesETsmith opened 1 year ago

jamesETsmith commented 1 year ago

General

This is a spinoff of https://github.com/emusolutions/LAGraph/issues/4 to tackle the problems with ctest_ConnectedComponents.

Details

 ctest --test-dir build_lc -R ConnectedComponents -V
Internal ctest changing into directory: /net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc
UpdateCTestConfiguration  from :/net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc/DartConfiguration.tcl
UpdateCTestConfiguration  from :/net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc/DartConfiguration.tcl
Test project /net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 8
    Start 8: ctest_ConnectedComponents

8: Test command: /usr/bin/cmake "-E" "env" "/tools/lucata/bin/emusim.x" "--forward_return_value" "--" "test_ConnectedComponents" "--no-exec"
8: Test timeout computed to be: 10000000
8:
8:         SystemC 2.3.3-Accellera --- Apr 21 2023 11:46:50
8:         Copyright (c) 1996-2018 by all Contributors,
8:         ALL RIGHTS RESERVED
8: Test cc...                                      Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.
8: Matrix: karate.mtx
8:
8: --- CC: FastSV6 if SuiteSparse, Boruvka if vanilla:
8: [ FAILED ]
8:   Case karate.mtx:
8:     test_ConnectedComponents.c:112: Check LAGr_ConnectedComponents (&C, G, msg) == 0... failed
8:     test_ConnectedComponents.c:113: Check LAGraph_Vector_Print (C, 2, (stdout), msg) == 0... failed
8:     test_ConnectedComponents.c:66: Check GrB_Vector_size (&n, C) == 0... failed
8: # components:      0 Matrix: karate.mtx
8:     test_ConnectedComponents.c:118: Check ncomponents == ncomp... failed
8:     test_ConnectedComponents.c:120: Check GrB_Vector_nvals (&cnvals, C) == 0... failed
8:     test_ConnectedComponents.c:121: Check cnvals == n... failed
8:     test_ConnectedComponents.c:124: Check LG_check_cc (C, G, msg) == 0... failed
8:
8: ------ CC_BORUVKA:
8:     test_ConnectedComponents.c:138: Check LG_CC_Boruvka (&C2, G, msg) == 0... failed
8:     test_ConnectedComponents.c:66: Check GrB_Vector_size (&n, C) == 0... failed
8:     test_ConnectedComponents.c:140: Check [ERROR]: Failure in address translation: shared bit wasn't set.
8:      addr_in=0x20, addr=0x20
8: EXCEPTION!
8: ThreadID=0
8: HW ThreadID=0x1
8: Thread using HW ThreadID
8: ThreadletState=Service request
8: ThreadletException=5=Address
8:       Exception cause string: Translation failure
8: ExecutionType=7
8: Current Instruction:
8: 801073c2     LDE:    iToken=172      iLength=3       nibbles=b7d000
8: Threadlet TCB Data:
8: TCB.(TPC)=(0x801073c2) (32 bits each)
8: TCB.(D,D2)=(1,1) (one bit each)
8: TCB.A2=1
8: TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
8: TCB.AID=0x1 (8 bits)
8: TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
8: TCB.M=0 (one bit)
8:
8: Threadlet State Registers
8: TCB0: 0x000cffff74000200
8: TCB1: 0x00000000801073c2
8:
8: Threadlet Data Registers
8: A: 0x20=32
8: A2: 0x1800002000060c0=108086399646851264
8: Format: signed decimal, unsigned decimal, hex
8: D: 720,  720, 0x2d0
8: D2: 108086403941841112, 108086403941841112, 0x18000030000b8d8
8: E[0] (Live): 108086406089284896, 108086406089284896, 0x180000380001d20
8: E[1] (Live): 0, 0, 0x0
8: E[2] (Live): -3, 18446744073709551613, 0xfffffffffffffffd
8: E[3] (Live): 108086393204401568, 108086393204401568, 0x1800000800065a0
8: E[4] (Live): 1, 1, 0x1
8: E[5] (Live): 0, 0, 0x0
8: E[6] (Live): 0, 0, 0x0
8: E[7] (Live): 1, 1, 0x1
8: E[8] (Live): 108086393204572816, 108086393204572816, 0x180000080030290
8: E[9] (Live): 0, 0, 0x0
8: E[10] (Live): 0, 0, 0x0
8: E[11] (Live): 108086399646851264, 108086399646851264, 0x1800002000060c0
8: E[12] (Live): 36028814198835496, 36028814198835496, 0x80000400000928
8: E[13] (Live): 108086406089285440, 108086406089285440, 0x180000380001f40
8: E[14] (Live): 108086406089285472, 108086406089285472, 0x180000380001f60
8: E[15] (Live): 108086406089285456, 108086406089285456, 0x180000380001f50
8:
8: Other Useful Data
8: Fence Counter=0
8: Source Node=0
8: Dest Node=-1
1/1 Test #8: ctest_ConnectedComponents ........***Failed    6.11 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   6.11 sec

The following tests FAILED:
          8 - ctest_ConnectedComponents (Failed)
Errors while running CTest
Output from these tests are in: /net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

skuntz commented 1 year ago

Do I remember correctly that Connected Components requires a user-defined type? Could that be part of the failure?

jamesETsmith commented 1 year ago

Good memory. It does require a user-defined select op: https://github.com/emusolutions/LAGraph/blob/ab0d521d1a30746f75014470bc4517a6d60d920e/src/algorithm/LG_CC_Boruvka.c#L72-L77

I'll leave this issue open for future reference until we implement user-defined select ops.

jamesETsmith commented 1 year ago

Interestingly, I compiled the non-vanilla version of LAGr_ConnectedComponents, which uses LG_CC_FastSV6, since it doesn't rely on user-defined operations. However, the test crashes almost immediately, and I think it's because we don't implement certain variants of GxB_Matrix_unpack_CSC. This might be worth looking into, because I think it would be easier to implement the GxB_Matrix_unpack_CSC variants than user-defined operations, and it would yield a faster method anyway.

Here's a record of the failures and the eventual crash when running this test with LG_CC_FastSV6 as the backend method:

 ctest --test-dir build_lc2 -R ConnectedComp -V
Internal ctest changing into directory: /net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc2
UpdateCTestConfiguration  from :/net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc2/DartConfiguration.tcl
UpdateCTestConfiguration  from :/net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc2/DartConfiguration.tcl
Test project /net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc2
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 8
    Start 8: ctest_ConnectedComponents

8: Test command: /usr/bin/cmake "-E" "env" "/tools/lucata/bin/emusim.x" "--forward_return_value" "--" "test_ConnectedComponents" "--no-exec"
8: Test timeout computed to be: 10000000
8:
8:         SystemC 2.3.3-Accellera --- Apr 21 2023 11:46:50
8:         Copyright (c) 1996-2018 by all Contributors,
8:         ALL RIGHTS RESERVED
8: Test cc...                                      [ERROR]: Failure in address translation: shared bit wasn't set.
8:      addr_in=0x20, addr=0x20
8: EXCEPTION!
8: ThreadID=0
8: HW ThreadID=0x1
8: Thread using HW ThreadID
8: ThreadletState=Service request
8: ThreadletException=5=Address
8:       Exception cause string: Translation failure
8: ExecutionType=7
8: Current Instruction:
8: 801072a2     LDE:    iToken=172      iLength=3       nibbles=b7d000
8: Threadlet TCB Data:
8: TCB.(TPC)=(0x801072a2) (32 bits each)
8: TCB.(D,D2)=(1,1) (one bit each)
8: TCB.A2=1
8: TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
8: TCB.AID=0x1 (8 bits)
8: TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
8: TCB.M=0 (one bit)
8:
8: Threadlet State Registers
8: TCB0: 0x000cffff74000200
8: TCB1: 0x00000000801072a2
8:
8: Threadlet Data Registers
8: A: 0x20=32
8: A2: 0x1800000800041b0=108086393204392368
8: Format: signed decimal, unsigned decimal, hex
8: D: 720,  720, 0x2d0
8: D2: 108086401794340760, 108086401794340760, 0x180000280007798
8: E[0] (Live): 108086393204395088, 108086393204395088, 0x180000080004c50
8: E[1] (Live): 0, 0, 0x0
8: E[2] (Live): -3, 18446744073709551613, 0xfffffffffffffffd
8: E[3] (Live): 108086393204403584, 108086393204403584, 0x180000080006d80
8: E[4] (Live): 1, 1, 0x1
8: E[5] (Live): 0, 0, 0x0
8: E[6] (Live): 0, 0, 0x0
8: E[7] (Live): 1, 1, 0x1
8: E[8] (Live): 108086393204498080, 108086393204498080, 0x18000008001dea0
8: E[9] (Live): 0, 0, 0x0
8: E[10] (Live): 0, 0, 0x0
8: E[11] (Live): 108086393204392368, 108086393204392368, 0x1800000800041b0
8: E[12] (Live): 36028814198833824, 36028814198833824, 0x800004000002a0
8: E[13] (Live): 108086393204395632, 108086393204395632, 0x180000080004e70
8: E[14] (Live): 108086393204395664, 108086393204395664, 0x180000080004e90
8: E[15] (Live): 108086393204395648, 108086393204395648, 0x180000080004e80
8:
8: Other Useful Data
8: Fence Counter=0
8: Source Node=0
8: Dest Node=-1
1/1 Test #8: ctest_ConnectedComponents ........***Failed    2.10 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   2.10 sec

The following tests FAILED:
          8 - ctest_ConnectedComponents (Failed)
Errors while running CTest
Output from these tests are in: /net/hyper120h-d/data/jsmith/apps/LAGraph/build_lc2/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely

Here's the followup "debugging" analysis:

❯ /tools/lucata/bin/gossamer64-objdump -xD build_lc2/src/test/test_ConnectedComponents > cc.objdump
❯ grep 801072a2 cc.objdump
    40083951:   801072a2:       LDE     7
❯ gdb -q build_lc/src/test/test_ConnectedComponents
Reading symbols from build_lc/src/test/test_ConnectedComponents...(no debugging symbols found)...done.
(gdb) x/i 0x40083951
   0x40083951 <@GrB_Matrix_assign_INT32+137>:   push   %ds

jamesETsmith commented 1 year ago

As a correction, the non-vanilla version only compiled because the actual code for the non-vanilla implementation was inside #ifdef LAGRAPH_SUITESPARSE blocks, so it wasn't actually getting compiled. Once I remove those guards, I run into a compilation error because we're missing GxB_MIN_SECONDI_INT64. We already have GxB_MIN_SECOND_INT64, so I'm going to scope out the difference between the two semirings. If it's easy to add, I'll do that to see how much farther we get through compilation.