emusolutions / LAGraph

This is a library plus a test harness for collecting algorithms that use the GraphBLAS

`ctest_TriangleCount` LC Crash #13

Closed jamesETsmith closed 1 year ago

jamesETsmith commented 1 year ago

General

This is a follow-up to #4.

Details

Here's the complete error output:


26: Test command: /usr/bin/cmake "-E" "env" "/tools/lucata/bin/emusim.x" "--forward_return_value" "--" "test_TriangleCount" "--no-exec"
26: Test timeout computed to be: 10000000
26:
26:         SystemC 2.3.3-Accellera --- Apr 21 2023 11:46:50
26:         Copyright (c) 1996-2018 by all Contributors,
26:         ALL RIGHTS RESERVED
26: Test TriangleCount_Methods1...                  Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.[ OK ]
26: Test TriangleCount_Methods2...                  Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.[ OK ]
26: Test TriangleCount_Methods3...                  Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.[ OK ]
26: Test TriangleCount_Methods4...                  Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.[ OK ]
26: Test TriangleCount_Methods5...                  Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.[ OK ]
26: Test TriangleCount_Methods6...                  Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.[ OK ]
26: Test TriangleCount...                           Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.[ FAILED ]
26:   test_TriangleCount.c:275: Check ntriangles == 45... failed
26: Test TriangleCount_many...                      [ERROR]: Failure in address translation: addr larger than total system bytes.
26:     addr_in=0x180080888172e20, total_system_bytes=0x1000000000
26: EXCEPTION!
26: ThreadID=288201
26: HW ThreadID=0x34ffc8de18d
26: Thread using HW ThreadID
26: ThreadletState=Service request
26: ThreadletException=5=Address
26:      Exception cause string: Translation failure
26: ExecutionType=13
26: Current Instruction:
26: 800c0be1    ADDM:   iToken=16       iLength=2       nibbles=120000
26: Threadlet TCB Data:
26: TCB.(TPC)=(0x800c0be1) (32 bits each)
26: TCB.(D,D2)=(1,0) (one bit each)
26: TCB.A2=1
26: TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
26: TCB.AID=0x1 (8 bits)
26: TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
26: TCB.M=0 (one bit)
26:
26: Threadlet State Registers
26: TCB0: 0x0009f07f64000200
26: TCB1: 0x00000000800c0be1
26:
26: Threadlet Data Registers
26: A: 0x180080888172e20=108095223792872992
26: A2: 0x80000000860848=36028797027747912
26: Format: signed decimal, unsigned decimal, hex
26: D: 1,  1, 0x1
26: D2: 0, 0, 0x0
26: E[0] (Live): 108086401794390976, 108086401794390976, 0x180000280013bc0
26: E[1] (Live): 108086393205368336, 108086393205368336, 0x1800000800f2610
26: E[2] (Live): 1183, 1183, 0x49f
26: E[3] (Live): 1281, 1281, 0x501
26: E[4] (Live): 99, 99, 0x63
26: E[5] (Live): 108086393205434288, 108086393205434288, 0x1800000801027b0
26: E[6] (Live): 108086393205368336, 108086393205368336, 0x1800000800f2610
26: E[7] (Inactive): 0, 0, 0x0
26: E[8] (Inactive): 0, 0, 0x0
26: E[9] (Inactive): 0, 0, 0x0
26: E[10] (Inactive): 0, 0, 0x0
26: E[11] (Inactive): 0, 0, 0x0
26: E[12] (Live): 108086393205368336, 108086393205368336, 0x1800000800f2610
26: E[13] (Live): 108086401794390984, 108086401794390984, 0x180000280013bc8
26: E[14] (Live): 0, 0, 0x0
26: E[15] (Live): 1204, 1204, 0x4b4
26:
26: Other Useful Data
26: Fence Counter=0
26: Source Node=0
26: Dest Node=-1
1/1 Test #26: ctest_TriangleCount ..............***Failed  199.58 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 199.58 sec

The following tests FAILED:
         26 - ctest_TriangleCount (Failed)
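
As a quick sanity check on the numbers in the `Failure in address translation` message, here is a minimal sketch that repeats the comparison with the two values copied from the log. The helper `in_range` is hypothetical and only does a plain numeric bounds check; it is not the simulator's actual translation logic, which may interpret parts of the address differently.

```python
# Values copied verbatim from the emusim error output.
ADDR_IN = 0x180080888172e20          # also the value of register A at the crash
TOTAL_SYSTEM_BYTES = 0x1000000000    # 2**36 bytes, i.e. 64 GiB


def in_range(addr: int, total_bytes: int) -> bool:
    """Hypothetical helper: plain bounds check, NOT emusim's real translation."""
    return 0 <= addr < total_bytes


if __name__ == "__main__":
    print(f"addr_in            = {ADDR_IN:#x} ({ADDR_IN})")
    print(f"total_system_bytes = {TOTAL_SYSTEM_BYTES:#x} "
          f"({TOTAL_SYSTEM_BYTES >> 30} GiB)")
    print("addr_in within total_system_bytes:",
          in_range(ADDR_IN, TOTAL_SYSTEM_BYTES))
```

The decimal value of `ADDR_IN` matches the `A:` register line in the dump, so the faulting address is the one the `ADDM` instruction was using when the exception was raised.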

Here's how I manually traced the TPC at the time of the crash:

❯ /tools/lucata/bin/gossamer64-objdump -xD build_lc/src/test/test_TriangleCount > tc.objdump
❯ grep 800c0be1 tc.objdump
    400605f1:   800c0be1:       ADDM
❯ gdb -q build_lc/src/test/test_TriangleCount
Reading symbols from build_lc/src/test/test_TriangleCount...(no debugging symbols found)...done.
(gdb) x/i 0x400605f1
   0x400605f1 <@_Z17matrix_extractCSRIbE8GrB_InfoPmS1_PT_S1_P16GB_Matrix_opaque.outline_.ls1.1.cilkhelper+363>: das
(gdb) demangle -l c++ _Z17matrix_extractCSRIbE8GrB_InfoPmS1_PT_S1_P16GB_Matrix_opaque
GrB_Info matrix_extractCSR<bool>(unsigned long*, unsigned long*, bool*, unsigned long*, GB_Matrix_opaque*)
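
The lookup above can be sketched as a small script. This is only an illustration under stated assumptions: it assumes the objdump lines have the `address: encoding: mnemonic` layout shown in the `grep` output, and that the mangled name is whatever follows the optional `@` and precedes the first `.` in the gdb symbol. The function names `find_tpc` and `symbol_from_gdb` are hypothetical.

```python
import re

# Assumed objdump line layout, e.g. "    400605f1:   800c0be1:       ADDM"
OBJDUMP_LINE = re.compile(r"^\s*([0-9a-fA-F]+):\s+([0-9a-fA-F]+):\s+(\S+)")

# Assumed gdb symbol layout, e.g. "<@_Zname.outline_.ls1.1.cilkhelper+363>"
GDB_SYMBOL = re.compile(r"<@?([^.+>]+)[^+>]*(?:\+(\d+))?>")


def find_tpc(lines, tpc):
    """Return (address, mnemonic) for every objdump line whose encoding column equals tpc."""
    hits = []
    for line in lines:
        m = OBJDUMP_LINE.match(line)
        if m and m.group(2).lower() == tpc.lower():
            hits.append((int(m.group(1), 16), m.group(3)))
    return hits


def symbol_from_gdb(line):
    """Return (mangled_name, offset) from a gdb `x/i` line, or None if no symbol is found."""
    m = GDB_SYMBOL.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2) or 0)


if __name__ == "__main__":
    dump = ["    400605f1:   800c0be1:       ADDM"]
    gdb_line = ("   0x400605f1 <@_Z17matrix_extractCSRIbE8GrB_InfoPmS1_PT_S1_"
                "P16GB_Matrix_opaque.outline_.ls1.1.cilkhelper+363>: das")
    for addr, mnemonic in find_tpc(dump, "800c0be1"):
        print(f"{mnemonic} at {addr:#x}")
    print(symbol_from_gdb(gdb_line))
```

The extracted mangled name can then be passed to gdb's `demangle` command as in the session above.
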

mcordery commented 1 year ago

Does this only fail on hw when you're running multinode?

jamesETsmith commented 1 year ago

@mcordery no, it crashes on a single node too.

mcordery commented 1 year ago

@jamesETsmith it seems to work for x86. How are you building for GC these days?

jamesETsmith commented 1 year ago

@mcordery the README.md is up-to-date with instructions (I hope, lmk if they aren't). Here's the condensed version:

# Build LucataGraphBLAS for LC
cmake -B build_lc <other cmake args>    # configure
cmake --build build_lc --parallel 16    # build
cmake --build build_lc --target install # installs LGB in build_lc/install

# Build LAGraph against LucataGraphBLAS for LC
cmake -B build_lc -DGRAPHBLAS_ROOT=/path/to/LucataGraphBLAS/build_lc/install \
    -DCMAKE_C_COMPILER=/tools/lucata/bin/emu-cc.sh \
    -DCMAKE_CXX_COMPILER=/tools/lucata/bin/emu-cc.sh
cmake --build build_lc --parallel 16

Just an FYI: all the tests except test_ConnectedComponents work on x86.

mcordery commented 1 year ago

That's what I figured. I was running other LC tests with emusim fine, but TriangleCount kept giving me a 'could not fork' error.

jamesETsmith commented 1 year ago

@mcordery are you using the latest version of our LAGraph? The 'could not fork' problem should be fixed (#7).

mcordery commented 1 year ago

I did a pull on it. Guess I should just torch it and do a clean build.

jamesETsmith commented 1 year ago

If that doesn't fix it, just let me know and we can move this discussion to Slack and start troubleshooting.