emusolutions / LAGraph

This is a library plus a test harness for collecting algorithms that use the GraphBLAS.

LC demo multinode crash #16

Closed jamesETsmith closed 1 year ago

jamesETsmith commented 1 year ago

Summary

During benchmarking this morning, I noticed that the bfs_demo crashes when run on multiple nodes in the simulator, but not on a single node. I don't think this is related to #15.

Details

TL;DR: we're crashing in a lambda inside LGB::matrix_multiply<>.

Here's the crash report:

 /tools/lucata/bin/emusim.x --total_nodes 2 -- ./build_lc_e23a5bf/src/benchmark/bfs_demo /net/bigtwin-d/data/graph500/graph500-scale10.mtx

        SystemC 2.3.3-Accellera --- Apr 21 2023 11:46:50
        Copyright (c) 1996-2018 by all Contributors,
        ALL RIGHTS RESERVED
Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.
[ERROR]: Failure in address translation: addr larger than total system bytes.
        addr_in=0xe80002800002dc0, total_system_bytes=0x2000000000
EXCEPTION!
ThreadID=21168
HW ThreadID=0x195f30897173
Thread using HW ThreadID
ThreadletState=Service request
ThreadletException=5=Address
         Exception cause string: Translation failure
ExecutionType=7
Current Instruction:
805351dc        LDE:    iToken=172      iLength=3       nibbles=b1d000
Threadlet TCB Data:
TCB.(TPC)=(0x805351dc) (32 bits each)
TCB.(D,D2)=(1,0) (one bit each)
TCB.A2=1
TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
TCB.AID=0x1 (8 bits)
TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
TCB.M=0 (one bit)

Threadlet State Registers
TCB0: 0x000bffff64000200
TCB1: 0x00000000805351dc

Threadlet Data Registers
A: 0xe80002800002dc0=1044835285348658624
A2: 0x800000005b8870=36028797024962672
Format: signed decimal, unsigned decimal, hex
D: 108086466218939976,  108086466218939976, 0x18000118001d648
D2: 0, 0, 0x0
E[0] (Live): 108086466218819600, 108086466218819600, 0x180001180000010
E[1] (Live): 49, 49, 0x31
E[2] (Live): 49, 49, 0x31
E[3] (Live): 108086466218939584, 108086466218939584, 0x18000118001d4c0
E[4] (Live): 108086393204486360, 108086393204486360, 0x18000008001b0d8
E[5] (Live): 36028814198837368, 36028814198837368, 0x80000400001078
E[6] (Live): 108086393204430704, 108086393204430704, 0x18000008000d770
E[7] (Live): 36028814198833952, 36028814198833952, 0x80000400000320
E[8] (Live): 108086393204392768, 108086393204392768, 0x180000080004340
E[9] (Live): 2153494866, 2153494866, 0x805bb952
E[10] (Live): 2153494866, 2153494866, 0x805bb952
E[11] (Live): 2148281592, 2148281592, 0x800c2cf8
E[12] (Live): 1044835285348658624, 1044835285348658624, 0xe80002800002dc0
E[13] (Live): 108086466218940608, 108086466218940608, 0x18000118001d8c0
E[14] (Live): 108086466218939584, 108086466218939584, 0x18000118001d4c0
E[15] (Live): 108086466219063224, 108086466219063224, 0x18000118003b7b8

Other Useful Data
Fence Counter=0
Source Node=1
Dest Node=-1
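
As a quick sanity check on the numbers in the report above, the faulting address can be compared against the reported system size. This is only a minimal sketch in Python: the helper name `check_address` is made up for illustration, and the two constants are copied from the simulator's error message.

```python
# Values copied verbatim from the simulator's "[ERROR]" line.
ADDR_IN = 0xE80002800002DC0
TOTAL_SYSTEM_BYTES = 0x2000000000


def check_address(addr: int, total_bytes: int) -> bool:
    """Hypothetical helper: True if addr lies inside [0, total_bytes)."""
    return 0 <= addr < total_bytes


if __name__ == "__main__":
    # Size of the simulated system in GiB (2**30 bytes each).
    print(f"total_system_bytes = {TOTAL_SYSTEM_BYTES / 2**30:.0f} GiB")
    # Number of bits needed to represent the raw faulting address.
    print(f"addr_in bit length = {ADDR_IN.bit_length()}")
    print(f"in range: {check_address(ADDR_IN, TOTAL_SYSTEM_BYTES)}")
```

The raw value needs 60 bits, while the system size is 2^37 bytes (128 GiB), so the address is far outside the range the translator accepts. The same value appears in registers A and E[12] in the dump. The report alone does not say how the pointer ended up with that value.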

Manual debugging:

 grep 805351dc bfs.objdump
    4029a8ee:   805351dc:       CMPE    1
 gdb -q build_lc_e23a5bf/src/benchmark/bfs_demo
Reading symbols from build_lc_e23a5bf/src/benchmark/bfs_demo...(no debugging symbols found)...done.
(gdb) x/i 0x4029a8ee
   0x4029a8ee <@_ZZZN3LGB15matrix_multiplyINS_5bytesILi8EEENS1_ILi1EEES2_NS_25generic_binary_op_functorES4_EE8GrB_InfoRNS_6MatrixERKS6_S9_T2_T3_RKT_ENKUlmE_clEmENKUlmE_clEm.cilkhelper+1007>:
    mov    $0xdf,%cl
(gdb) demangle -l c++ _ZZZN3LGB15matrix_multiplyINS_5bytesILi8EEENS1_ILi1EEES2_NS_25generic_binary_op_functorES4_EE8GrB_InfoRNS_6MatrixERKS6_S9_T2_T3_RKT_ENKUlmE_clEmENKUlmE_clEm
LGB::matrix_multiply<LGB::bytes<8>, LGB::bytes<1>, LGB::bytes<8>, LGB::generic_binary_op_functor, LGB::generic_binary_op_functor>(LGB::Matrix&, LGB::Matrix const&, LGB::Matrix const&, LGB::generic_binary_op_functor, LGB::generic_binary_op_functor, LGB::bytes<8> const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const::{lambda(unsigned long)#1}::operator()(unsigned long) const
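
The lookup above (grep the faulting PC in the objdump, then hand the matching address to gdb) could be scripted roughly as follows. This is a sketch under assumptions: the line layout is inferred from the single grep hit shown in this issue, and `lookup_gdb_address` is a hypothetical helper, not part of any Lucata tool.

```python
import re
from typing import Iterable, Optional

# Assumed line shape, taken from the one grep hit in this issue:
#     4029a8ee:   805351dc:       CMPE    1
LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+):\s+([0-9a-fA-F]+):\s+\S+")


def lookup_gdb_address(objdump_lines: Iterable[str], pc_hex: str) -> Optional[str]:
    """Return the first-column address whose second column equals pc_hex."""
    pc = pc_hex.lower()
    if pc.startswith("0x"):
        pc = pc[2:]
    for line in objdump_lines:
        match = LINE_RE.match(line)
        if match and match.group(2).lower() == pc:
            return match.group(1).lower()
    return None  # PC not found in the dump


if __name__ == "__main__":
    sample = ["    4029a8ee:   805351dc:       CMPE    1"]
    addr = lookup_gdb_address(sample, "805351dc")
    if addr is not None:
        # Command to paste into gdb to find the enclosing symbol.
        print(f"x/i 0x{addr}")
```

With a real dump, the list would be replaced by the lines of bfs.objdump; the mangled symbol that gdb prints can then be demangled as shown above.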

jamesETsmith commented 1 year ago

An update on this: I only see LC crashes when simulating with 2 nodes; I do not see them when simulating with 4, 8, or 16. :exploding_head:

jamesETsmith commented 1 year ago

This should be fixed with the latest LGB (71e4298).