Closed by jamesETsmith 1 year ago
From what I've seen so far, it looks like multiple threads have called `LGB::nonzero_block_list_ref<false>::merge_blocks` with the same `rhs_ptr`.
Thanks @tdysart, do you think this is distinct from the problem Matthew's been seeing?
At @mcordery's suggestion, I tried checking whether the pointer passed to `free_block` was null before freeing (rather than freeing unconditionally):

```cpp
template <>
void nonzero_block_list_ref<false>::free_block(nonzero_block *ptr) {
    if (ptr != nullptr) {
        free(reinterpret_cast<void *>(ptr));
    }
}
```
This leads to a new runtime error on the simulator. Not sure if it's before or after the original error.
```
[ERROR]: Failure in address translation: shared bit wasn't set.
addr_in=0x30000010012a228, addr=0x30000010012a228
EXCEPTION!
ThreadID=4119
HW ThreadID=0xea770534c42
Thread using HW ThreadID
ThreadletState=Service request
ThreadletException=5=Address
Exception cause string: Translation failure
ExecutionType=8
Current Instruction:
80c26caa WRD: iToken=134 iLength=2 nibbles=980000
Threadlet TCB Data:
TCB.(TPC)=(0x80c26caa) (32 bits each)
TCB.(D,D2)=(1,1) (one bit each)
TCB.A2=1
TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
TCB.AID=0x1 (8 bits)
TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
TCB.M=0 (one bit)
Threadlet State Registers
TCB0: 0x000cffff74000200
TCB1: 0x0000000080c26caa
```
The `addr_in` in the second line looks like an address where two pointers have been added together.
Just out of curiosity, I spent some time doing checks on hardware and with clang. With clang I ran a number of the -fsanitize checks and nothing popped up; neither did updating valgrind to the latest version and running under it. I also tried dialing the optimization level back to -O1 with -g in the hope that something would shake out, but no luck. Oddly, compiling with -O0 crashed the compiler (floating point exception...).
I had to change strategies to work on @mcordery's executable, but it did show up as two threads executing the same `merge_blocks` function with the same `rhs_ptr`. The `lhs_ptr` does differ between the two threads.
Does the MIT-based OpenCilk toolchain still have any of the Cilk tools it previously did? IIRC, they had some sanitizing and race-detection tools (we never ported them over to our approach) -- might be the wrong tree, though...
I tried some simulator tweaks to track the addresses of the failures I saw; however, it seemed like every run ended up with a different address at fault. And now I've tripped across the same bad-address issue that @jamesETsmith saw yesterday (basically a view 3 pointer without the shared bit set -- looks like two view 1's that got added together)... and the TPC for the bad-address issue ended up in malloc instead of free...
The lack of repeatability here (in both test mwx's) is disconcerting...is there anything in LGB that would likely influence/cause that?
I just made some progress on this after talking with @mcordery, whose problem was caused by a missing call to `sort_and_merge_allrows()`. It looks like I've got some matrices that aren't sorted but should be. I've gotten things running by liberally sprinkling `sort_and_merge_allrows()` throughout LGB, but now I need to figure out where it's actually missing.
I've also noticed that half of the runs dump info and the others crash with a more terse exception (might not be the right word here). Now that I'm pretty sure this is a problem with unsorted data within the matrix, my best guess is that our troublesome unsorted matrix gets generated/modified in parallel, leading to a variety of different states before it finally goes to `matrix_multiply` (where the error shows up). I'll keep you posted on the progress.
Sounds good.
Yeah, emusim is definitely not dumping enough info on the runtime free error - will be fixing that in the very near future (need to make sure other runtime errors dump an appropriate amount of info as well).
### General
I'm seeing an error (`double free or corruption detected`) when running LAGraph's bfs benchmark (`bfs_demo`) with certain matrices. The code runs fine when I pass it the smaller matrix `west0067.mtx`.

### Details
Running on `n0` on condor6, where all nodes are configured as single nodes.

Here's the output after the first [FATAL] signal (not sure if this counts as "interesting", but I don't have any better ideas):

Here's the full error from `mn_exec_sys.67394.log`:

### Simulator