Closed by jamesETsmith 1 year ago
From what I've seen so far, it looks like multiple threads have called `LGB::nonzero_block_list_ref<false>::merge_blocks` with the same `rhs_ptr`.
Thanks @tdysart, do you think this is distinct from the problem Matthew's been seeing?
At @mcordery's suggestion, I tried checking whether the pointer passed to `free_block` was null before freeing (rather than freeing unconditionally):

```cpp
template <>
void nonzero_block_list_ref<false>::free_block(nonzero_block *ptr) {
    if (ptr != nullptr) {
        free(reinterpret_cast<void *>(ptr));
    }
}
```
This leads to a new runtime error on the simulator. Not sure if it's before or after the original error.
```
[ERROR]: Failure in address translation: shared bit wasn't set.
addr_in=0x30000010012a228, addr=0x30000010012a228
EXCEPTION!
ThreadID=4119
HW ThreadID=0xea770534c42
Thread using HW ThreadID
ThreadletState=Service request
ThreadletException=5=Address
Exception cause string: Translation failure
ExecutionType=8
Current Instruction:
80c26caa WRD: iToken=134 iLength=2 nibbles=980000
Threadlet TCB Data:
TCB.(TPC)=(0x80c26caa) (32 bits each)
TCB.(D,D2)=(1,1) (one bit each)
TCB.A2=1
TCB.(TS,TSDATA)=(0,0x0) (two bits, four bits)
TCB.AID=0x1 (8 bits)
TCB.(NaN,U,V,CB,N,Z)=(0, 0, 0, 0, 0, 0)
TCB.M=0 (one bit)
Threadlet State Registers
TCB0: 0x000cffff74000200
TCB1: 0x0000000080c26caa
```
The `addr_in` in the second line looks like an address where two pointers have been added together.
Just out of curiosity, I spent some time doing checks on hardware and with clang. With clang I ran a number of the -fsanitize checks and nothing popped up; neither did updating valgrind to the latest version and running under it. I also tried dialing the optimization level back to -O1 with -g in the hope that something would shake out, but no luck. Oddly, compiling with -O0 crashed the compiler (floating point exception...).
I had to change strategies to work on @mcordery's executable, but it did show up as two threads executing the same `merge_blocks` function with the same `rhs_ptr`. The `lhs_ptr` does differ between the two threads.
Does the MIT-based OpenCilk toolchain still have any of the Cilk tools it previously did? IIRC, they had some sanitizing and race-detection tools (we never ported them over to our approach) -- might be the wrong tree, though...
I tried some simulator tweaks to track the addresses of the failures I saw; however, it seemed like every run ended up with a different address at fault. And now I've tripped across the same bad-address issue that @jamesETsmith saw yesterday (basically a view 3 pointer without the shared bit set -- looks like two view 1's that got added together)... and the TPC for the bad-address issue ended up in malloc instead of free...
The lack of repeatability here (in both test mwx's) is disconcerting...is there anything in LGB that would likely influence/cause that?
I just made some progress on this after talking with @mcordery, whose problem was caused by a missing call to `sort_and_merge_allrows()`. It looks like I've got some matrices that aren't sorted but should be. I've gotten things running by liberally sprinkling `sort_and_merge_allrows()` throughout LGB, but now I need to figure out where it's actually missing.
I've also noticed that half of the runs dump info and the others crash with a more terse exception (might not be the right word here). Now that I'm pretty sure this is a problem with unsorted data within the matrix, my best guess is that our troublesome unsorted matrix gets generated/modified in parallel, leading to a variety of different states before it finally goes to `matrix_multiply` (where the error shows up). I'll keep you posted on the progress.
Sounds good.
Yeah, emusim is definitely not dumping enough info on the runtime free error - will be fixing that in the very near future (need to make sure other runtime errors dump an appropriate amount of info as well).
### General
I'm seeing an error (`double free or corruption detected`) when running LAGraph's bfs benchmark (`bfs_demo`) with certain matrices. The code runs fine when I pass it the smaller matrix `west0067.mtx`.

### Details
Running on `n0` on condor6, where all nodes are configured as single nodes.

Here's the output after the first [FATAL] signal (not sure if this counts as "interesting", but I don't have any better ideas):

Here's the full error from `mn_exec_sys.67394.log`:

### Simulator