accre / lstore

LStore - A fault-tolerant, performant distributed data storage framework.
http://www.lstore.org
Apache License 2.0

Somehow pigeon coops don't allocate locks #146

Open PerilousApricot opened 7 years ago

PerilousApricot commented 7 years ago

Got this bizarre dump:

#0  __GI___pthread_mutex_lock (mutex=0x8) at pthread_mutex_lock.c:50
#1  0x00007f2f3da53de9 in tbx_pch_reserve (pc=0x7f2f4d384de0)
    at /usr/src/debug/LStore-0.5.1_387_g5dadef4aedabe1ccfc54a08b9147d60b_dev/src/toolbox/pigeon_coop.c:224
#2  0x00007f2f3dc910d7 in gop_init (gop=gop@entry=0x7f2f240008c0)
    at /usr/src/debug/LStore-0.5.1_387_g5dadef4aedabe1ccfc54a08b9147d60b_dev/src/gop/gop.c:690
#3  0x00007f2f3dc96082 in init_opque (q=q@entry=0x7f2f240008c0)
    at /usr/src/debug/LStore-0.5.1_387_g5dadef4aedabe1ccfc54a08b9147d60b_dev/src/gop/opque.c:241
#4  0x00007f2f3dc961ac in gop_opque_new ()
    at /usr/src/debug/LStore-0.5.1_387_g5dadef4aedabe1ccfc54a08b9147d60b_dev/src/gop/opque.c:269
#5  0x00007f2f3dc9e8ab in ongoing_heartbeat_thread (th=<optimized out>, data=0x7f2f4d3eef90)
    at /usr/src/debug/LStore-0.5.1_387_g5dadef4aedabe1ccfc54a08b9147d60b_dev/src/gop/mq_ongoing.c:88
#6  0x00007f2f48a71dc5 in start_thread (arg=0x7f2f28fe9700) at pthread_create.c:308
#7  0x00007f2f4a639ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) frame 1
#1  0x00007f2f3da53de9 in tbx_pch_reserve (pc=0x7f2f4d384de0)
    at /usr/src/debug/LStore-0.5.1_387_g5dadef4aedabe1ccfc54a08b9147d60b_dev/src/toolbox/pigeon_coop.c:224
224     apr_thread_mutex_lock(pc->lock);
(gdb) p pc
$1 = (tbx_pc_t *) 0x7f2f4d384de0
(gdb) p *pc
$2 = {lock = 0x0, pool = 0x7f2f4d390858, nshelves = 1, shelf_size = 50, item_size = 32, check_shelf = 0, nused = 0, 
  name = 0x7f2f3dcb4f56 "gop_control", new_arg = 0x0, ph_shelf = 0x7f2f4d384e40, data_shelf = 0x7f2f4d384e60, 
  new = 0x7f2f3dc95660 <gop_control_new>, free = 0x7f2f3dc957d0 <gop_control_free>}

Not entirely sure how lock is null...
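A side note on why frame #0 reports mutex=0x8 rather than 0x0: that is what you would expect from a NULL pc->lock if the pthread mutex lives 8 bytes into the APR mutex object, since the wrapper passes the address of that member down. A minimal sketch, assuming a layout with a pool pointer ahead of the pthread mutex (mock_apr_mutex_t and null_lock_fault_address are hypothetical names for illustration, not taken from the APR headers):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an APR-style mutex wrapper.
 * Assumption: a pool pointer precedes the underlying pthread mutex. */
typedef struct {
    void *pool;
    pthread_mutex_t mutex;
} mock_apr_mutex_t;

/* Address that pthread_mutex_lock() would receive if the wrapper pointer
 * were NULL: simply the offset of the inner mutex member. */
size_t null_lock_fault_address(void)
{
    size_t off = offsetof(mock_apr_mutex_t, mutex);
    assert(off >= sizeof(void *)); /* the pool pointer comes first */
    return off;
}
```

On a 64-bit build this offset is the size of one pointer, which lines up with the 0x8 in the backtrace.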

tacketar commented 7 years ago

Yes!!!! That makes me feel better. You had several other issues all related to pigeon* routines that I couldn't fathom. I spent several hours looking at them and they all looked impossible. Every race occurred INSIDE the same lock in both threads. So I started looking for unbalanced lock/unlock but didn't find any... and that code isn't very complicated. The only thing I could think of was another memory stomp. Having a NULL lock makes me confident that's the problem.

BTW I've patched all the issues other than these.

tacketar commented 7 years ago

Just a thought but you might add an additional field BEFORE the lock that's not used and mprotect it. At first glance mprotecting the opaque locks doesn't seem feasible but I haven't thought about it much.

PerilousApricot commented 7 years ago

The one hitch is that mprotect only works on page-aligned things. But, since that is a global, it should be easy to set a watchpoint.
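For reference, a minimal sketch of the guard idea under the page-alignment constraint: instead of protecting a field inside the struct, put the whole object on its own page-aligned mapping with a PROT_NONE page directly before it, so a stomp running up from lower addresses faults at the offending write. guarded_alloc, guarded_free and write_before_is_trapped are hypothetical helpers, not part of LStore, and this only catches writes that cross the guard page.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Round size up to a whole number of pages. */
static size_t round_to_pages(size_t size, size_t page)
{
    return (size + page - 1) / page * page;
}

/* Hypothetical helper: returns size usable bytes that start on a page
 * boundary and are immediately preceded by an inaccessible guard page. */
void *guarded_alloc(size_t size)
{
    if (size == 0) return NULL;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t body = round_to_pages(size, page);
    char *base = mmap(NULL, page + body, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    if (mprotect(base, page, PROT_NONE) != 0) {
        munmap(base, page + body);
        return NULL;
    }
    return base + page;
}

/* Releases both the guard page and the body mapped by guarded_alloc(). */
void guarded_free(void *ptr, size_t size)
{
    if (ptr == NULL) return;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int rc = munmap((char *)ptr - page, page + round_to_pages(size, page));
    assert(rc == 0);
    (void)rc;
}

/* Demo: a child process writes one byte below the object.
 * Returns 1 if the child was killed by the guard page, 0 if the write
 * went through, and -1 on setup failure. */
int write_before_is_trapped(size_t size)
{
    volatile char *p = guarded_alloc(size);
    if (p == NULL) return -1;
    pid_t pid = fork();
    if (pid < 0) {
        guarded_free((void *)p, size);
        return -1;
    }
    if (pid == 0) {
        p[-1] = 0x5a; /* stray write just below the object */
        _exit(0);     /* only reached if the guard did not trap */
    }
    int status = 0;
    pid_t w = waitpid(pid, &status, 0);
    guarded_free((void *)p, size);
    if (w != pid) return -1;
    return WIFSIGNALED(status) &&
           (WTERMSIG(status) == SIGSEGV || WTERMSIG(status) == SIGBUS);
}
```

The trade-off is one or two pages per guarded object, so this is probably only practical for a handful of long-lived structures like the coop in the dump, not for every item it hands out.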

It's dark in this basement.