caetanosauer / zero

Fork of the Shore-MT storage manager used by the research project Instant Recovery
Other
29 stars 10 forks source link

Log buffer consolidation array (Aether) concurrency bug #22

Closed caetanosauer closed 8 years ago

caetanosauer commented 8 years ago

Following assertion fails under high contention:

(In method ConsolidationArray::replace_active_slot) assertion failure: _active_slots[active_index]->vthis()->count > SLOT_AVAILABLE

  1. error in /home/csauer/hp/zero/src/sm/log_carray.cpp:190 Assertion failed

[I've encountered this bug a long time ago -- adding issue now because I'm about to fix it and I would like to document it]

caetanosauer commented 8 years ago

Below is what I think is causing the bug -- will attempt fix later.

In order to join a CArray slot at log insertion, the inserting thread must first find an available slot at random. This search, together with all slot management functions, is performed without concurrency control under the assumption that all races are resolved with atomic operations on the slot values themselves. However, the following race condition has been overlooked.

Finding a random slot for joining happens in a nested loop (method ConsolidationArray::join_slot), where the outer loop grabs an available slot (counter >= 0) and the inner loop attempts to increment the counter and fetch the old value with a CAS. If the inner loop fails on the CAS, two outcomes are possible:

(1) the new value is negative, which means slot is not available anymore -- break to outer loop to find a new slot; or (2) the new value is still positive, which means the slot is still available but another thread incremented before us -- continue in the inner loop

In case 2 above, it may happen that the slot is available but it has actually gone through a whole grab-release cycle, so that despite the slot being valid, the index on the slot array has changed. This is similar to the classical ABA problem. The bug happens because we return the index to the caller, which uses it in replace_active_slot instead of the pointer to the slot.

Possible solutions I can think of are:

caetanosauer commented 8 years ago

Fixed bug using second option above -- just use pointer to slot instead of pointer+index. The minor drawback is that replace_active_slot now has to search the small array of active slot pointers for a match, whereas it would know exactly which pointer to replace using the index number. But this should be a few instructions only -- so no reason to worry.

Fix can be verified by running the stress test "stress_carray" with many threads and a long time, which after this commit works without any failing assertions.

caetanosauer commented 8 years ago

Saw bug happening again... so reopening

assertion failure: old_count != 0 || _active_slots[idx] == info

  1. error in /home/csauer/hp/zero/src/sm/log_carray.cpp:112 Assertion failed called from: 0) /home/csauer/hp/zero/src/sm/log_carray.cpp:112