MPLLang / mpl

The MaPLe compiler for efficient and scalable parallel functional programming

bugfix: clear work-stealing queue after returning to scheduler #178

Closed shwestrick closed 4 months ago

shwestrick commented 9 months ago

Stumbled on this today.

Context: we use the work-stealing queue as additional roots for collections and use top/bottom indices of the queue to control local scopes for hierarchical collections. Whenever a processor transitions from working to stealing, we need to "reset" the queue, which involves two updates:

  1. the contents of the queue need to be cleared (i.e. set to NONE/NULL), so that we do not continue to use the old contents as additional roots, and
  2. the top/bottom indices need to be set to 1, matching the scheduler thread's perspective on the heap hierarchy.

For (1), the subtlety is that we use an ABP deque. With this kind of deque, stealing doesn't actually modify the underlying array. (There is a whole world of concurrency pain if you try to do so.) So, instead, we need to find safe points where we can modify the entire underlying array to clear out old elements.

The gist of this patch: There appears to have been an edge case where the scheduler did not clear the work-stealing queue appropriately. This patch fixes that: now, when transitioning from working to stealing and returning to the scheduler, we unconditionally do a full clear of the queue. (It's possible that this causes double-clears to happen. But the queues aren't that big. And, the number of double-clears will be bounded by the number of successful steals, which in general is fairly small.)

This seems to fix #156, which is pleasantly surprising. What a long-standing bug! Need to do a bit more testing first to be sure.


Side note: this is subtle. One way of reasoning about this is to think of the work-stealing queues as being associated with leaf tasks. I.e., imagine allocating a fresh queue at every successful steal and throwing away the current queue whenever a processor makes the transition from working to stealing. In this perspective, resetting the queue is equivalent to throwing away the old queue and then allocating a fresh queue in its place. Typical work-stealing implementations "optimize" this by skipping the deallocation/reallocation, and just reuse the old queue as the new queue. And, this "optimization" is really straightforward for typical work-stealing implementations, because reusing the queue just... works.

However, in our setting, we can't just reuse the queue with no additional work, because the state of the queue is used additionally for hierarchical heap management. We have to be more careful to properly reset the queue before the queue can be reused.

shwestrick commented 9 months ago

One thing I want to look into, before merging: perhaps the clear() should instead always be done immediately before calling returnToSched()? This would help eliminate double-clears. But more importantly, I'm thinking about what could happen if we get unlucky and a local GC occurs immediately after switching to the scheduler thread, but before the scheduler thread performs the clear(). In that window, the GC would still trace the stale queue contents as roots, which is not correct.

It's possible that we may need to extend the thread switching logic to handle this properly. When switching away from a thread, we can record (in the thread object) whether or not deque maintenance is needed upon resuming the thread, and also what that deque maintenance should be, e.g., clearing the deque contents and adjusting the top/bottom indices. Then GC_switchToThread can do this work at the appropriate time, guaranteeing that a GC cannot occur at the wrong moment.

shwestrick commented 4 months ago

This is no longer needed -- a better fix was implemented in d1646cf and merged as part of #180.