Open d-netto opened 1 day ago
Yes, good catch, there appears to be a missing re-assignment of `old = -1;` at the end of that loop, which means that in the ABA case we accidentally acquire the lock on the thread despite not actually having stopped the thread; or, in the counter-case, we try to run through this logic with `old == -1` on the next iteration, and that isn't valid either (`jl_thread_suspend_and_get_state` should return failure and the loop will abort too early).
We've been using `jl_record_backtrace` extensively in production to print backtraces when one of our servers hits a state of degraded performance or deadlock.

We're afraid, however, that there could be an ABA-like bug in `jl_record_backtrace` which may lead us to miss a task for which we're attempting to record the backtrace.

Here is the code for reference:
The case which I think may be pathological is the following:

1. Task `t` is initially scheduled on thread 1. The compare-and-swap at the `while` loop will fail and `old` will get a value of 1.
2. We call `jl_thread_suspend_and_get_state`, but let's say that task `t` was faster than us and got rescheduled in thread 2.
3. The check `t->tid == 1` fails, and we will resume thread 1.
4. `t` is again faster than us, and very quickly got migrated from thread 2 to thread 1.
5. Now `t->tid == 1`, and `old == 1`. The compare-and-swap succeeds even though this task is still running in a thread, which will cause us to leave early from the `while` loop.

Is the case I'm describing here indeed pathological? Are there any invariants that I could be missing that will make this scenario impossible to occur?
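
To make the interleaving above concrete, here is a minimal single-threaded sketch in C11. It is not the actual `jl_record_backtrace` code: `demo_record`, `demo_result_t`, and the scripted "migrations" are hypothetical stand-ins, and the suspend/resume calls are only marked by comments. It assumes the loop has the shape described above (a compare-and-swap on the task's `tid` with `old` as the expected value) and shows how a stale `old` lets the compare-and-swap succeed, while resetting `old = -1` at the end of each iteration avoids that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes of the simplified loop below. */
typedef enum {
    DEMO_CLAIMED_IDLE,        /* CAS succeeded with expected == -1: task was not running */
    DEMO_SUSPENDED_OK,        /* task was still on the thread we suspended */
    DEMO_STOLE_RUNNING_TASK   /* CAS succeeded with a stale expected value (ABA symptom) */
} demo_result_t;

/* Replays the scenario from the list above with a scripted interleaving.
 * reset_old selects whether `old` is re-assigned to -1 after each iteration. */
static demo_result_t demo_record(bool reset_old)
{
    _Atomic int16_t tid = 1;   /* step 1: the task starts on thread 1 */
    const int16_t self = 0;    /* thread that wants to record the backtrace */
    int16_t old = -1;          /* expected value: "task is not running" */
    int iter = 0;

    while (!atomic_compare_exchange_strong(&tid, &old, self)) {
        assert(iter < 4);      /* the scripted scenario must terminate quickly */
        /* The failed CAS stored the observed thread id in `old`.
         * A real implementation would suspend thread `old` here. */
        if (iter == 0)
            atomic_store(&tid, 2);   /* step 2: task moved before the suspend landed */
        if (atomic_load(&tid) == old)
            return DEMO_SUSPENDED_OK;
        /* step 3: the check failed; a real implementation would resume thread `old`. */
        if (iter == 0)
            atomic_store(&tid, 1);   /* step 4: task migrates back to thread 1 */
        if (reset_old)
            old = -1;                /* the re-assignment under discussion */
        iter++;
    }
    /* The CAS succeeded; that is only legitimate if we expected -1. */
    return old == -1 ? DEMO_CLAIMED_IDLE : DEMO_STOLE_RUNNING_TASK;
}
```

In this sketch, `demo_record(false)` leaves the loop through a successful compare-and-swap with `old == 1` (step 5), whereas `demo_record(true)` fails the compare-and-swap, observes thread 1 again, and only then finds the task on the thread it suspended.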
Thanks in advance.
CC: @vtjnash.