There is a race condition that can cause recv().wait() on an application thread to deadlock.
Repro steps
For a fairly minimal repro case, see the attached files, which are adapted from the "Blocking recv" example at https://frida.re/docs/messages/ . Running it should cause a deadlock after roughly 1 second.
I have also seen the same deadlock occur when running the testcase /GumJS/Script/recv_wait_should_not_leak#V8.
See also the pull request I am creating to fix this race condition, which includes its own specialized testcase.
If you have any difficulty reproducing it, I'll be glad to assist further.
Some diagnosis
The implementation of wait() starts in message-dispatcher.js:
this.wait = function wait() {
while (!completed) // <=== [LINE 1]
engine._waitForEvent();
};
If at [LINE 1] the requested message has already been received, wait() will return without needing to call down into engine._waitForEvent. Everything is fine in this circumstance.
If the requested message has not yet been received, execution continues in gumjs_wait_for_event (code shown for QJS):
At [LINE 2] the code obtains event_mutex, thereby locking in the value of core->event_count. Then it waits on event_cond. Subsequently, when an event posts, it will wake this thread. Everything is fine in this circumstance as well.
The trouble happens if the event posts after the check on [LINE 1] but before the mutex is obtained on [LINE 2]. In that case, the start_count observed on [LINE 3] winds up being the value of event_count after the event posts. Furthermore, g_cond_wait will hang, potentially forever, unless by luck some additional event (that should not have a bearing on this flow) happens to post.
Summary
There is a race condition that can cause
recv().wait()
on an application thread to deadlock.Repro steps
For a fairly minimal repro case, see the attached files, which are adapted from the "Blocking recv" example at https://frida.re/docs/messages/ . Running it should cause a deadlock after roughly 1 second.
Attachment: test1.zip
I have also seen the same deadlock occur when running the testcase
/GumJS/Script/recv_wait_should_not_leak#V8
.See also the pull request I am creating to fix this race condition, which includes its own specialized testcase.
If you have any difficulty reproducing it, I'll be glad to assist further.
Some diagnosis
The implementation of
wait()
starts inmessage-dispatcher.js
:If at
[LINE 1]
the requested message has already been received,wait()
will return without needing to call down intoengine._waitForEvent
. Everything is fine in this circumstance.If the requested message has not yet been received, execution continues in
gumjs_wait_for_event
(code shown for QJS):At
[LINE 2]
the code obtainsevent_mutex
, thereby locking in the value ofcore->event_count
. Then it waits onevent_cond
. Subsequently, when an event posts, it will wake this thread. Everything is fine in this circumstance as well.The trouble happens if the event posts after the check on
[LINE 1]
but before the mutex is obtained on[LINE 2]
. In that case, thestart_count
observed on[LINE 3]
winds up being the value ofevent_count
after the event posts. Furthermore,g_cond_wait
will hang, potentially forever, unless by luck some additional event (that should not have a bearing on this flow) happens to post.