frida / frida-gum

Cross-platform instrumentation and introspection library written in C
https://frida.re
Other
753 stars 245 forks source link

recv wait can deadlock on an application thread #807

Open HexKitchen opened 3 months ago

HexKitchen commented 3 months ago

Summary

There is a race condition that can cause recv().wait() on an application thread to deadlock.

Repro steps

For a fairly minimal repro case, see the attached files, which are adapted from the "Blocking recv" example at https://frida.re/docs/messages/ . Running it should cause a deadlock after roughly 1 second.

Attachment: test1.zip

I have also seen the same deadlock occur when running the testcase /GumJS/Script/recv_wait_should_not_leak#V8.

See also the pull request I am creating to fix this race condition, which includes its own specialized testcase.

If you have any difficulty reproducing it, I'll be glad to assist further.

Some diagnosis

The implementation of wait() starts in message-dispatcher.js:

  this.wait = function wait() {
    while (!completed)                    // <=== [LINE 1]
      engine._waitForEvent();
  };

If at [LINE 1] the requested message has already been received, wait() will return without needing to call down into engine._waitForEvent. Everything is fine in this circumstance.

If the requested message has not yet been received, execution continues in gumjs_wait_for_event (code shown for QJS):

...
  g_mutex_lock (&self->event_mutex);     // <=== [LINE 2]

  start_count = self->event_count;       // <=== [LINE 3]
  while (self->event_count == start_count && self->event_source_available)
  {
    ...
      g_cond_wait (&self->event_cond, &self->event_mutex);
    ...
  }

At [LINE 2] the code obtains event_mutex, thereby locking in the value of core->event_count. Then it waits on event_cond. Subsequently, when an event posts, it will wake this thread. Everything is fine in this circumstance as well.

The trouble happens if the event posts after the check on [LINE 1] but before the mutex is obtained on [LINE 2]. In that case, the start_count observed on [LINE 3] winds up being the value of event_count after the event posts. Furthermore, g_cond_wait will hang, potentially forever, unless by luck some additional event (that should not have a bearing on this flow) happens to post.