Snaipe / Criterion

A cross-platform C and C++ unit testing framework for the 21st century
MIT License

core: fix deadlock in overloaded scenarios #483

Closed MrAnno closed 1 year ago

MrAnno commented 1 year ago

Root cause: client_ctx::alive handling and BoxFort timeout conflict

To indicate that a client is dead, a death message has to be sent to the runner, and that message needs to be the very last message. This is done correctly.

The problem is handle_birth(), which sets the alive flag to true. Under rare circumstances (an overloaded system), handle_birth() may never run, because the test timeout fires before the birth message is sent. This leaves the alive flag set to false, causing a deadlock:

-- thread 1
run_tests_async():
  while (read_message()):
    process_message()
    if not alive -> remove_client() -> destroy_client_context() -> bxf_wait() -> pthread_cond_wait()

-- thread 2
child_pump_fn():
  reap_child() -> death_callback() -> cr_send_to_runner() * 2 -> nn_recv()
  ...
  pthread_cond_broadcast(&instance->cond);

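The death callback calls cr_send_to_runner(), which blocks in nn_recv() until the main message loop sends an ack. That ack never arrives: the main loop (thread 1) is itself blocked in pthread_cond_wait(), waiting for the pthread_cond_broadcast() that thread 2 only issues after the death callback returns.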
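To make the ordering problem easier to follow, here is a minimal single-threaded model of the alive-flag logic. It is only a sketch, not Criterion's code: msg_kind, client_model, messages_before_removal() and check_model() are hypothetical names, and it assumes the flag starts out false, is set by the birth message, cleared by the death message, and checked after every processed message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds; not Criterion's real protocol types. */
enum msg_kind { MSG_BIRTH, MSG_OTHER, MSG_DEATH };

/* Hypothetical stand-in for the per-client state discussed above. */
struct client_model { bool alive; };

/*
 * Models the main loop: process one message, then remove the client
 * if it is not alive. Returns how many messages were consumed before
 * removal, or n if the client was never removed.
 */
size_t messages_before_removal(const enum msg_kind *msgs, size_t n)
{
    struct client_model c = { .alive = false }; /* assumed initial state */

    for (size_t i = 0; i < n; ++i) {
        if (msgs[i] == MSG_BIRTH)
            c.alive = true;   /* what handle_birth() is described as doing */
        else if (msgs[i] == MSG_DEATH)
            c.alive = false;  /* death must be the very last message */

        if (!c.alive)
            return i + 1;     /* the real loop would call remove_client() here */
    }
    return n;
}

/* Small self-check contrasting the normal and the timed-out sequence. */
void check_model(void)
{
    const enum msg_kind normal[]   = { MSG_BIRTH, MSG_OTHER, MSG_DEATH };
    const enum msg_kind no_birth[] = { MSG_OTHER, MSG_DEATH };

    /* Normal case: the client is removed only after the death message. */
    assert(messages_before_removal(normal, 3) == 3);

    /* Birth skipped: the client is removed after the first message,
     * while the second one is still waiting for its ack. */
    assert(messages_before_removal(no_birth, 2) == 1);
}
```

In the second sequence the model removes the client while one message is still pending. In the real runner, that early removal is where thread 1 would enter bxf_wait() while thread 2 is still waiting for an ack.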

MrAnno commented 1 year ago

@Snaipe I think with this PR and #474, we should prepare v2.4.2.