bluesky / bluesky-queueserver

Server for queueing plans
https://blueskyproject.io/bluesky-queueserver/
BSD 3-Clause "New" or "Revised" License

Manager and worker stability in cases of blocked Run Engine event loop #261

Closed dmgav closed 1 year ago

dmgav commented 1 year ago

The issue: It was discovered experimentally that if a plan executed by the worker process blocks the Run Engine event loop (e.g. gets stuck in an infinite loop) and an attempt is made to pause the plan by sending a re_pause API request, the manager process fails about 5 minutes after the attempt and enters an endless sequence of restarts.

Explanation: The handler for the re_pause request calls the RE.request_pause() method to initiate the pause. The method is a blocking function: it schedules execution of a coroutine and waits for the result of the respective future. Once the Run Engine event loop is blocked, the coroutine is never executed and the future never returns a result, so the re_pause handler blocks as well. (Note that communication between the manager and the worker was assumed to follow a request-reply protocol in which message handlers are simple and return almost immediately. It was overlooked that the re_pause handler could block.) Since messages are processed one by one in a loop, the blocked handler stalls the processing loop, and incoming messages (mostly periodic status requests) accumulate in the communication pipe until it overflows and blocks Communication.send() in the manager process, which in turn blocks the event loop of the manager process. When the manager process is restarted (5 seconds after its event loop is blocked), the pipe is still full, so Communication.send() blocks again, causing another restart, and so on. An unresponsive worker environment can always be killed using the environment_destroy API, but this requires a functioning manager process.
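The blocking mechanism described above can be reproduced with plain asyncio. The following is a minimal sketch, not the queue server or Run Engine code: it assumes the pause is scheduled on a loop running in another thread with `asyncio.run_coroutine_threadsafe()`, and the names `do_pause`, `request_pause_with_timeout` and `demo` are hypothetical. It shows that waiting on the future without a timeout would hang while the loop is blocked, and that a bounded wait lets the caller return.

```python
import asyncio
import concurrent.futures
import threading
import time


def start_loop_in_thread():
    """Run an asyncio event loop in a daemon thread (stand-in for the Run Engine loop)."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


async def do_pause():
    """Hypothetical stand-in for the coroutine that performs the pause."""
    return "paused"


def request_pause_with_timeout(loop, timeout):
    """Schedule the coroutine on the loop and wait at most `timeout` seconds.

    Calling future.result() with no timeout would block indefinitely
    if the loop never gets a chance to run the coroutine.
    """
    future = asyncio.run_coroutine_threadsafe(do_pause(), loop)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()  # drop the request instead of waiting for the loop
        return None


def demo(block_seconds=0.6, timeout=0.2):
    loop, thread = start_loop_in_thread()

    # Loop is responsive: the request completes almost immediately.
    before = request_pause_with_timeout(loop, timeout)

    # Simulate a plan that blocks the loop: time.sleep() runs inside the loop thread.
    loop.call_soon_threadsafe(time.sleep, block_seconds)
    time.sleep(0.05)  # give the loop a moment to enter the blocking call
    during = request_pause_with_timeout(loop, timeout)

    # Once the blocking call ends, requests are served again.
    time.sleep(block_seconds)
    after = request_pause_with_timeout(loop, timeout)

    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=1.0)
    return before, during, after


if __name__ == "__main__":
    print(demo())
```

With the default arguments, `demo()` should return `"paused"` for the requests made before and after the blocking period and `None` for the request made while the loop is blocked. Whether a timeout, a separate thread, or some other approach is the right remedy for the re_pause handler is a design decision this sketch does not settle.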

Changes in this PR:

Summary of Changes for Release Notes

Fixed

Added

Changed

Removed

How Has This Been Tested?