The issue: It was experimentally discovered that if a plan executed by the worker process blocks the Run Engine event loop (e.g. gets stuck in an infinite loop) and an attempt is made to pause the plan by sending a `re_pause` API request, the manager process fails about 5 minutes after the attempt and enters an infinite sequence of restarts.
Explanation: The handler for the `re_pause` request calls the `RE.request_pause()` method to initiate the pause. The method is a blocking function, which schedules execution of a coroutine and waits for the result of the respective future. Once the Run Engine event loop is blocked, execution of the coroutine is blocked and the future never returns a result, which blocks execution of the `re_pause` handler. (Note that communication between the manager and the worker is assumed to follow a request-reply protocol in which the message handlers are simple and return almost immediately; it was overlooked that the `re_pause` handler could be blocked.) Since messages are processed one by one in a loop, the blocked handler blocks the processing loop, and incoming messages (mostly periodic status requests) accumulate in the communication pipe until it overflows and blocks `Communication.send()` in the manager process, which in turn blocks the event loop of the manager process. When the manager process is restarted (5 seconds after its event loop is blocked), the pipe remains full, so `Communication.send()` blocks again, causing another restart, and so on. An unresponsive worker environment can always be killed using the `environment_destroy` API, but this requires a functioning manager process.
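The blocking mechanism can be illustrated with a minimal sketch (not the actual bluesky implementation; the function name and details are assumed): a blocking pause request schedules a coroutine on the Run Engine event loop and waits for the resulting future, so it can never return while that loop is stuck inside a plan.

```python
import asyncio

def request_pause_blocking(re_loop: asyncio.AbstractEventLoop):
    """Sketch of a blocking pause request (illustrative, not RE.request_pause itself)."""

    async def _pause():
        # In the real Run Engine this coroutine would switch the state to 'paused'.
        ...

    # Schedule the coroutine on the Run Engine loop from the message-handler thread
    # and block until it completes. If the loop is busy running a stuck plan,
    # the coroutine never runs and .result() never returns, freezing the handler.
    fut = asyncio.run_coroutine_threadsafe(_pause(), re_loop)
    return fut.result()
```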
Changes in this PR:
Added a buffer (`queue.Queue`) for received messages to the `PipeJsonRpcReceive` class. Received messages are loaded from the pipe and placed in the queue as soon as they arrive, so the pipe always remains empty. Once the capacity of the queue is reached, new messages are discarded. In normal operation (request-reply), the queue should not contain more than one message at any given time, but it is possible for more messages to be placed in the queue (requests may time out on the manager side and new requests may be sent). It is safe to assume that a queue overflow indicates a major malfunction (probably due to a bug) in the worker process, which then needs to be restarted. This change prevents the manager process from being blocked by a blocked message-processing loop in the worker, but it does not fix the issue with the `re_pause` handler, which still blocks the message loop and makes the worker process unresponsive.
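Below is a hedged sketch of the buffering idea; the class and attribute names are illustrative and do not reproduce the actual `PipeJsonRpcReceive` code. A reader thread drains the pipe into a bounded `queue.Queue` the moment messages arrive and silently drops messages once the queue is full, so the sending side of the pipe can never block.

```python
import queue
import threading

class BufferedPipeReceiverSketch:
    """Illustrative sketch: drain a multiprocessing pipe into a bounded queue."""

    def __init__(self, conn, buffer_size=100):
        self._conn = conn                              # multiprocessing.Connection
        self._msg_queue = queue.Queue(maxsize=buffer_size)
        self._reader = threading.Thread(target=self._read_pipe, daemon=True)

    def start(self):
        self._reader.start()

    def _read_pipe(self):
        while True:
            msg = self._conn.recv()                    # the pipe is emptied immediately
            try:
                self._msg_queue.put_nowait(msg)
            except queue.Full:
                # Overflow means the processing loop is stuck (major malfunction);
                # discard the message so the manager is never blocked on a full pipe.
                pass

    def next_message(self, timeout=0.1):
        # The (possibly blocked) processing loop consumes messages from here.
        return self._msg_queue.get(timeout=timeout)
```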
Modified the `re_pause` message handler in the worker process. Instead of calling `RE.request_pause()`, the handler places the respective coroutine on the Run Engine event loop without waiting for the future result. This is a non-blocking operation, and the message handler exits immediately even if the event loop is blocked. It is assumed that after issuing a `re_pause` request, client scripts/applications wait for the Run Engine to reach the paused state, so waiting for the result in the handler serves no purpose and only slows down the handler.
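A minimal sketch of the non-blocking handler under these assumptions (the handler name, response format, and the `RE.loop` attribute are used for illustration only): the pause coroutine is scheduled on the Run Engine loop and the handler returns without waiting for the result.

```python
import asyncio

def handle_re_pause(RE, option="deferred"):
    """Illustrative non-blocking pause handler (not the exact worker code)."""

    async def _request_pause():
        # The coroutine runs on the Run Engine loop once it becomes responsive again.
        ...

    # Schedule the coroutine without calling .result(): the handler returns
    # immediately even if the loop is currently blocked by a stuck plan.
    # RE.loop is assumed to be the asyncio loop the Run Engine runs on.
    asyncio.run_coroutine_threadsafe(_request_pause(), RE.loop)

    # Clients are expected to poll the status until the Run Engine reports 'paused'.
    return {"success": True, "msg": ""}
```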
Summary of Changes for Release Notes
Fixed
Improved manager and worker stability in the case of malfunctioning plans (plans that block the Run Engine event loop).
Added
Changed
Removed
How Has This Been Tested?