[rqd] Implement systemd-based process recovery and improve fault-tolerant gRPC handling

ramonfigueiredo commented 1 month ago

Changes:

1) rqd/deploy/opencue-rqd.service:

Added automatic restart configuration for RQD using systemd:
- Restart=always ensures the service restarts after crashes or exits.
- RestartSec=5 adds a delay of 5 seconds before restarting the service.
Fixed the missing closing quotation mark for ExecStop in the service file.
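
For reference, a minimal sketch of how the two restart directives could be checked from Python. Only Restart=always and RestartSec=5 come from this description; placing them under [Service] follows the usual systemd unit layout, and the rest of opencue-rqd.service is not shown here. The helper name restart_policy is hypothetical.

```python
import configparser

# Only the two directives named in the PR description; the remaining
# contents of opencue-rqd.service are intentionally left out.
UNIT_FRAGMENT = """\
[Service]
Restart=always
RestartSec=5
"""


def restart_policy(unit_text):
    """Return (Restart, RestartSec) from the [Service] section of a unit file.

    Hypothetical helper for illustration only; it is not part of RQD.
    """
    # interpolation=None keeps '%' specifiers in unit files from being
    # interpreted; strict=False tolerates keys that appear more than once.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    service = parser["Service"]
    # Assumes RestartSec is a bare number of seconds, as in this PR.
    return service.get("Restart"), int(service.get("RestartSec"))


if __name__ == "__main__":
    print(restart_policy(UNIT_FRAGMENT))
```

With Restart=always, systemd restarts the service regardless of how it exited, so the non-zero exit described under rqnetwork.py below is enough to trigger a restart after the 5 second delay.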

2) rqd/rqd/rqconstants.py:

Added the constant RQD_GRPC_MAX_RETRIES = 5 to control the maximum number of gRPC reconnection attempts.

3) rqd/rqd/rqnetwork.py:

Improved gRPC connection handling with a retry mechanism.
Changed the gRPC reconnection logic to log and increment the reconnection attempts counter before retrying.
Added handling for a "wedged" state when the maximum number of reconnection attempts is reached:
- Introduced handle_wedged_state() to log the issue, perform cleanup, and terminate the process with a non-zero exit code (sys.exit(1)).
- This termination allows the systemd service to automatically restart RQD.
Minor cleanup; added the sys import so that sys.exit() can be used in error handling.
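
The retry flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in rqnetwork.py: RQD_GRPC_MAX_RETRIES = 5, handle_wedged_state() and sys.exit(1) come from this description, while call_with_reconnect, send, reconnect, cleanup and the fixed RETRY_DELAY_SECONDS are hypothetical names. ConnectionError stands in for whatever gRPC error the real code catches.

```python
import logging
import sys
import time

RQD_GRPC_MAX_RETRIES = 5   # value stated in the PR description
RETRY_DELAY_SECONDS = 1.0  # hypothetical fixed delay between attempts

log = logging.getLogger("rqd-retry-sketch")


def handle_wedged_state(cleanup=None):
    """Log the issue, run optional cleanup, and exit with a non-zero code."""
    log.critical("Reached %d reconnection attempts; exiting so systemd "
                 "can restart the service", RQD_GRPC_MAX_RETRIES)
    if cleanup is not None:
        cleanup()
    sys.exit(1)  # non-zero exit lets the systemd restart policy take over


def call_with_reconnect(send, reconnect,
                        max_retries=RQD_GRPC_MAX_RETRIES,
                        delay=RETRY_DELAY_SECONDS):
    """Call send(), reconnecting and retrying after connection failures.

    send and reconnect are hypothetical callables standing in for the
    real gRPC call and channel setup.
    """
    attempts = 0
    while True:
        try:
            return send()
        except ConnectionError as exc:
            # Log and count the failure before deciding whether to retry.
            attempts += 1
            log.warning("gRPC call failed (attempt %d/%d): %s",
                        attempts, max_retries, exc)
            if attempts >= max_retries:
                handle_wedged_state()
            time.sleep(delay)
            reconnect()
```

A caller would pass its own send and reconnect functions, for example call_with_reconnect(report_status, rebuild_channel), where both names are placeholders. The delay here is constant rather than exponential.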

ramonfigueiredo commented 1 month ago

The problem this PR addresses is that RQD occasionally gets into a "wedged" state in which it becomes unresponsive, preventing new frames from being booked to the host. This often goes unnoticed until complaints arise about slow booking or inefficient rendering. The wedged state can be caused by a loss of network connection between the render nodes and Cuebot. While RQD is wedged, commands such as locking or rebooting the host do not work as expected until RQD is manually restarted.

DiegoTavares commented 1 month ago

Some comments on the PR description:

I don't see exponential backoff in your implementation, so I would not mention it.

rqd/rqd/rqnetwork.py: ... Improved gRPC connection handling with a retry mechanism and exponential backoff. ...

ramonfigueiredo commented 1 month ago

Comment updated!

ramonfigueiredo commented 1 month ago

A new fix is in the merged PR below.

[rqd] Fix rqd cache spill issue #1531

Closing this PR.

AcademySoftwareFoundation / OpenCue

[rqd] Implement systemd-based process recovery and improve fault-tolerant gRPC handling #1518