AcademySoftwareFoundation / OpenCue

A render management system you can deploy for visual effects and animation productions.
https://www.opencue.io
Apache License 2.0
832 stars, 202 forks

[rqd] Implement systemd-based process recovery and improve fault-tolerant gRPC handling #1518

Closed ramonfigueiredo closed 1 month ago

ramonfigueiredo commented 1 month ago

Changes:

1) rqd/deploy/opencue-rqd.service:

2) rqd/rqd/rqconstants.py:

3) rqd/rqd/rqnetwork.py:

ramonfigueiredo commented 1 month ago

RQD occasionally gets into a "wedged" state in which it becomes unresponsive and no new frames can be booked to the host. The problem often goes unnoticed until users complain about slow booking or inefficient rendering. The wedged state can be caused by a loss of network connectivity between the render nodes and Cuebot. While RQD is wedged, commands such as locking or rebooting the host do not work as expected until RQD is manually restarted.

DiegoTavares commented 1 month ago

Some comments on the PR description:

I don't see exponential backoff in your implementation, so I would not mention it.

rqd/rqd/rqnetwork.py: ... Improved gRPC connection handling with a retry mechanism and exponential backoff. ...
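
For readers following the thread, the distinction being drawn here can be sketched as follows. This is a minimal illustration, not the code from this PR: `fixed_delays`, `exponential_delays`, and `call_with_retry` are hypothetical helper names, and `ConnectionError` stands in for whatever exception the real gRPC call raises. A plain retry mechanism waits the same interval between attempts, while exponential backoff grows the wait after each failure.

```python
import time


def fixed_delays(delay_s, attempts):
    """Plain retry: the same wait before every retry (hypothetical helper)."""
    return [delay_s] * attempts


def exponential_delays(base_s, attempts, factor=2.0, cap_s=60.0):
    """Exponential backoff: each wait is `factor` times the last, capped at `cap_s`."""
    return [min(base_s * factor ** i, cap_s) for i in range(attempts)]


def call_with_retry(fn, delays, sleep=time.sleep, retry_on=(ConnectionError,)):
    """Call fn(); on a retryable error, wait for the next delay and try again.

    ConnectionError is a stand-in for the real RPC error type (an assumption).
    """
    for delay in delays:
        try:
            return fn()
        except retry_on:
            sleep(delay)
    # Final attempt after the last delay; any error now propagates to the caller.
    return fn()
```

Passing `fixed_delays(5, 3)` gives the behavior of a simple retry loop, while passing `exponential_delays(1, 5)` gives waits that double on each failure. Only the second would justify the phrase "exponential backoff" in a PR description.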

ramonfigueiredo commented 1 month ago

Comment updated!

ramonfigueiredo commented 1 month ago

A new fix is in the merged PR below.

[rqd] Fix rqd cache spill issue #1531

Closing this PR.