Closed ramonfigueiredo closed 1 month ago
RQD occasionally gets into a "wedged" state in which it becomes unresponsive and no new frames can be booked to the host. The issue often goes unnoticed until users complain about slow booking or inefficient rendering. One cause of the wedged state is a loss of network connectivity between the render nodes and Cuebot. While RQD is wedged, commands such as locking or rebooting the host do not work as expected until RQD is manually restarted.
Some comments on the PR description:
I don't see exponential backoff as part of your implementation, so I would not mention it.
rqd/rqd/rqnetwork.py: ... Improved gRPC connection handling with a retry mechanism and exponential backoff. ...
Comment updated!
A new fix is in the merged PR below.
[rqd] Fix rqd cache spill issue #1531
Closing this PR.
Changes:
1) rqd/deploy/opencue-rqd.service:
- `Restart=always` ensures the service restarts after crashes or exits.
- `RestartSec=5` adds a delay of 5 seconds before restarting the service.
- `ExecStop` in the service file.

2) rqd/rqd/rqconstants.py:
- `RQD_GRPC_MAX_RETRIES = 5` to control the maximum number of gRPC reconnection attempts.

3) rqd/rqd/rqnetwork.py:
- `handle_wedged_state()` to log the issue, perform cleanup, and terminate the process with a non-zero exit code (`sys.exit(1)`).
- `sys` import to enable `sys.exit()` in error handling.
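
For readers who want to see how the pieces listed above could fit together, here is a minimal sketch, not the code from this PR. It assumes a bounded retry loop with no backoff, in line with the review comment. `call_with_retries` and the use of `ConnectionError` as a stand-in for `grpc.RpcError` are hypothetical. Only `RQD_GRPC_MAX_RETRIES`, `handle_wedged_state()`, and `sys.exit(1)` come from the change list. The cleanup step is left as a comment because the PR text does not describe it.

```python
import logging
import sys

# Value taken from the change list; in the PR it lives in rqconstants.py.
RQD_GRPC_MAX_RETRIES = 5

log = logging.getLogger("rqd-sketch")


def handle_wedged_state(reason):
    """Log the problem, then exit non-zero so systemd can restart the service."""
    log.critical("RQD appears wedged: %s", reason)
    # Cleanup would go here; the actual cleanup steps are not shown in the PR text.
    sys.exit(1)


def call_with_retries(rpc, max_retries=RQD_GRPC_MAX_RETRIES):
    """Hypothetical helper: call rpc() up to max_retries times.

    If every attempt fails, the connection is treated as wedged.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return rpc()
        except ConnectionError as exc:  # stand-in for grpc.RpcError
            last_error = exc
            log.warning("attempt %d/%d failed: %s", attempt, max_retries, exc)
    handle_wedged_state(last_error)
```

With `Restart=always` and `RestartSec=5` in the unit file, the non-zero exit would lead systemd to start RQD again after five seconds, which is what replaces the manual restart described in the PR description.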