We've observed cases of workers appearing to stall and not pick up new
work, even though eligible jobs are present (with "eligibility"
determined using the same functions the workers use). The current
working hypothesis is that one or more internal Overseer components are
experiencing errors, causing the system to come to a halt, with no
visibility into those errors.
This change instruments the executor and ready-job-detector with a new
exception handler that will log any errors locally, then to Sentry, and
then fatally shut down the entire process (as errors in these stages are
irrecoverable framework errors). Note that the heartbeat process already
has its own error detection and (configurable) shutdown logic; no
attempt is made here to unify the two.
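As a rough sketch of the handler's shape (names are hypothetical; the actual component loops, logger, and Sentry wiring are assumptions, not the real implementation):

```python
import logging
import sys

log = logging.getLogger("overseer")  # hypothetical logger name


def run_with_fatal_handler(component, loop, report=None, shutdown=sys.exit):
    """Wrap a component's main loop (e.g. executor, ready-job-detector).

    Any uncaught error is logged locally, forwarded to the error
    reporter (e.g. sentry_sdk.capture_exception), and then the whole
    process is shut down, since errors at this layer are treated as
    irrecoverable framework errors.
    """
    try:
        return loop()
    except Exception as exc:
        # 1. Log locally first, so the error survives even if the
        #    remote reporter is unreachable.
        log.exception("fatal error in %s; shutting down process", component)
        # 2. Then report to Sentry (injected here to keep the sketch
        #    self-contained).
        if report is not None:
            report(exc)
        # 3. Finally, take the whole process down.
        shutdown(1)
```

The shutdown callable is injected only to make the sketch testable; in practice this would call `sys.exit` (or a harder `os._exit`) directly.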
There are interesting considerations regarding making the workers consistent with the instrumented-future programming model. We'll run the experiment for some time, then revisit whether to pursue this route or revert the change.