Closed al-niessner closed 1 year ago
It turns out the cause is an error in farm.rerunid(), which is called from farm.dispatch(). The exception propagates out and terminates the twisted LoopingCall(), so dispatching stops. The error originates in a postgres call made in this situation, so it is ultimately a postgres error.
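One possible mitigation, sketched below, is to guard the function handed to LoopingCall() so that an exception is logged rather than raised, since a LoopingCall stops once its callable raises. This is only a sketch under assumptions: `guarded`, `safe_dispatch`, and the stand-in `dispatch` are hypothetical names, not the actual farm code, and the simulated failure stands in for the postgres error.

```python
import functools
import logging

log = logging.getLogger("farm")  # hypothetical logger name


def guarded(fn):
    """Wrap fn so an exception is logged instead of propagating to the caller."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Record the traceback and return; the next tick gets another chance.
            log.exception("%s failed; will retry on next tick", fn.__name__)
            return None
    return wrapper


# Hypothetical stand-in for farm.dispatch(): the first call fails,
# the way a database error inside rerunid() might.
calls = {"n": 0}


def dispatch():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated database error")
    return "dispatched"


safe_dispatch = guarded(dispatch)

# Simulate three ticks of the periodic loop. With twisted, safe_dispatch
# would be the callable passed to LoopingCall() instead of dispatch.
results = [safe_dispatch() for _ in range(3)]
```

Swallowing the exception keeps the loop alive but does not fix the underlying postgres error, so the logged traceback still needs attention.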
There are a lot of jobs in the doing list, but all workers are idle. This starts right after the jobs of a task that the doing list depends upon have run. It appears to be a problem with the farm rather than the scheduler.
The number of idle workers declines as workers are removed and goes to 0, meaning there are no lost or botched worker counts.
Restarting workers does not cause jobs to flow out of doing.
Adding jobs before, at, or after the stuck tasks in doing has no effect.
There are no tasks of the stuck kind in success or failure, so it has never run.
All in all, it seems that the scheduler thinks there is a doing list, but that the farm has cleared its list including, potentially, all of the jobs that were added but never run.
More concretely: target.scrape ran for many but not all targets. The targets still in doing have neither success nor failure. How could they have been lost from the farm's job queue then?
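The check described above can be written as a set difference. This is a rough sketch only: `find_orphans` and the target names are made up for illustration, and it assumes the four collections can be read out of the scheduler and farm as plain sets of identifiers.

```python
def find_orphans(doing, success, failure, farm_queue):
    """Return jobs the scheduler lists as doing that have no recorded
    outcome and are no longer in the farm's queue."""
    finished = set(success) | set(failure)
    return sorted(set(doing) - finished - set(farm_queue))


# Hypothetical snapshot: target-a finished, target-b is still queued,
# target-c is in doing but nowhere else.
doing = {"target-b", "target-c"}
success = {"target-a"}
failure = set()
farm_queue = {"target-b"}

orphans = find_orphans(doing, success, failure, farm_queue)
```

Any identifier that comes back is a job the scheduler is waiting on but the farm will never hand to a worker.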