Open gbhrdt opened 6 years ago
Are you restarting the goworker application? We see this behaviour when the application is hard-stopped and doesn't have time to clean up its records in Redis (the workers and worker:[node] keys).
@mingan Yes, sometimes we re-deploy the Docker containers while jobs are still running, so that might be the cause. I think we should definitely clean up when starting up again.
Edit: node-resque does something like this to clean up:
const shutdown = async () => {
await scheduler.end();
await worker.end();
process.exit();
};
process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
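For comparison, a minimal sketch of the same hook in Go, using only the standard library's os/signal package; cleanupWorkerKeys is a hypothetical placeholder for whatever removes the worker's Redis entries, not an existing goworker API:

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

// cleanupWorkerKeys is a hypothetical placeholder: it would remove this
// worker's entries from the Redis workers set and its worker:[node] keys.
func cleanupWorkerKeys() {
    log.Println("cleaning up Redis worker keys")
}

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

    go func() {
        <-sigs
        cleanupWorkerKeys()
        os.Exit(0)
    }()

    // ... start the workers here, e.g. goworker.Work() ...
    select {} // block so the sketch keeps running until a signal arrives
}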
Yeah, the problem is figuring out a safe mechanism to do so while keeping it compatible with the Resque gem.
We don't have any disadvantage from those jobs other than the memory they consume in Redis, right? So concurrency still works as expected and the stuck jobs are no longer considered by goworker?
If it's the same issue we have experienced, there are extra values in the set of workers and the dead workers appear to still be working in the UI (there are records under the given prefix). I'm not sure whether the jobs themselves are failed or abandoned; that might be an issue.
There's similar code in goworker, https://github.com/benmanns/goworker/blob/master/signals.go, which stops polling and stops idle workers. I don't remember it exactly and don't have time to look it up at the moment, but I think it doesn't force a running worker to stop, so unless the job finishes normally, shutdown might hang.
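To illustrate the point, here is a stripped-down sketch of that "stop polling on signal, let in-flight jobs finish" pattern; this is an illustration, not goworker's actual code, and it shows where shutdown can hang if a running job never returns:

package main

import (
    "os"
    "os/signal"
    "sync"
    "syscall"
    "time"
)

func main() {
    quit := make(chan struct{})
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
    go func() {
        <-sigs
        close(quit) // tell the poller to stop handing out new jobs
    }()

    jobs := make(chan string)
    var wg sync.WaitGroup

    // Poller: stops fetching new jobs once quit is closed.
    go func() {
        defer close(jobs)
        for {
            select {
            case <-quit:
                return
            default:
                jobs <- "job" // stand-in for popping a job from Redis
                time.Sleep(time.Second)
            }
        }
    }()

    // Worker: drains the channel, finishing any job it already picked up.
    wg.Add(1)
    go func() {
        defer wg.Done()
        for range jobs {
            // process the job; a job that never returns keeps wg.Wait() blocked
        }
    }()

    wg.Wait() // running workers are never forced to stop
}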
This logic should be added the way it is in the "main" Resque: https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L599. It basically consists of a heartbeat and a prune function run when the worker is started, which expires old workers.
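A minimal sketch of what that heartbeat/prune pair could look like in Go with go-redis; the key names (resque:workers set, resque:worker:<id> keys, resque:workers:heartbeat hash) and the timestamp format are assumptions based on this thread and my reading of Resque's worker.rb, so they would need to be verified against the gem for real compatibility:

package main

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

const (
    heartbeatKey      = "resque:workers:heartbeat" // assumed Resque-compatible hash
    heartbeatInterval = 60 * time.Second
    pruneInterval     = 5 * heartbeatInterval // treat a worker as dead after this
)

// heartbeat records "this worker is alive" with a timestamp.
func heartbeat(ctx context.Context, rdb *redis.Client, workerID string) error {
    return rdb.HSet(ctx, heartbeatKey, workerID, time.Now().UTC().Format(time.RFC3339)).Err()
}

// pruneDeadWorkers removes workers whose last heartbeat is older than pruneInterval.
// Note: Resque's prune also fails any job the dead worker was still processing;
// that part is omitted from this sketch.
func pruneDeadWorkers(ctx context.Context, rdb *redis.Client) error {
    beats, err := rdb.HGetAll(ctx, heartbeatKey).Result()
    if err != nil {
        return err
    }
    for id, ts := range beats {
        last, err := time.Parse(time.RFC3339, ts)
        if err != nil || time.Since(last) > pruneInterval {
            // Unregister the dead worker: drop it from the workers set,
            // delete its per-worker keys, and remove its heartbeat entry.
            pipe := rdb.TxPipeline()
            pipe.SRem(ctx, "resque:workers", id)
            pipe.Del(ctx, "resque:worker:"+id, "resque:worker:"+id+":started")
            pipe.HDel(ctx, heartbeatKey, id)
            if _, err := pipe.Exec(ctx); err != nil {
                return err
            }
        }
    }
    return nil
}

The idea is that each worker calls heartbeat on a ticker while it runs, and pruneDeadWorkers is called once on startup, matching the Resque behaviour described above.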
I'll try to work on this and add it to the lib. Would this be something that would be merged if implemented? (cc @benmanns)
It can happen that workers get stuck silently. We were using the node-resque worker before, which handled this scenario very well. With goworker, jobs just keep being shown as running in resque-web. Also, the worker count in resque-web keeps increasing (it should be 4).