Timeout for stuck workers

benmanns / goworker

goworker is a Go-based background worker that runs 10 to 100,000* times faster than Ruby-based workers.

https://www.goworker.org

Timeout for stuck workers #65

Open gbhrdt opened 6 years ago

gbhrdt commented 6 years ago

Workers can get stuck silently. We were using the node-resque worker before, which handled this scenario very well. With goworker, the jobs just keep being shown as running in resque-web. Also, the worker count in resque-web keeps increasing (it should stay at 4).

mingan commented 6 years ago

Are you restarting the goworker application? We see this behaviour when the application is hard-stopped and doesn't have time to clean up its records in Redis (the workers and worker:[node] keys).

gbhrdt commented 6 years ago

@mingan Yes, we sometimes re-deploy the Docker containers while jobs are still running, so that might be the cause. I think goworker should definitely clean up when it starts again.

Edit: node-resque does something like this to clean up:

// Stop the scheduler and the worker, then exit.
const shutdown = async () => {
  await scheduler.end();
  await worker.end();
  process.exit();
};

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
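
For comparison, here is a minimal Go sketch of the same idea: wait for SIGTERM or SIGINT, then run cleanup steps in order. The names `shutdownSignals`, `listenForShutdown`, and `cleanupOnSignal` are hypothetical and are not part of goworker; only the standard `os/signal` package is used.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

// shutdownSignals mirrors the SIGTERM and SIGINT handlers in the snippet above.
// Hypothetical name, not part of goworker.
var shutdownSignals = []os.Signal{syscall.SIGTERM, os.Interrupt}

// listenForShutdown returns a channel that receives the shutdown signals.
func listenForShutdown() chan os.Signal {
	// Buffered: signal.Notify does not block when delivering a signal.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, shutdownSignals...)
	return sigs
}

// cleanupOnSignal blocks until a signal arrives on sigs, runs the cleanup
// steps in the order given, and returns the signal that triggered them.
func cleanupOnSignal(sigs <-chan os.Signal, steps ...func()) os.Signal {
	sig := <-sigs
	for _, step := range steps {
		step()
	}
	return sig
}
```

A program could call `cleanupOnSignal(listenForShutdown(), stopPolling, removeWorkerKeys)` from `main`, where `stopPolling` and `removeWorkerKeys` stand for whatever cleanup the application needs. This only helps when the process actually receives the signal and is given time to react; a hard kill skips it entirely.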

mingan commented 6 years ago

Yeah, the problem is figuring out a safe mechanism to do so while keeping it compatible with the Resque gem.

gbhrdt commented 6 years ago

Those stale jobs don't cause any problem other than extra memory consumption in Redis, right? So concurrency still works as expected, and goworker no longer takes the stuck jobs into account?

mingan commented 6 years ago

If it's the same issue we have experienced, there are extra values in the set of workers, and the dead workers appear to still be working in the UI (there are records under the given prefix). I'm not sure whether the jobs themselves are marked as failed or simply abandoned; that might be an issue.

There's similar code in goworker, https://github.com/benmanns/goworker/blob/master/signals.go, which stops polling and stops idle workers. I don't remember the details and don't have time to look them up at the moment, but I think it doesn't force a running worker to stop, so unless the job finishes normally, the worker might hang.
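
One way to keep a hanging job from blocking shutdown is to put a time limit on it. The following is a minimal sketch under that assumption, not goworker's actual behaviour; `runWithTimeout` and `errJobTimeout` are hypothetical names.

```go
package main

import (
	"errors"
	"time"
)

// errJobTimeout is a hypothetical sentinel error; goworker does not define it.
var errJobTimeout = errors.New("job exceeded its timeout")

// runWithTimeout runs job in its own goroutine and waits at most limit for it.
// Go cannot kill a goroutine, so on timeout the job keeps running in the
// background; the caller only regains control so it can record a failure.
func runWithTimeout(job func() error, limit time.Duration) error {
	// Buffered so a job that finishes late can still send and exit.
	done := make(chan error, 1)
	go func() { done <- job() }()

	timer := time.NewTimer(limit)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errJobTimeout
	}
}
```

Because the goroutine is not actually stopped, a job that should be cancellable would also need to cooperate, for example by checking a `context.Context` it is handed.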

xescugc commented 3 years ago

This logic should be added the way it is done in the "main" Resque: https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L599

It basically consists of a heartbeat plus a prune function that runs when a worker starts and expires old workers.
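
The heartbeat-and-prune idea can be sketched roughly as follows. This is an illustration only: an in-memory map stands in for the Redis records, and `registry`, `beat`, and `prune` are hypothetical names that exist in neither goworker nor Resque.

```go
package main

import (
	"sort"
	"sync"
	"time"
)

// registry is a hypothetical in-memory stand-in for the worker records kept
// in Redis; it is not part of goworker.
type registry struct {
	mu         sync.Mutex
	heartbeats map[string]time.Time // worker ID -> time of last heartbeat
}

func newRegistry() *registry {
	return &registry{heartbeats: make(map[string]time.Time)}
}

// beat records that the worker with the given ID is alive right now.
// A live worker would call this periodically.
func (r *registry) beat(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.heartbeats[id] = time.Now()
}

// prune removes every worker whose last heartbeat is older than maxAge and
// returns the removed IDs in sorted order. A worker would call this once at
// startup to clear records left behind by hard-stopped processes.
func (r *registry) prune(maxAge time.Duration) []string {
	r.mu.Lock()
	defer r.mu.Unlock()

	now := time.Now()
	var dead []string
	for id, last := range r.heartbeats {
		if now.Sub(last) > maxAge {
			dead = append(dead, id)
			delete(r.heartbeats, id) // deleting during range is safe in Go
		}
	}
	sort.Strings(dead)
	return dead
}
```

The choice of `maxAge` matters: it has to be comfortably longer than the heartbeat interval, otherwise a healthy but briefly delayed worker would be pruned by mistake.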

I'll try to work on this and add it to the lib. Would this be merged if implemented? (cc @benmanns)