Closed josephjclark closed 2 months ago
I have managed to reproduce this locally.
I don't think it matters what the error is. In my case, I'm passing "ssl": true and my Postgres connection fails because there's no cert.
But the symptom is the same: the workflow errors, logs start timing out, and the workerpool does not relinquish the child process.
In production this afternoon we had a nasty case of worker death
Link to logs
A couple of things to note:
Suspicion 1: Child processes are dying and not being recreated (but maybe they're managing to send out an error to the worker so that the queue is getting freed up)
Suspicion 2: Postgres is erroring and calling process.exit(), and the exit is not properly handled
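To make Suspicion 2 concrete, here is a rough sketch of what "handling the exit" could look like on the parent side. This is not workerpool's API and not our actual worker code: `watchChild` and `fakeChild` are hypothetical names, and a plain `EventEmitter` stands in for a real Node `ChildProcess` (which emits `message` and `exit` in the same way). The idea is only that a task should settle exactly once, either on a reply or on the child exiting, so that an unexpected `process.exit()` frees the slot instead of leaving it hanging.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical helper (an assumption, not part of workerpool):
// settle a task exactly once, whether the child replies via 'message'
// or dies via 'exit' before replying.
function watchChild(child, onSettled) {
  let settled = false;

  const settle = (outcome) => {
    if (settled) return; // ignore anything after the first outcome
    settled = true;
    // Drop both listeners so the parent holds no reference to a dead child
    child.removeListener("message", onMessage);
    child.removeListener("exit", onExit);
    onSettled(outcome);
  };

  const onMessage = (message) => settle({ ok: true, result: message });
  const onExit = (code, signal) =>
    settle({
      ok: false,
      error: new Error(
        `child exited before replying (code=${code}, signal=${signal})`
      ),
    });

  child.on("message", onMessage);
  child.on("exit", onExit);
}

// Stand-in for a forked child that calls process.exit(1) mid-task
const fakeChild = new EventEmitter();
const outcomes = [];
watchChild(fakeChild, (outcome) => outcomes.push(outcome));

fakeChild.emit("exit", 1, null); // the child dies without replying
fakeChild.emit("message", { late: true }); // ignored: task already settled

console.log(outcomes.length, outcomes[0].ok, outcomes[0].error.message);
// prints: 1 false child exited before replying (code=1, signal=null)
```

If something like this is missing (or the exit listener is only attached at spawn time and not per task), it would fit the symptom of the pool never getting its child back.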