OpenFn / kit

The bits & pieces that make OpenFn work. (diagrammer, cli, compiler, runtime, runtime manager, logger, etc.)

Worker Death #664

Closed josephjclark closed 2 months ago

josephjclark commented 2 months ago

In production this afternoon we had a nasty case of worker death.

Link to logs

A couple of things to note:

Suspicion 1: Child processes are dying and not being recreated (but maybe they're managing to send an error to the worker first, so the queue is getting freed up)

Suspicion 2: Postgres is erroring and calling process.exit(), and the exit is not properly handled
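
To make the two suspicions concrete, here is a minimal sketch of how a parent could judge a child's exit. The `code`/`signal` pair mirrors what Node's `ChildProcess` `exit` event reports; `judgeExit`, `ChildExit` and `ExitVerdict` are hypothetical names for illustration, not anything in the kit codebase. The point it tries to show: if something inside the child calls process.exit() while a task is in flight, even exit code 0 should probably be treated as a lost task.

```typescript
// Mirrors the arguments of Node's ChildProcess "exit" event:
// `code` is the exit code (null when the child was killed by a signal),
// `signal` is the signal name (null when the child exited on its own).
interface ChildExit {
  code: number | null;
  signal: string | null;
}

// Hypothetical verdict type, for illustration only.
type ExitVerdict =
  | { kind: "idle-exit"; respawn: boolean }
  | { kind: "task-lost"; respawn: true; reason: string };

// Hypothetical helper: decide what the pool should do when a child exits.
function judgeExit(exit: ChildExit, taskInFlight: boolean): ExitVerdict {
  if (!taskInFlight) {
    // No task to fail. Respawn only if the exit was not a clean code 0.
    return { kind: "idle-exit", respawn: exit.code !== 0 };
  }
  // A task was running, so any exit (even code 0 from a stray
  // process.exit() call) means the task never reported a result.
  const reason =
    exit.signal !== null
      ? `killed by ${exit.signal}`
      : `exited with code ${exit.code}`;
  return { kind: "task-lost", respawn: true, reason };
}
```

In this sketch the caller would reject the in-flight task with `reason` and start a replacement child whenever `respawn` is true.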

josephjclark commented 2 months ago

I have managed to repro locally.

I don't really think it matters what the error is. In my case, I'm passing "ssl": true and my postgres fails because there's no cert.

But the symptom is the same: the workflow errors, logs start timing out, and the workerpool does not relinquish the child process.
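
A rough model of the behaviour I'd expect instead, as a sketch only: when a child exits, the pool fails whatever task it was running, gives the slot back, and spawns a replacement. `TinyPool` is a hypothetical toy that uses plain numbers in place of real child processes; it is not the workerpool API and not the kit implementation.

```typescript
// Toy pool: children are represented by numeric pids (an assumption made
// to keep the sketch self-contained; no real processes are started).
class TinyPool {
  private nextPid = 1;
  private idle: number[] = [];
  private busy = new Map<number, string>(); // pid -> id of the task it runs
  readonly failed: string[] = []; // ids of tasks lost to a child exit

  constructor(size: number) {
    for (let i = 0; i < size; i++) this.idle.push(this.spawn());
  }

  // Stand-in for starting a child process; returns a fresh pid.
  private spawn(): number {
    return this.nextPid++;
  }

  // Number of children currently free to take a task.
  get capacity(): number {
    return this.idle.length;
  }

  // Assign a task to an idle child. Returns its pid, or null if none is free.
  run(taskId: string): number | null {
    const pid = this.idle.shift();
    if (pid === undefined) return null;
    this.busy.set(pid, taskId);
    return pid;
  }

  // Normal completion: the child goes back to the idle list.
  complete(pid: number): void {
    if (this.busy.delete(pid)) this.idle.push(pid);
  }

  // Child exited for any reason: fail its task (if any), drop the dead
  // child, and spawn a replacement so capacity is not lost.
  handleExit(pid: number): void {
    const taskId = this.busy.get(pid);
    if (taskId !== undefined) {
      this.busy.delete(pid);
      this.failed.push(taskId);
    } else {
      this.idle = this.idle.filter((p) => p !== pid);
    }
    this.idle.push(this.spawn());
  }
}

const pool = new TinyPool(1);
const pid = pool.run("workflow-a"); // takes the only child
if (pid !== null) pool.handleExit(pid); // the child dies mid-task
console.log(pool.failed, pool.capacity); // the task is failed and the slot is back
```

The symptom described above corresponds to `handleExit` never running: the dead child stays in `busy` and `capacity` stays at zero.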