OpenFn / kit

The bits & pieces that make OpenFn work. (diagrammer, cli, compiler, runtime, runtime manager, logger, etc.)

Worker: report a good clear error if a websocket message times out #764

Open josephjclark opened 1 week ago

josephjclark commented 1 week ago

If any worker -> Lightning message times out on the websocket (i.e. because the reply took more than 10 seconds), the run is currently lost.

We can do better than this! We should surely be able to report the timeout somewhere, or continue retrying.

We may need help on the lightning side to recognise that message responses are slow.

Everyone will understand if the system is under load and running slow - so long as the work does get done eventually.

Probably the answer here is just to retry the message, or back off and retry.
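
A minimal sketch of what "back off and retry" could look like, assuming a message send can be wrapped in a promise. Everything here (`backoffDelay`, `withTimeout`, `sendWithRetry`, the 1 second base delay and 60 second cap) is hypothetical and not taken from the worker code; only the 10 second timeout comes from the description above.

```typescript
// Hypothetical sketch: none of these names exist in the worker codebase.
const TIMED_OUT = Symbol("timed out");

// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Resolve with TIMED_OUT if `promise` has not settled within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T | typeof TIMED_OUT> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), ms);
  });
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Keep resending until a reply arrives, reporting every timeout on the way.
// Errors other than timeouts are rethrown to the caller unchanged.
async function sendWithRetry<T>(
  send: () => Promise<T>,
  timeoutMs = 10000,
  log: (message: string) => void = console.warn
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const result = await withTimeout(send(), timeoutMs);
    if (result !== TIMED_OUT) return result as T;
    const delay = backoffDelay(attempt);
    log(`message timed out after ${timeoutMs}ms (attempt ${attempt + 1}); retrying in ${delay}ms`);
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
  }
}
```

One caveat with this approach: a timed-out message may still have been processed on the other side, so resending is only safe if the receiver tolerates duplicates.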

josephjclark commented 1 week ago

TD +1 on retry forever

josephjclark commented 1 week ago

TD maybe bump the claim queue backoff or something when messages start timing out.

So if you only have 1 job in progress, stop sending claim requests to Lightning. Because Lightning is busy! So back off and let the work finish.

I wonder if this is something like: take the average lightning reply time, and if that exceeds some threshold, multiply the claim backoff by it. In a trivial case, if the average message round trip is 9 seconds, then your backoff is +9 seconds.

That would help decrease load when Lightning is struggling and reduce the chance of lost runs.
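
A rough sketch of that idea, assuming round-trip times can be measured per message. `ReplyTimeTracker`, `claimBackoff`, the 20-sample window and the 2 second threshold are all made up for illustration. The comment above mentions both multiplying and adding; this sketch follows the additive worked example (9 second average gives +9 seconds), which is an assumption rather than a decision.

```typescript
// Hypothetical sketch: names and defaults are illustrative, not worker APIs.
class ReplyTimeTracker {
  private samples: number[] = [];

  constructor(private windowSize = 20) {}

  // Record one worker -> Lightning round trip, in milliseconds.
  record(roundTripMs: number): void {
    this.samples.push(roundTripMs);
    // Keep only the most recent `windowSize` samples.
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  // Mean of the recorded samples, or 0 if nothing has been recorded yet.
  average(): number {
    if (this.samples.length === 0) return 0;
    const total = this.samples.reduce((sum, ms) => sum + ms, 0);
    return total / this.samples.length;
  }
}

// If the average reply time exceeds the threshold, add it to the base
// claim backoff; otherwise leave the base backoff unchanged.
function claimBackoff(
  baseMs: number,
  tracker: ReplyTimeTracker,
  thresholdMs = 2000
): number {
  const avg = tracker.average();
  return avg > thresholdMs ? baseMs + avg : baseMs;
}
```

With a 1 second base backoff and replies averaging 9 seconds, `claimBackoff` returns 10 seconds; once replies speed up and the slow samples fall out of the window, it drops back to the base value.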