fault tolerance

SEAPUNK commented 8 years ago

Here are a few cases I need to figure out:

Scenario 1: Server is started with a persistent backend, runners and jobs are created, server abruptly crashes, and then restarts immediately after.

Scenario 1 questions:

Scenario 2: Server is started, runners and jobs added, server abruptly crashes, but does not get restarted.

Scenario 2 questions:

How do the clients handle this?
What do the runners do when they cannot connect to the server, and as a result, cannot send job responses?

SEAPUNK commented 8 years ago

With the system described in the close reason for #7, I can answer these pretty easily:

Scenario 1:

Yes. There are two timeouts: the job timeout (the maximum time a job can run, period) and the "client" timeout (the maximum time a client can be unresponsive before we consider it dead). Unresponsive means either not querying the server in a timely fashion or, if we do something like a websocket transport, not responding to the server's transport query in a timely fashion. Those timeouts determine external job failure.
Yes. We store the time the runner last contacted the server, and use that time to calculate whether the runner has timed out. After a server failure the runner timeout is effectively reset, because the calculation is (now - Math.max(lastRunnerMessage, serverCreated)) > timeout, so the window is measured from the server's restart rather than from the runner's last message.
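
A minimal sketch of that runner timeout check, to show why a restart gives every runner a fresh window. The function name, the options object, and the use of millisecond timestamps are assumptions for illustration, not jobber's actual API; only the formula itself comes from the comment above.

```javascript
// Hypothetical helper: the name, argument shape, and millisecond units are
// assumptions. Only the formula is taken from the discussion.
function isRunnerTimedOut({ now, lastRunnerMessage, serverCreated, timeout }) {
  // After a restart, serverCreated is newer than any stored lastRunnerMessage,
  // so the elapsed time is measured from the restart instead.
  const reference = Math.max(lastRunnerMessage, serverCreated);
  return now - reference > timeout;
}

const timeout = 30_000;

// The runner last spoke at t=1000; the server crashed and came back at t=50000.
const afterRestart = { lastRunnerMessage: 1_000, serverCreated: 50_000, timeout };

console.log(isRunnerTimedOut({ ...afterRestart, now: 60_000 })); // false (10s since restart)
console.log(isRunnerTimedOut({ ...afterRestart, now: 90_000 })); // true (40s since restart)
```

Without the Math.max, the first call would already report a timeout, since 59 seconds have passed since the runner's last message even though the server was down for most of that time.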

SEAPUNK commented 8 years ago

Scenario 2:

Clients just emit an "error" event on that handler every time a status check fails. Job timeout checking should be done by the server, to maintain consistency. The user can implement custom logic for error handling.
Runners build an unbounded buffer of messages to send to the server. A runner can run out of memory and crash if the buffer grows too large. (Runners should be separate processes anyway, since they can crash for other reasons too.)
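
The runner-side buffering could look roughly like the sketch below. The Outbox class and the send callback are hypothetical stand-ins for whatever transport the runner uses, and send is synchronous here only to keep the example short; a real transport would most likely be asynchronous.

```javascript
// Hypothetical sketch: Outbox and send are illustrative names, not jobber's API.
class Outbox {
  constructor(send) {
    this.send = send; // assumed to throw while the server is unreachable
    this.queue = [];  // deliberately unbounded, as described above
  }

  enqueue(message) {
    this.queue.push(message);
  }

  // Deliver queued messages in order. Stop at the first failure and keep
  // the undelivered messages for the next attempt.
  flush() {
    while (this.queue.length > 0) {
      try {
        this.send(this.queue[0]);
      } catch (err) {
        return false;
      }
      this.queue.shift(); // drop a message only after it was sent
    }
    return true;
  }
}

// Simulated transport: fails until serverUp is flipped to true.
let serverUp = false;
const delivered = [];
const outbox = new Outbox((message) => {
  if (!serverUp) throw new Error("server unreachable");
  delivered.push(message);
});

outbox.enqueue({ jobId: 1, status: "done" });
outbox.enqueue({ jobId: 2, status: "failed" });
console.log(outbox.flush(), outbox.queue.length); // false 2

serverUp = true;
console.log(outbox.flush(), outbox.queue.length); // true 0
```

Because nothing caps the queue, a long outage means memory use grows with every job response, which is the crash risk mentioned above.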

SEAPUNK commented 8 years ago

I think this answers most of the fault tolerance questions.

SEAPUNK / jobber