Resilient scheduler RPC

depombo commented 2 years ago

Replace hacky scheduler RPC with a proper one, and beef up child process error handling such that it either respawns the child process or kills the parent, too, on failure (so ECS can respawn everything)

mtp1376 commented 2 years ago

Trying to understand the overall flow of the IaSQL, I came up with the following diagram: IaSQL drawio

Basically we're using Graphile worker defined in the scheduler.ts to do the actual job, and we're using the iasql_operation table as an intermediate table to keep the output, the state, and the finish time for an operation. The until_iasql_operation function initiates the worker by using graphile_worker.add_job(...) and creates a while loop watching for the end_time column in iasql_operation table to wait for the result of the worker.

It might worth it to put these in some kind of documentation.

mtp1376 commented 2 years ago

Based on the following diagram:

Regarding "beef up child process error handling such that it either respawns the child process or kills the parent" part of the task definition:

The parent of the child process is the Express.js server, so if we kill the parent on error the whole container will be restarted. Is that ok?
I need some data on the previous failures that have happened on the child process. Like it has exited, etc. For example if it has exited before and we have failed to communicate with it through the RPC, then we might consider re-forking it

mtp1376 commented 2 years ago

One alternative approach might be to use a separate container to do the scheduler.ts's job. Currently, it is tightly coupled to the Express.js server and it cannot be scaled independent of it. If we separate it to its own container:

We can scale it independent of the API server
- We can use a mechanism like consistent-hashing to assign a set of databases to a single worker, but not all of the databases
We can keep using the RPC behavior, and maybe go for gRPC instead

dfellis commented 2 years ago

The parent of the child process is the Express.js server, so if we kill the parent on error the whole container will be restarted. Is that ok?

So respawning the child correctly is the ideal, but if the parent dies as well, at least Fargate will notice and tear down that container and fire up a new one, so it's "less bad" than the current situation where the child process can die and the engine no longer runs any graphile workers and all of the functions run forever.

I need some data on the previous failures that have happened on the child process. Like it has exited, etc. For example if it has exited before and we have failed to communicate with it through the RPC, then we might consider re-forking it

This has happened once in the past. We think an uncaught exception bubbled up to the top and killed it, but we aren't sure because the logging situation on the child process isn't as good as the parent process.

One alternative approach might be to use a separate container to do the scheduler.ts's job. Currently, it is tightly coupled to the Express.js server and it cannot be scaled independent of it. If we separate it to its own container:

Definitely, and something we have considered in the past for the reasons you mentioned (except for gRPC, which is probably overkill here?). The reasons why we haven't are basically:

We're too small for this to matter, and it's possible the system will scale in a different way than we expected. (Unlikely, I think, but perhaps the express server hits its limit before the graphile workers and then we have only a single scaling layer and some dividing strategy for the graphile workers; I'd personally go with rendezvous hashing over consistent hashing because we don't really need a gossip protocol when they all have the centralized metadata repo they can share info between each other.)
The longer we can keep the staging and production environments as similar to the local and CI environments, the fewer "gotchas" we'll hit in production that weren't caught during development and testing. So keeping it coupled like this reduces the possible failure scenarios.
Least important, but also a reason behind the decision. Our current deployment process with IaSQL can't exactly use IaSQL itself. We have plans to improve it (and soon) but changing the deployment at the moment is more complicated than it needs to be, so combine that with neither process being anywhere near the scaling limits, the testing issues, etc and it just hasn't been prioritized to scale these pieces independently.

mtp1376 commented 2 years ago

So respawning the child correctly is the ideal, but if the parent dies as well, at least Fargate will notice and tear down that container and fire up a new one, so it's "less bad" than the current situation where the child process can die and the engine no longer runs any graphile workers and all of the functions run forever.

The code flow does not suggest any "normal" exit of the child process. The only case that the child process might exit is an ungraceful one (like exit signals from the OS, memory leak, etc). For the respwning, I'm investigating this solution currently on the parent:

child?.on('exit', (...args) => {
  child = childProcess.fork(`${__dirname}/scheduler.js`);
})

This has happened once in the past. We think an uncaught exception bubbled up to the top and killed it, but we aren't sure because the logging situation on the child process isn't as good as the parent process.

Seems like all of the exceptions are kind of managed. It might've been something else (if the source code has not drastically changed).

mtp1376 commented 2 years ago

Regarding the use of JSON-RPC, there are some considerations:

You had mentioned multitransport-jsonrpc in the source code, but I think you're not actively maintaining it at the moment. Is it ok to use it?
I searched for the JSON-RPC implementation on Node.js and couldn't find a library with +100 stars. Maybe you know a good one?

Besides, this might be a good alternative to parent-child model:

Instead of using the current parent-child model, we can run the scheduler.ts as a background process in the Docker entrypoint. I mean having a Docker entrypoint that is leveraging a tool like forever to keep the scheduler.ts running.
We can use a RESTful solution for the RPC as well, instead of JSON-RPC. This won't fix the error-handling on parent, but we can put the error handling (and reporting) on the scheduler itself, as it's a partially-independent service.

dfellis commented 2 years ago

Regarding the use of JSON-RPC, there are some considerations:

You had mentioned multitransport-jsonrpc in the source code, but I think you're not actively maintaining it at the moment.

Yeah, I mentioned it as unfortunately dead, not that we should consider it our first option. I mentioned it because if we decide nothing else fits very well, I could revive that library? I am pretty sure I solved the parent-child error handling more thoroughly when I was maintaining it.

I searched for the implementation on Node.js and couldn't find a library with +100 stars. Maybe you know a good one?

Nope. The only community that seems to still care about JSON-RPC (based on mailing list activity) is the Java community. I also looked for a "better" JSON-RPC library than my own before I suggested mine, because I'd rather not divide my effort on maintaining a library that we only use for one of its features. (The JSON-RPC over Child Process API feature.)

This is another possible solution:

Instead of using the current parent-child model, we can run the scheduler.ts as a background process in the Docker entrypoint. I mean having a Docker entrypoint that is leveraging a tool like forever to keep the scheduler.ts running.

We could, though we'd need to figure out some other way for the two processes to talk to each other. Maybe making a FIFO/Named Pipe? It would preclude anyone from running this project on Windows, though that is not a major concern.

We can use a RESTful solution for the RPC as well, instead of JSON-RPC.

This may be best. We could have it running on a non-standard port (maybe 14527, vaguely IASQL (1 = I, 4 = A, 5 = S, 2 = q, 7 = L (upside down)) and then just use fetch in the scheduler-api.ts file. We already don't have automatic remote function discovery, so there's no real advantage to maintaining an RPC paradigm.

depombo commented 2 years ago

Have looked into https://nodejs.org/api/cluster.html or considered it? If so, curious why we ruled it out

dfellis commented 2 years ago

Have looked into https://nodejs.org/api/cluster.html or considered it? If so, curious why we ruled it out

Well, we only have one child and all of the same things we need to manually do for the single-child-process management also needs to be done there (and it has a similar message-passing structure). It just adds an extra layer without a benefit, afaict.

alantech / iasql

Resilient scheduler RPC #955