depombo closed this issue 2 years ago
Trying to understand the overall flow of IaSQL, I came up with the following diagram:
Basically we're using the Graphile Worker defined in scheduler.ts to do the actual job, and we're using the iasql_operation table as an intermediate table to keep the output, the state, and the finish time for an operation. The until_iasql_operation function initiates the worker by calling graphile_worker.add_job(...) and runs a loop watching the end_time column of the iasql_operation table to wait for the result of the worker.
It might be worth putting this in some kind of documentation.
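A minimal TypeScript sketch of that wait loop (illustrative only — the real until_iasql_operation is a Postgres function, and the `operations` map below is a made-up stand-in for the iasql_operation table):

```typescript
// Stand-in for the iasql_operation table: operation id -> { endTime, output }.
type Operation = { endTime: Date | null; output: string | null };
const operations = new Map<string, Operation>();

// Rough shape of until_iasql_operation: enqueue the job, then poll until the
// worker writes end_time, and return the stored output.
async function untilIasqlOperation(opId: string): Promise<string | null> {
  // (In the real flow, graphile_worker.add_job(...) would be called here.)
  for (;;) {
    const op = operations.get(opId);
    if (op?.endTime) return op.output; // worker finished; hand back its output
    await new Promise((r) => setTimeout(r, 10)); // back off before re-checking
  }
}
```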
Regarding "beef up child process error handling such that it either respawns the child process or kills the parent" part of the task definition:
One alternative approach might be to use a separate container to do scheduler.ts's job. Currently, it is tightly coupled to the Express.js server and cannot be scaled independently of it. If we separate it into its own container:
The parent of the child process is the Express.js server, so if we kill the parent on error the whole container will be restarted. Is that ok?
So respawning the child correctly is the ideal, but if the parent dies as well, at least Fargate will notice and tear down that container and fire up a new one, so it's "less bad" than the current situation where the child process can die and the engine no longer runs any graphile workers and all of the functions run forever.
I need some data on the previous failures that have happened in the child process, e.g. whether it has exited before. For example, if it has exited before and we have failed to communicate with it through the RPC, then we might consider re-forking it.
This has happened once in the past. We think an uncaught exception bubbled up to the top and killed it, but we aren't sure because the logging situation on the child process isn't as good as the parent process.
One alternative approach might be to use a separate container to do scheduler.ts's job. Currently, it is tightly coupled to the Express.js server and cannot be scaled independently of it. If we separate it into its own container:
Definitely, and something we have considered in the past for the reasons you mentioned (except for gRPC, which is probably overkill here?). The reasons why we haven't are basically:
So respawning the child correctly is the ideal, but if the parent dies as well, at least Fargate will notice and tear down that container and fire up a new one, so it's "less bad" than the current situation where the child process can die and the engine no longer runs any graphile workers and all of the functions run forever.
The code flow does not suggest any "normal" exit of the child process. The only case in which the child process might exit is an ungraceful one (exit signals from the OS, a memory leak, etc.). For the respawning, I'm currently investigating this solution on the parent:
child?.on('exit', function respawn() {
  // Re-fork the scheduler, and re-attach this handler so the new child is monitored too
  child = childProcess.fork(`${__dirname}/scheduler.js`);
  child.on('exit', respawn);
});
This has happened once in the past. We think an uncaught exception bubbled up to the top and killed it, but we aren't sure because the logging situation on the child process isn't as good as the parent process.
Seems like all of the exceptions are kind of managed. It might've been something else (if the source code has not drastically changed).
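One cheap way to improve that child-side logging (a sketch only — it assumes nothing about the existing logger in scheduler.ts) is to install last-resort handlers so a fatal error at least leaves a trace before the process dies:

```typescript
// Last-resort handlers for the child process: log the failure and exit
// non-zero so the parent's 'exit' listener can notice and respawn. (Sketch.)
process.on('uncaughtException', (err: Error) => {
  console.error('scheduler uncaught exception:', err.stack ?? err);
  process.exit(1);
});
process.on('unhandledRejection', (reason: unknown) => {
  console.error('scheduler unhandled rejection:', reason);
});
```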
Regarding the use of JSON-RPC, there are some considerations: you had mentioned multitransport-jsonrpc in the source code, but I think you're not actively maintaining it at the moment. Is it ok to use it?
Besides, this might be a good alternative to the parent-child model: instead of using the current parent-child model, we can run scheduler.ts as a background process in the Docker entrypoint. I mean having a Docker entrypoint that leverages a tool like forever to keep scheduler.ts running.
You had mentioned multitransport-jsonrpc in the source code, but I think you're not actively maintaining it at the moment.
Yeah, I mentioned it as unfortunately dead, not that we should consider it our first option. I mentioned it because if we decide nothing else fits very well, I could revive that library? I am pretty sure I solved the parent-child error handling more thoroughly when I was maintaining it.
I searched for a Node.js implementation and couldn't find a library with 100+ stars. Maybe you know a good one?
Nope. The only community that seems to still care about JSON-RPC (based on mailing list activity) is the Java community. I also looked for a "better" JSON-RPC library than my own before I suggested mine, because I'd rather not divide my effort on maintaining a library that we only use for one of its features. (The JSON-RPC over Child Process API feature.)
This is another possible solution:
Instead of using the current parent-child model, we can run scheduler.ts as a background process in the Docker entrypoint. I mean having a Docker entrypoint that leverages a tool like forever to keep scheduler.ts running.
We could, though we'd need to figure out some other way for the two processes to talk to each other. Maybe making a FIFO/Named Pipe? It would preclude anyone from running this project on Windows, though that is not a major concern.
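For illustration, here is how two local processes could talk over a Unix domain socket, a close cousin of the FIFO idea (a sketch with a made-up socket path and toy message format; like a FIFO, this form is not portable to Windows):

```typescript
import { createServer, connect } from 'net';
import { tmpdir } from 'os';
import { join } from 'path';

// Made-up socket path; the scheduler side would listen here.
const sockPath = join(tmpdir(), `iasql-sched-${process.pid}.sock`);

// "Scheduler" side: echo each message back with an ack prefix (toy protocol).
const server = createServer((conn) => {
  conn.on('data', (buf) => conn.write(`ack:${buf}`));
});
server.listen(sockPath);

// "Engine" side: send one message and resolve with the reply.
function pingScheduler(msg: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const client = connect(sockPath, () => client.write(msg));
    client.on('data', (buf) => { resolve(buf.toString()); client.end(); });
    client.on('error', reject);
  });
}
```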
We can use a RESTful solution for the RPC as well, instead of JSON-RPC.
This may be best. We could have it running on a non-standard port (maybe 14527, vaguely IASQL: 1 = I, 4 = A, 5 = S, 2 = q, 7 = L upside down) and then just use fetch in the scheduler-api.ts file. We already don't have automatic remote function discovery, so there's no real advantage to maintaining an RPC paradigm.
Have looked into https://nodejs.org/api/cluster.html or considered it? If so, curious why we ruled it out
Well, we only have one child, and all of the same things we need to do manually for single-child-process management also need to be done there (and it has a similar message-passing structure). It just adds an extra layer without a benefit, afaict.
Replace hacky scheduler RPC with a proper one, and beef up child process error handling such that it either respawns the child process or kills the parent, too, on failure (so ECS can respawn everything)