DevelopingSpace / starchart

A self-serve tool for managing custom domains and certificates
MIT License

Pick and implement a background job strategy #145

Closed · humphd closed this 1 year ago

humphd commented 1 year ago

Starchart has a number of long-running tasks to accomplish. For example:

We also have to deal with asynchronous processes failing (network issues, rate limiting, etc.) and retry jobs (e.g., with exponential backoff) until they succeed.

None of the tasks we need to run in the background is CPU intensive. All of our jobs are network- vs. compute-constrained; that is, we are going to be waiting on API calls to Route53, MySQL, Let's Encrypt, or Exchange. Furthermore, we are going to run in Docker. Docker containers are heavily biased toward a single-process model. This helps us in picking our background job architecture.

Node has a number of recommended strategies for handling background jobs:

  1. Use Promises. We could run many long-lived promises in the main app. The only real cost is the memory held by unresolved promises.
  2. Use one or more child processes. We could put background work in separate processes outside of the main app.
  3. Use worker threads. We could create a pool of worker threads to handle our jobs.
  4. Use a worker queue. The most popular option for doing this in node is BullMQ backed by Redis.

If we go with 1., we gain simplicity in the success path. It's fairly easy to reason about Promises, and they don't introduce any new dependencies. However, we'd start to incur extra complexity once we got into adding scheduled jobs, retrying failed jobs, and scaling horizontally (i.e., there is no shared state between instances).
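To give a sense of what that hand-rolled complexity looks like, here is a rough sketch (not from the codebase) of retrying an async task with exponential backoff using nothing but Promises; `task` stands in for any of our network calls (Route53, Let's Encrypt, etc.):

```ts
// Hypothetical sketch: retry an async task with exponential backoff
// using plain Promises. `task` is a stand-in for any network call.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // Exponential backoff: 1s, 2s, 4s, 8s, ...
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Unreachable, but satisfies TypeScript's control-flow analysis
  throw new Error('unreachable');
}
```

The per-task logic is simple enough, but all of this state lives in the web process's memory: a redeploy loses in-flight work, and nothing coordinates scheduling or retries across multiple instances.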

If we go with 2., we would have to expand outside of a single container, running the child process(es) in separate containers. This adds some complexity, but is a reasonable approach. It still doesn't solve the scheduling problem discussed above.

The point of using option 3. would be to help with CPU-bound operations, where we need to expand to use more cores. Our problem isn't really CPU, so I'm not sure this is a good fit. Nor does it solve the scheduling problem.

The final option, 4, would solve both the scheduling and scaling issues. The job queue(s) would be managed in a single shared Redis instance, and we could run as many worker processes in their own containers as we need to service the load.
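To make option 4 concrete, here is a minimal sketch of the producer side with BullMQ (the `dns-records` queue name, job names, and payload fields are purely illustrative, not Starchart code): per-job retries with exponential backoff and repeatable (scheduled) jobs come for free as job options.

```ts
import { Queue } from 'bullmq';

// Shared Redis connection details (illustrative defaults)
const connection = { host: 'redis', port: 6379 };

// The queue lives in Redis, so jobs survive deploys and can be
// consumed by any number of worker containers.
const dnsQueue = new Queue('dns-records', { connection });

// One-off job with retries and exponential backoff handled by BullMQ
await dnsQueue.add(
  'create-record',
  { username: 'someuser', fqdn: 'someuser.example.com' },
  {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  }
);

// Repeatable (scheduled) job, e.g. checking certificate expiry every hour
await dnsQueue.add('check-cert-expiry', {}, { repeat: { every: 60 * 60 * 1000 } });
```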

This video is a really useful introduction to how Bull + Redis works, and what you can do with it:

https://www.youtube.com/watch?v=wAEMXVcRbgU

With Bull + Redis, we could do all of the following:

Plus, we have experience using this technology in Telescope, where Bull runs our parser service.
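For completeness, the consumer side is similarly small. A sketch of a worker process (again with placeholder names, not the actual Starchart code) that runs in its own container and pulls jobs from the same Redis-backed queue:

```ts
import { Worker } from 'bullmq';

const connection = { host: 'redis', port: 6379 };

// The worker only needs the queue name and a processor function;
// BullMQ pulls jobs from Redis and applies the retry/backoff options
// that were set when the job was added.
const worker = new Worker(
  'dns-records',
  async (job) => {
    switch (job.name) {
      case 'create-record':
        // call Route53 here
        break;
      case 'check-cert-expiry':
        // call Let's Encrypt / notify via Exchange here
        break;
    }
  },
  { connection }
);

worker.on('completed', (job) => console.log(`${job.id} done`));
worker.on('failed', (job, err) => console.error(`${job?.id} failed:`, err));
```

Because the worker is a separate process talking to shared Redis, scaling out is just running more copies of this container.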

There might be other options we should explore. I know that there are many other queue technologies out there; however, many add a lot more complexity than we need. Maybe there is something simpler that I'm not considering?

ghost commented 1 year ago

My issue with 1, 2, and 3 is that they don't persist. If we do a deploy, pre-existing jobs will be lost and processes will remain in a limbo state.

So my stance on this is: either use the Redis-backed queue engine you suggested in 4, or write a background process that grabs tasks from a MySQL table. I have never worked with BullMQ, but I suspect that it might limit the complexity compared to my other possible solution.
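For comparison, a hand-rolled MySQL-backed approach would look roughly like the polling loop below (a purely hypothetical sketch using the `mysql2` client and an imagined `jobs` table; none of these names exist in Starchart), and that's before adding row locking, backoff, and scheduled jobs:

```ts
import mysql from 'mysql2/promise';

// Hypothetical polling worker against an imagined `jobs` table
// with columns (id, type, payload, status).
const pool = mysql.createPool({ host: 'mysql', user: 'starchart', database: 'starchart' });

async function pollOnce() {
  const [rows] = await pool.query(
    "SELECT id, type, payload FROM jobs WHERE status = 'pending' LIMIT 10"
  );
  for (const job of rows as any[]) {
    try {
      // ... do the work for job.type here ...
      await pool.query("UPDATE jobs SET status = 'done' WHERE id = ?", [job.id]);
    } catch {
      await pool.query("UPDATE jobs SET status = 'failed' WHERE id = ?", [job.id]);
    }
  }
}

// Naive scheduler: poll every 30 seconds
setInterval(() => pollOnce().catch(console.error), 30_000);
```

Even this naive version still needs locking (so two workers don't grab the same job), retry bookkeeping, and repeatable jobs, which is exactly the complexity BullMQ absorbs.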

humphd commented 1 year ago

I agree with you. I think implementing our own queue logic is going to become too big of a project.

Why don't we try using BullMQ + Redis and see how it goes? The queue/worker logic doesn't add much code to our project.

Did you want me to implement it in our project, or are you (or someone else) interested in doing the work? It might be nice to land the queue/worker code in 0.2 so we can use it in 0.3.

humphd commented 1 year ago

Fixed by https://github.com/Seneca-CDOT/starchart/pull/161