StateVoicesNational / Spoke

mass-contact text/SMS distribution tool

Use `kue` instead of jobs table #526

Closed: ben-pr-p closed this issue 3 years ago

ben-pr-p commented 6 years ago

My understanding of the current setup is that asynchronous jobs are handled via a jobs table, and a separate process (or a separate Lambda event) goes and processes them every 5 minutes (or however often one configures the Lambda events).
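Roughly, that polling pattern looks like the sketch below (illustrative only, using a hypothetical `job_request` table and handler map, not the actual Spoke code):

```js
// Illustrative jobs-table polling loop, not Spoke's actual implementation.
const knex = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL })

// hypothetical map of job type -> handler function
const jobHandlers = {
  upload_contacts: async (job) => { /* parse csv, insert contacts ... */ },
  export: async (job) => { /* build export, email a download link ... */ }
}

async function processPendingJobs(db) {
  const pending = await db('job_request').where({ assigned: false })
  for (const job of pending) {
    // claim the job so another invocation doesn't pick it up too
    await db('job_request').where({ id: job.id }).update({ assigned: true })
    await jobHandlers[job.job_type](job)
  }
}

// kicked off every 5 minutes by cron, a scheduled Lambda event, or a timer
setInterval(() => processPendingJobs(knex), 5 * 60 * 1000)
```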

Advantages of dropping the Postgres solution and moving to Redis ([kue](http://automattic.github.io/kue/)):

schuyler1d commented 6 years ago

When JOBS_SAME_PROCESS=1 most of the 'jobs' don't need to be a job at all. Maybe that should be the first step?

If we use Redis, I would want to make sure it could be optional (i.e. ok to stay in the DB by default) and independent of the Redis caching server intended for the scaling work, since the long-running jobs (csv processing) deal with large storage and I wouldn't want that to make the cache size unstable.

From MoveOn's perspective, the most important job to track is actually the data-warehouse version of the contact upload -- everything else could just be synchronous (though the regular csv contact upload might usefully be async).

Are you planning on expanding background tasks and want a proper system for it, or are you focusing on the current ones?

ben-pr-p commented 6 years ago

Both – I think developing/modifying the current jobs, understanding what the current jobs do, and adding new jobs would all be easier with an external job queue library than with the in-house solution that exists right now (we would also get a bunch of features for free that we'll probably want in the future).

Some things we can improve:

* Clarity in code around where jobs are and which function does what job

Some solutions / options – [kue](https://github.com/Automattic/kue) and [bull](https://github.com/OptimalBits/bull) are the most feature-rich, but I do agree with keeping Redis optional.

There are several libraries that mimic kue's functionality backed by Postgres. I looked at pg-boss, pg-job-queue, and a few others. My favorite API was pg-boss, since it feels like a typical javascript event handler / server.

If we used pg-boss, we'd have a worker/index.js that exported one function that registered the job handlers and ran boss.start() – then in server/index.js, we'd have

```js
if (process.env.JOBS_SAME_PROCESS == '1') {
  require('../worker/index.js').start()
}
```

and then everything else would be the same, right? Or am I misunderstanding how the job queue works on Lambda?
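For concreteness, here is a minimal sketch of what that worker/index.js might look like, assuming pg-boss's queue API (method names such as `work`/`send` differ across pg-boss versions, and the job names and handlers here are hypothetical, not existing Spoke code):

```js
// worker/index.js -- illustrative pg-boss sketch only
const PgBoss = require('pg-boss')

const boss = new PgBoss(process.env.DATABASE_URL)

// hypothetical handlers standing in for Spoke's current job functions
async function uploadContacts(job) { /* parse csv, insert contacts ... */ }
async function exportCampaign(job) { /* build export, email a download link ... */ }

async function start() {
  await boss.start()
  await boss.work('upload_contacts', uploadContacts)
  await boss.work('export', exportCampaign)
}

module.exports = { start }
```

Resolvers would then enqueue work with something like `boss.send('upload_contacts', { campaignId })` instead of inserting a row into the jobs table.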

schuyler1d commented 6 years ago

Hm. So the goal of JOBS_SAME_PROCESS is to keep/use the same process/thread as the request to process the job. In a 'thread-rich' environment (where there is no hard limit on the number of threads, as in AWS Lambda, or where busy/blocked threads trigger scaling), this is useful.

In a Heroku or classic server environment, your simultaneous web requests are often limited by a static thread count (with nginx/apache). If a single request continues to run for a long time, then that's bad because it blocks use of that thread for any new incoming requests.

So JOBS_SAME_PROCESS is kind of 'pre-dispatch' -- the point is there isn't a process queue at all in that deployment, and I'd want to keep it that way -- maybe make it even more so and not create the JOBS at all.
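In other words, the dispatch decision might look roughly like this (the helper names are hypothetical stand-ins, not the actual Spoke code):

```js
// jobHandlers and enqueue are hypothetical stand-ins for Spoke's own code
const jobHandlers = { export: async (payload) => { /* do the work ... */ } }
const enqueue = async (jobType, payload) => { /* insert into jobs table / queue */ }

// Run the work inline when JOBS_SAME_PROCESS is set, otherwise enqueue it.
async function dispatchJob(jobType, payload) {
  if (process.env.JOBS_SAME_PROCESS === '1') {
    // thread-rich deployments (e.g. Lambda): do the work in the request
    // itself, so no job row or queue entry is ever created
    return jobHandlers[jobType](payload)
  }
  // classic servers: persist the job and let a worker pick it up later
  return enqueue(jobType, payload)
}
```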

My second concern with Redis job queues (or other job queues like SQS) specifically for Spoke is the contacts upload one -- besides that, almost none of the other jobs really require an asynchronous job; they complete pretty quickly and could be synchronous for everyone. For the contact upload, however, the issue is that the 'job' architecture currently stores the entire file on the queue, which could be enormous -- and Redis seems like a bad place (even worse than a database) to store large file objects...

So that's why I'd like us to focus on use-cases first.

ben-pr-p commented 6 years ago

Two main jobs I'm planning:

  1. OSDI list exchange (#507) – we'll want to do this asynchronously on the server (like data warehousing), and communicate to users the number of pages completed and the number of contacts successfully synced per page.

  2. Cell/landline lookup, which I'm still speccing out (issue coming later). I think for asynchronously loaded contacts (like OSDI list exchange, or potentially data warehousing), it should act as a filter on the data being loaded in. For user csv uploads I still have to think it through, but right now in my mind it's a button on the campaign ("Filter out Landlines") that goes and removes landlines asynchronously from the contacts. Communicating progress and completion effectively would also be important here, since you wouldn't want to start the campaign until your landlines were gone.

ben-pr-p commented 6 years ago

Upon closer investigation, it looks like there's a pretty feature-rich job queue already built here – I'll document the job creation, updating, and status-fetching APIs a bit more, and that'll make it easier to see which features, if any, are actually missing.

bchrobot commented 6 years ago

It would be great to add some kind of rate limiting to the text dispatch. Even with texts getting distributed among 100 Twilio numbers we are still having trouble with carriers flagging messages for texters who are especially speedy. (I searched through open and closed issues and didn't find anything existing about this)

This may belong in a separate issue -- if so I'll move it.

harpojaeger commented 6 years ago

I think @bchrobot makes a good point; using kue or something like it would allow the actual sending of texts to be scaled and distributed among Twilio numbers and over time much more intelligently.

My OSDI survey question action handler also runs synchronously right now, which is not a great practice for an external API call, and it doesn't fail in an intelligent way. kue has built-in retries with exponential backoff, which would be exactly right for that use case.
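For reference, a rough sketch of how kue expresses that retry-with-backoff behavior (the job name and the `postToOsdiEndpoint` call are hypothetical, not Spoke's actual handler):

```js
const kue = require('kue')
const queue = kue.createQueue()

// hypothetical external API call for the OSDI action handler
const postToOsdiEndpoint = async (data) => { /* POST to the OSDI endpoint */ }

// enqueue an action-handler call with retries and exponential backoff
queue.create('osdi_action', { questionResponseId: 123 })
  .attempts(5)                                        // retry up to 5 times
  .backoff({ delay: 60 * 1000, type: 'exponential' })
  .save()

// worker side: make the external API call and report success/failure
queue.process('osdi_action', async (job, done) => {
  try {
    await postToOsdiEndpoint(job.data)
    done()
  } catch (err) {
    done(err)                                         // failed attempts get retried
  }
})
```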

Definitely further down the line, but I could see this being very worth doing post-November.

harpojaeger commented 6 years ago

Alternatively, I could implement kue just for my action handler and we could see if we like it before starting to deploy it in other parts of the application. Thoughts?

joemcl commented 5 years ago

Hey @lperson it would be really cool to get this moved along so we can implement OSDI stuff better for Spoke. @harpojaeger @ben-pr-p @bchrobot are there pieces of this that @lperson or I can dive in to?

lperson commented 5 years ago

@joemcl what is the exact use case you're thinking about?

schuyler1d commented 5 years ago

I'd like to avoid Redis being a hard requirement. If someone wants to support Kue for more job processing then:

joemcl commented 5 years ago

> @joemcl what is the exact use case you're thinking about?

@lperson getting this going is a precursor to using #690 , and for getting an API connection working that @codygordon can connect to for other stuff, I believe. cc: @deasterdaywfp .

lperson commented 5 years ago

@joemcl an API connection to what?

joemcl commented 5 years ago

@lperson Votivate. VAN/Votebuilder. Other platforms.

lperson commented 5 years ago

@joemcl I don't think kue is the primary issue here for you. I'd be interested in how you want to integrate and whether @harpojaeger's work satisfies that requirement. Are you only interested in survey responses going to external systems? Are you doing that now with manual exports? Are there other OSDI APIs you'd want to use?

schuyler1d commented 5 years ago

++ let's not overload this issue with possible use-cases -- let's make the use-cases separate issues, and if relevant link them here.

ibrand commented 4 years ago

@schuyler1d Am I right that this issue currently would look like this reply of yours if it were to be implemented? Is this something we want to keep open?

> I'd like to avoid Redis being a hard requirement. If someone wants to support Kue for more job processing then:
>
> • It needs to be possible for Kue to run on a separate Redis instance from the application cache -- to avoid throwing out the cache from a large upload
> • Please abstract it out, so that the existing jobs processing is the default queueing option
> • Respect JOBS_SAME_PROCESS being synchronous when turned on -- for Twilio this allows processing (and sending) without ANY job queue, which is quite useful, and any job that will reliably complete in under 30 seconds should run synchronously with the request when JOBS_SAME_PROCESS is on
> • The kue implementation would need in-application guards against memory exhaustion -- it shouldn't be possible for an instance to accidentally lose (or be unable to send/receive) messages because a file that is too large was uploaded to the Redis instance.
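If someone does take the kue route, the first bullet above could be satisfied by pointing kue at its own Redis URL, roughly like this (`KUE_REDIS_URL` is a hypothetical env var, not an existing Spoke setting):

```js
const kue = require('kue')

// Give kue a dedicated Redis instance so large job payloads can't evict
// the application cache. KUE_REDIS_URL is hypothetical; fall back to the
// shared REDIS_URL only if no dedicated instance is configured.
const queue = kue.createQueue({
  redis: process.env.KUE_REDIS_URL || process.env.REDIS_URL
})
```
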
Frydafly commented 3 years ago

Following up here since this issue is 3 years old -- is this still a priority? Have there been any updates?

schuyler1d commented 3 years ago

I think with src/extensions/job-runners/ there's still room for someone to implement this with kue, but they would now be directed to do it in that extension space, so I think we can close this.