StateVoicesNational / Spoke

mass-contact text/SMS distribution tool

Use `kue` instead of jobs table #526

Closed: ben-pr-p closed this issue 3 years ago

ben-pr-p commented 6 years ago

My understanding of the current setup is that asynchronous jobs are handled via a jobs table, and a separate process (or a separate Lambda event) goes and processes them every 5 minutes (or however often one configures the Lambda events).
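Roughly, that polling pattern looks like the sketch below (illustrative only, using a hypothetical `job_request` table and handler map, not the actual Spoke code):

```js
// Illustrative jobs-table polling loop, not Spoke's actual implementation.
const knex = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL })

// hypothetical map of job type -> handler function
const jobHandlers = {
  upload_contacts: async (job) => { /* parse csv, insert contacts ... */ },
  export: async (job) => { /* build export, email a download link ... */ }
}

async function processPendingJobs(db) {
  const pending = await db('job_request').where({ assigned: false })
  for (const job of pending) {
    // claim the job so another invocation doesn't pick it up too
    await db('job_request').where({ id: job.id }).update({ assigned: true })
    await jobHandlers[job.job_type](job)
  }
}

// kicked off every 5 minutes by cron, a scheduled Lambda event, or a timer
setInterval(() => processPendingJobs(knex), 5 * 60 * 1000)
```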

Advantages of dropping the Postgres solution and moving to Redis ([kue](http://automattic.github.io/kue/)):

schuyler1d commented 6 years ago

When JOBS_SAME_PROCESS=1 most of the 'jobs' don't need to be a job at all. Maybe that should be the first step?

If we use Redis, I would want to make sure it could be optional (i.e. ok to stay in the DB by default) and independent of the Redis caching server intended for the scaling work, since the long-running jobs (csv processing) deal with large storage and I wouldn't want that to make the cache size unstable.

From MoveOn's perspective, the most important job to track is actually the data-warehouse version of the contact upload -- everything else could just be synchronous (though the regular csv contact upload might usefully be async).

Are you planning on expanding background tasks and want a proper system for it, or are you focusing on the current ones?

ben-pr-p commented 6 years ago

Both – I think developing/modifying the current jobs, understanding what the current jobs do, and adding new jobs would all be easier with an external job queue library than with the in-house solution that exists right now (we would also get a bunch of features for free that we'll probably want in the future).

Some things we can improve:

* Clarity in code around where jobs are and which function does what job

Some solutions / options – [kue](https://github.com/Automattic/kue) and [bull](https://github.com/OptimalBits/bull) are the most feature-rich, but I do agree with keeping Redis optional.

There are several libraries that mimic kue's functionality backed by Postgres. I looked at pg-boss, pg-job-queue, and a few others. My favorite API was pg-boss, since it feels like a typical javascript event handler / server.

If we used pg-boss, we'd have a worker/index.js that exported one function that registered the job handlers and ran boss.start() – then in server/index.js, we'd have

```js
if (process.env.JOBS_SAME_PROCESS == '1') {
  require('../worker/index.js').start()
}
```

and then everything else would be the same, right? Or am I misunderstanding how the job queue works on Lambda?
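For concreteness, here is a minimal sketch of what that worker/index.js might look like, assuming pg-boss's queue API (method names such as `work`/`send` differ across pg-boss versions, and the job names and handlers here are hypothetical, not existing Spoke code):

```js
// worker/index.js -- illustrative pg-boss sketch only
const PgBoss = require('pg-boss')

const boss = new PgBoss(process.env.DATABASE_URL)

// hypothetical handlers standing in for Spoke's current job functions
async function uploadContacts(job) { /* parse csv, insert contacts ... */ }
async function exportCampaign(job) { /* build export, email a download link ... */ }

async function start() {
  await boss.start()
  await boss.work('upload_contacts', uploadContacts)
  await boss.work('export', exportCampaign)
}

module.exports = { start }
```

Resolvers would then enqueue work with something like `boss.send('upload_contacts', { campaignId })` instead of inserting a row into the jobs table.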

schuyler1d commented 6 years ago

Hm. So the goal of JOBS_SAME_PROCESS is to keep/use the same process/thread as the request to process the job. In a 'thread-rich' environment (where there is no hard limit on the number of threads, as in AWS Lambda, or where busy/blocked threads trigger scaling), this is useful.

In a Heroku or classic server environment, your simultaneous web requests are often limited by a static thread count (with nginx/apache). If a single request continues to run for a long time, then that's bad because it blocks use of that thread for any new incoming requests.

So JOBS_SAME_PROCESS is kind of 'pre-dispatch' -- the point is there isn't a process queue at all in that deployment, and I'd want to keep it that way -- maybe make it even more so and not create the JOBS at all.
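In other words, the dispatch decision might look roughly like this (the helper names are hypothetical stand-ins, not the actual Spoke code):

```js
// jobHandlers and enqueue are hypothetical stand-ins for Spoke's own code
const jobHandlers = { export: async (payload) => { /* do the work ... */ } }
const enqueue = async (jobType, payload) => { /* insert into jobs table / queue */ }

// Run the work inline when JOBS_SAME_PROCESS is set, otherwise enqueue it.
async function dispatchJob(jobType, payload) {
  if (process.env.JOBS_SAME_PROCESS === '1') {
    // thread-rich deployments (e.g. Lambda): do the work in the request
    // itself, so no job row or queue entry is ever created
    return jobHandlers[jobType](payload)
  }
  // classic servers: persist the job and let a worker pick it up later
  return enqueue(jobType, payload)
}
```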

My second concern with Redis job queues (or other job queues like SQS) specifically for Spoke is the contacts upload one -- besides that, almost none of the other jobs really require an asynchronous job; they complete pretty quickly and could be synchronous for everyone. For the contact upload, however, the issue is that the 'job' architecture currently stores the entire file on the queue, which could be enormous -- and Redis seems like a bad place (even worse than a database) to store large file objects...

So that's why I'd like us to focus on use-cases first.

ben-pr-p commented 6 years ago

Two main jobs I'm planning:

  1. OSDI list exchange (#507) – we'll want to do this asynchronously on the server (like data warehousing), and communicate to users the number of pages completed and the number of contacts successfully synced per page.

  2. Cell/landline lookup, which I'm still speccing out (issue coming later). I think for asynchronously loaded contacts (like OSDI list exchange, or potentially data warehousing), it should act as a filter on the data being loaded in. For user csv uploads I still have to think it through, but right now in my mind it's a button on the campaign ("Filter out Landlines") that goes and removes landlines asynchronously from the contacts. Communicating progress and completion effectively would also be important here, since you wouldn't want to start the campaign until your landlines were gone.

ben-pr-p commented 6 years ago

Upon closer investigation, it looks like there's a pretty feature-rich job queue already built here – I'll document the job creation, updating, and status-fetching APIs a bit more, and that'll make it easier to see which features, if any, are actually missing.

bchrobot commented 6 years ago

It would be great to add some kind of rate limiting to the text dispatch. Even with texts getting distributed among 100 Twilio numbers we are still having trouble with carriers flagging messages for texters who are especially speedy. (I searched through open and closed issues and didn't find anything existing about this)

This may belong in a separate issue -- if so I'll move it.

harpojaeger commented 6 years ago

I think @bchrobot makes a good point; using kue or something like it would allow the actual sending of texts to be scaled and distributed among Twilio numbers and over time much more intelligently.

My OSDI survey question action handler also runs synchronously right now, which is not a great practice for an external API call, and it doesn't fail in an intelligent way. kue has built-in retries with exponential backoff, which would be exactly right for that use case.
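For reference, a rough sketch of how kue expresses that retry-with-backoff behavior (the job name and the `postToOsdiEndpoint` call are hypothetical, not Spoke's actual handler):

```js
const kue = require('kue')
const queue = kue.createQueue()

// hypothetical external API call for the OSDI action handler
const postToOsdiEndpoint = async (data) => { /* POST to the OSDI endpoint */ }

// enqueue an action-handler call with retries and exponential backoff
queue.create('osdi_action', { questionResponseId: 123 })
  .attempts(5)                                        // retry up to 5 times
  .backoff({ delay: 60 * 1000, type: 'exponential' })
  .save()

// worker side: make the external API call and report success/failure
queue.process('osdi_action', async (job, done) => {
  try {
    await postToOsdiEndpoint(job.data)
    done()
  } catch (err) {
    done(err)                                         // failed attempts get retried
  }
})
```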

Definitely further down the line, but I could see this being very worth doing post-November.

harpojaeger commented 6 years ago

Alternatively, I could implement kue just for my action handler and we could see if we like it before starting to deploy it in other parts of the application. Thoughts?

joemcl commented 5 years ago

Hey @lperson it would be really cool to get this moved along so we can implement OSDI stuff better for Spoke. @harpojaeger @ben-pr-p @bchrobot are there pieces of this that @lperson or I can dive in to?

lperson commented 5 years ago

@joemcl what is the exact use case you're thinking about?

schuyler1d commented 5 years ago

I'd like to avoid Redis being a hard requirement. If someone wants to support Kue for more job processing then:

joemcl commented 5 years ago

> @joemcl what is the exact use case you're thinking about?

@lperson getting this going is a precursor to using #690 , and for getting an API connection working that @codygordon can connect to for other stuff, I believe. cc: @deasterdaywfp .

lperson commented 5 years ago

@joemcl an API connection to what?

joemcl commented 5 years ago

@lperson Votivate. VAN/Votebuilder. Other platforms.

lperson commented 5 years ago

@joemcl I don't think kue is the primary issue here for you. I'd be interested in how you want to integrate and whether @harpojaeger's work satisfies that requirement. Are you only interested in survey responses going to external systems? Are you doing that now with manual exports? Are there other OSDI APIs you'd want to use?

schuyler1d commented 5 years ago

++ let's not overload this issue with possible use-cases -- let's make the use-cases separate issues, and if relevant link them here.

ibrand commented 4 years ago

@schuyler1d Am I right that this issue currently would look like this reply of yours if it were to be implemented? Is this something we want to keep open?

> I'd like to avoid Redis being a hard requirement. If someone wants to support Kue for more job processing then:
>
> • It needs to be possible for Kue to run on a separate Redis instance from the application cache -- to avoid throwing out the cache from a large upload
> • Please abstract it out, so that the existing jobs processing is the default queueing option
> • Respect JOBS_SAME_PROCESS being synchronous when turned on -- for Twilio this allows processing (and sending) without ANY job queue, which is quite useful, and any job that will reliably complete in under 30 seconds should run synchronously with the request when JOBS_SAME_PROCESS is on
> • The kue implementation would need in-application guards against memory exhaustion -- it shouldn't be possible for an instance to accidentally lose (or be unable to send/receive) messages because a file that is too large was uploaded to the Redis instance.
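If someone does take the kue route, the first bullet above could be satisfied by pointing kue at its own Redis URL, roughly like this (`KUE_REDIS_URL` is a hypothetical env var, not an existing Spoke setting):

```js
const kue = require('kue')

// Give kue a dedicated Redis instance so large job payloads can't evict
// the application cache. KUE_REDIS_URL is hypothetical; fall back to the
// shared REDIS_URL only if no dedicated instance is configured.
const queue = kue.createQueue({
  redis: process.env.KUE_REDIS_URL || process.env.REDIS_URL
})
```
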
Frydafly commented 3 years ago

Following up here since this issue is 3 years old -- is this still a priority? Have there been any updates?

schuyler1d commented 3 years ago

I think with src/extensions/job-runners/ there's still room for someone to implement this with kue, but they would now be directed to do it in that extension space, so I think we can close this.