Automattic / kue

Kue is a priority job queue backed by redis, built for node.js.
http://automattic.github.io/kue
MIT License

How to process multiple jobs in a batch? #930

Open twang2218 opened 8 years ago

twang2218 commented 8 years ago

Assuming we have many jobs in the queue that each insert data into a database: instead of processing each job one by one, which is not very efficient, I would like to grab multiple jobs and insert their data in one go using the database's bulk insert command. How can I do that?

Is there any way to fetch all jobs (or a given number of jobs) from the queue, process them, and then mark them as done? Thanks.

behrad commented 8 years ago

I would put my whole batch inside a single job's data instead of creating a job per data model :)
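For example, a minimal sketch of that idea: build one job payload carrying the whole batch, so the consumer does a single bulk insert. The `buildBatchJob` helper and the `bulk-insert` type name are hypothetical; enqueueing it would use kue's usual `queue.create(type, data).save()` (left as a comment since it needs a live Redis).

```javascript
// Sketch: instead of N jobs with one record each, build one job whose
// data carries the whole batch. The consumer then does one bulk insert.
function buildBatchJob(records) {
  return {
    type: 'bulk-insert',
    data: { records: records, count: records.length }
  };
}

// With kue this payload would be enqueued roughly as:
//   queue.create(job.type, job.data).save();
// and consumed as:
//   queue.process('bulk-insert', (job, done) => { /* db.bulkInsert(job.data.records) */ done(); });
const job = buildBatchJob([{ id: 1 }, { id: 2 }, { id: 3 }]);
```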

twang2218 commented 8 years ago

The data comes from the frontend servers. It might be a user click action to be logged to the database, or sensor data coming from an IoT device. To keep response times low, there is no aggregation in the frontend apps; they simply push the event/sensor data onto the queue, so a backend worker can pull events off the queue in batches and use the database's bulk insert command for better performance.

wiresurfer commented 8 years ago

One approach could be to have two queues, say Events and BatchedEvents. Events receives jobs from the frontend as they are created, and is consumed by a Batch Processor, which implements the batching strategy [take 5, then push to BatchedEvents]. BatchedEvents thus receives packets of 5 events, and its consumers can work on the batches.

So you end up with something like this

Front end ---EventQ---> [Batch Processor] x 1 ----BatchedQ----> [Bulk Consumers] x N

The only thing to keep in mind is that the event stream would be constrained to a single consumer. [Otherwise you would lose any sense of order, batching would be slower (each worker would need to fill up at least 5 events), etc., so the solution would be a no-go.]
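The Batch Processor step in the middle can be sketched independently of kue itself. Below is a minimal, hypothetical `Batcher` that accumulates incoming events and hands off full batches of 5; the kue wiring around it (consuming from Events, pushing to BatchedEvents) is indicated only in comments.

```javascript
// Minimal batcher: collects events and emits them in fixed-size batches.
// `flush` is the callback that would push one batch job onto BatchedEvents,
// e.g. queue.create('batched-events', { events: batch }).save()
class Batcher {
  constructor(size, flush) {
    this.size = size;
    this.flush = flush;
    this.pending = [];
  }

  // Called once per incoming Events job.
  add(event) {
    this.pending.push(event);
    if (this.pending.length >= this.size) {
      this.flush(this.pending.splice(0, this.size));
    }
  }
}

// Example: batches of 5; flushed batches are recorded here.
const flushed = [];
const batcher = new Batcher(5, batch => flushed.push(batch));
for (let i = 0; i < 12; i++) batcher.add({ id: i });
// 12 events -> two full batches of 5, with 2 events still pending.
```

A real implementation would also flush on a timer, so a partially filled batch doesn't sit forever when traffic is slow.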

kirps commented 8 years ago

Just to add to what @wiresurfer said, I would NOT recommend trying to create any hierarchy of jobs on one kue. I lost weeks trying to manage a batch -> multiple jobs hierarchy by creating a batch-type job plus unit-of-work jobs, all consumed off the same kue. It just doesn't have the data model to handle this.

wiresurfer commented 8 years ago

As @kirps points out, it gets out of hand pretty quickly. More elaborate queue implementations therefore have carefully crafted batched producers/consumers [Kafka et al.], or concepts like lazy queues in the RabbitMQ/AMQP world, which are geared towards low-memory yet long queues and load data only when a consumer needs it. So when large volumes of events are generated, say from an app or user interaction, the events stay on disk and are flushed as and when consumers are free to consume them. This way your UI/app doesn't care about the event getting processed [a perceived swiftness in response, since the operation is a fire-and-forget async op].

@kirps I was curious about the exact nature of what you were trying to solve and the frustrations you faced. I am planning out a project and would love to hear the gotchas.

kirps commented 8 years ago

Sorry @wiresurfer I wasn't monitoring this issue.

My use case is using Kue to create and ultimately execute weekly email jobs. These are summaries of activity in our system over the past week per customer. We send them to each customer stakeholder individually with Mandrill. I'm using Heroku and thus kue instead of something like a cron job.

I originally tried to create a "batch" job per company, which would track the appropriate day for that company to get its emails, sit on the queue, and track progress as it scheduled and completed individual jobs. However, I had a lot of trouble maintaining membership in the batch with the available tools. But the biggest issue was that I could not ensure job uniqueness. Given that it's an email to a person, I don't want it getting sent multiple times if the done function doesn't work or if Heroku happens to restart my dyno at the wrong time.

I ended up just creating each email job individually, bypassing the batch and any additional interesting (but unnecessary) information I might get from that level of hierarchy.

I did have to implement my own job map to (mostly) ensure uniqueness. There's a small chance that this could still fail if the job executed and Heroku restarted at exactly the wrong time, but so far that has not proved to be an issue. You can see some discussion here: #856
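A minimal in-memory version of such a job map might look like the sketch below. The `JobMap` class and its key format are hypothetical; in production you would back the claim with something persistent and atomic (e.g. Redis `SET key value NX`) so the guard survives process restarts.

```javascript
// Deduplication guard: refuse to enqueue an email job whose key
// (e.g. "weekly-summary:customer42:2016-W18") has already been claimed.
class JobMap {
  constructor() {
    this.seen = new Set();
  }

  // Returns true if the job may be enqueued, false if it is a duplicate.
  tryClaim(key) {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

const map = new JobMap();
const first = map.tryClaim('weekly-summary:customer42:2016-W18');
const second = map.tryClaim('weekly-summary:customer42:2016-W18');
// first is true, second is false: the duplicate send is suppressed.
```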

SukantGujar commented 7 years ago

This is a very common use case, and I would love to have kue bake in support for it. One ugly workaround is to store the batches outside of kue until you hit a sufficient batch size/time, and then create the batch job in kue with all the batched data. A refinement of this is to store the batch in the same Redis instance you set up for kue, and then simply pass the Redis batch key reference to the batch job processor as part of the batch job data. This way you don't have to clone the batch data into the batch job.
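The refinement above can be sketched as follows, with a plain object standing in for the shared Redis instance. The `stashBatch`/`processBatchJob` helpers and the `batch:<id>` key naming are assumptions; with real Redis you would RPUSH the events under the key, put only `{ batchKey }` in the kue job data, and LRANGE/DEL in the processor.

```javascript
// The batch data lives under its own key; the kue job carries only that key.
const store = {};                        // stand-in for the shared Redis instance

function stashBatch(batchId, events) {
  const key = 'batch:' + batchId;
  store[key] = events;                   // with Redis: RPUSH key, one entry per event
  return { batchKey: key };              // this small object becomes the kue job data
}

function processBatchJob(jobData) {
  const events = store[jobData.batchKey]; // with Redis: LRANGE key 0 -1
  delete store[jobData.batchKey];         // with Redis: DEL key
  return events;                          // hand off to the bulk insert
}

const jobData = stashBatch(7, [{ id: 'a' }, { id: 'b' }]);
const events = processBatchJob(jobData);
```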