edgurgel / verk

A job processing system that just verks! 🧛
https://hex.pm/packages/verk
MIT License

Batch processing? #196

Closed. heyitsjames closed this issue 4 years ago

heyitsjames commented 4 years ago

I've been using Verk for the past year with great success. However, I have a current project that requires jobs to be processed in batches of 500, with a timeout of, say, 5 seconds. This means that my process will wait until it has 500 jobs or until 5 seconds have passed, whichever comes first, and will then process whatever it has collected.

I'm wondering what people cleverer than I am would do to solve this problem. Is there any way in Verk to chunk jobs into batches, or should I be looking at other libraries? I'm considering GenStage and ETS/DETS as possible solutions, but I like the "a job is never only in memory" guarantee that Verk affords me, and I'd like to stick to that ethos. If there's anything in Verk itself that can do this, I'd love to know; if not, and anyone has solved a similar problem, it would be lovely to hear how. Thank you in advance.

edgurgel commented 4 years ago

Hey @heyitsjames,

Maybe you could try writing workers that "synchronise" their work by talking to a single process that accumulates jobs in an ETS table?

This single process would simply own the ETS table.

Once 500 is reached (your workers add jobs directly to this ETS table, so they can find out whether 500 was reached) or the time is up (5 seconds), execute the work and reply to all the workers that are waiting. Each job is then done once its batch has been processed.

There are issues with this implementation: if you reach, say, 1500 jobs very quickly, you may have a bottleneck.
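The accumulator idea above can be sketched in a language-neutral way. This is a minimal Python sketch, not Verk code and not anything from the thread: `BatchAccumulator`, `submit` and `process_batch` are hypothetical names, a plain in-memory list stands in for the ETS table, and threads stand in for worker processes. Each caller blocks in `submit` until the batch containing its item has been processed, either because the size limit was reached or because the timer fired.

```python
import threading


class BatchAccumulator:
    """Hypothetical accumulator: workers submit items and wait for their batch."""

    def __init__(self, process_batch, max_size=500, timeout=5.0):
        self._process_batch = process_batch  # callable taking a list of items
        self._max_size = max_size
        self._timeout = timeout
        self._lock = threading.Lock()
        self._pending = []      # (item, done_event, outcome) per waiting worker
        self._timer = None      # flush timer for the batch being collected
        self._generation = 0    # incremented on every flush to spot stale timers

    def submit(self, item):
        """Add an item, then block until the batch containing it was processed."""
        done = threading.Event()
        outcome = {"error": None}
        batch = None
        with self._lock:
            self._pending.append((item, done, outcome))
            if len(self._pending) >= self._max_size:
                batch = self._take_batch()
            elif self._timer is None:
                # First item of a new batch: start the timeout clock.
                self._timer = threading.Timer(
                    self._timeout, self._on_timeout, args=(self._generation,)
                )
                self._timer.daemon = True
                self._timer.start()
        if batch is not None:
            self._run(batch)  # outside the lock, so new items can keep arriving
        done.wait()
        if outcome["error"] is not None:
            raise outcome["error"]  # lets the caller fail, and so be retried

    def _take_batch(self):
        # Must be called with the lock held.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        self._generation += 1
        batch, self._pending = self._pending, []
        return batch

    def _on_timeout(self, generation):
        with self._lock:
            if generation != self._generation:
                return  # stale timer: that batch was already flushed by size
            batch = self._take_batch()
        self._run(batch)

    def _run(self, batch):
        error = None
        try:
            self._process_batch([item for item, _, _ in batch])
        except Exception as exc:
            error = exc
        # Wake every waiting worker with the shared outcome of the batch.
        for _, done, outcome in batch:
            outcome["error"] = error
            done.set()
```

Every `submit` call goes through the one lock, which is where a burst of jobs would queue up. The batch itself runs outside the lock, so the next batch can start filling while the previous one is being processed.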

This is not trivial to solve, as there's a lot of coordination involved. Maybe some Redis scripts? 😆 Feel free to look at the existing ones inside Verk 1.* (Verk 2.0 is coming one day, and it uses Redis Streams).

heyitsjames commented 4 years ago

@edgurgel Thank you for the response!

As an update, I ended up using a Verk queue with a max of 500 jobs, and then using GenStateMachine to handle batching and timeout/flush operations.

It ended up working quite nicely: it is very performant, and it scales by adding more queues and linking each one to its own batcher.

Each of my worker processes sits in a receive block waiting for the batcher to send it a success or failure message, which lets me keep using all the retry/requeue niceties that Verk affords me.
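The batcher described above was not posted in the thread, so the following is only a rough Python sketch of its general shape, under several assumptions: the two states (`idle`, `collecting`), the `Batcher` class, the `tick` method and the reply callbacks are all hypothetical, and none of this is GenStateMachine or Verk API. Time is passed in explicitly to keep the sketch deterministic, where a real implementation would use a timer. The reply callback stands in for sending a message to a worker's receive block.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Batcher:
    """Hypothetical two-state batcher: 'idle' until a job arrives, then 'collecting'."""

    # Processes a list of job ids and reports, per id, whether it succeeded.
    handle_batch: Callable[[List[str]], Dict[str, bool]]
    max_size: int = 500
    timeout: float = 5.0
    state: str = "idle"
    deadline: float = 0.0
    waiting: List[Tuple[str, Callable[[str], None]]] = field(default_factory=list)

    def add(self, job_id: str, reply: Callable[[str], None], now: float) -> None:
        """A worker hands over its job and a way to be told the outcome."""
        if self.state == "idle":
            # The first job of a batch starts the timeout clock.
            self.state = "collecting"
            self.deadline = now + self.timeout
        self.waiting.append((job_id, reply))
        if len(self.waiting) >= self.max_size:
            self._flush()

    def tick(self, now: float) -> None:
        """Stand-in for a timer event: flush if the deadline has passed."""
        if self.state == "collecting" and now >= self.deadline:
            self._flush()

    def _flush(self) -> None:
        batch, self.waiting = self.waiting, []
        self.state = "idle"
        results = self.handle_batch([job_id for job_id, _ in batch])
        # Tell each waiting worker whether its own job succeeded.
        for job_id, reply in batch:
            reply("ok" if results.get(job_id, False) else "error")
```

A worker that receives `"error"` would then fail its job (for example by raising), which is what hands the job back to the queue's normal retry handling.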

It honestly doesn't seem like a stretch to add this to Verk itself, and I may look into it if I have time. Instead of perform, there could be a batch_perform behaviour with timeout options.

Closing this now, if that's ok! Thanks for your help.