chaps-io / gush

Fast and distributed workflow runner using ActiveJob and Redis

running unique jobs? #27

Closed maia closed 7 years ago

maia commented 8 years ago

This is more of a question than an issue, but after reading the announcement I'm very excited about the option of using gush:

1) Is it possible to enforce that a job only runs once (for a certain amount of time), similar to the "unique jobs" feature in Sidekiq Enterprise?

Here's an example of my use case: my app parses a stream of incoming tweets, and for each url in a tweet it follows the url's redirects (it might be a short url), then parses the html, then searches for images, then downloads those images. If multiple tweets within a batch contain the same url, I do not want to re-query that url within a certain amount of time (if I re-encounter it a day later, then I will happily re-query it).

Currently (with Sidekiq 4.1) race conditions between multiple parallel workers cause the same url to be enqueued more than once. I could certainly record in sql/redis that a url has already been enqueued and check for that when queueing (something like the sketch below), but this adds complexity to the code. Also, this is just one simplified example.
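Roughly, I mean something like this (the key name, the 24-hour TTL, and `ProcessUrlJob` are just placeholders; `SET` with `NX` succeeds only for the first caller, so a url gets enqueued at most once per day):

```ruby
require "digest"
require "redis"

REDIS = Redis.new

# Atomically mark the url as seen for 24 hours; redis-rb's `set` with
# nx: true returns true only if the key did not exist yet.
def enqueue_url_once(url)
  first_seen = REDIS.set("seen_url:#{Digest::SHA1.hexdigest(url)}",
                         1, nx: true, ex: 86_400)
  ProcessUrlJob.perform_later(url) if first_seen
end
```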

What I'm looking for is a way to enqueue the processing of an array of tweets, process them in parallel (including optional follow-up jobs such as parsing the html, extracting keywords, downloading images), and once all tweets and all of these possible sub-routines have completed, follow up with another job. Therefore:

2) Can I design a workflow so that a job runs not merely after all batches of the previous job have completed, but only after all batches of the previous job plus all follow-up jobs of those batches have completed?

(Having each job enqueue follow-up jobs also gives me the advantage that while some workers are still processing tweets, others might be parsing an html document, and others might already be processing images. This is helpful because it avoids first running everything that's CPU-bound and only then everything that's I/O-bound. I'd prefer to spread this out even more, but I'm unsure how.)

Is this a use case where gush will help? Thanks a lot.

Update: it looks as if my second question is related to #26.

pokonski commented 8 years ago

1) What you are describing isn't a use case Gush was created for. You should preprocess that batch of tweets first and extract the unique URLs from them. Only after that, process them in parallel (see the sketch below for one way to structure this).

2) Yes, the second one is related to #26. Additionally, you can look into dynamically adding jobs to the workflow: https://github.com/chaps-io/gush/issues/24#issuecomment-166703156
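For example, something along these lines (an untested sketch, class names are made up) builds the job graph from the deduplicated URLs and runs a final job once all of them finish; `run` returns the enqueued job's identifier, so passing the collected identifiers to `after:` makes the last job wait for all of them:

```ruby
class TweetBatchWorkflow < Gush::Workflow
  def configure(tweets)
    # Deduplicate the URLs up front, before building the job graph
    urls = tweets.flat_map { |tweet| tweet[:urls] }.uniq

    # One job per unique URL; these can run in parallel
    url_jobs = urls.map do |url|
      run ProcessUrlJob, params: { url: url }
    end

    # Runs only after every ProcessUrlJob has finished
    run SummarizeJob, after: url_jobs
  end
end

class ProcessUrlJob < Gush::Job
  def perform
    url = params[:url]
    # follow redirects, parse the html, download images, ...
  end
end

class SummarizeJob < Gush::Job
  def perform
    # runs once the whole batch is done
  end
end

flow = TweetBatchWorkflow.create(tweets)
flow.start!
```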

Do share an example if you manage to work it out with Gush, I'm curious!

pokonski commented 7 years ago

Closing for inactivity