brianlmoon / GearmanManager

PHP daemon for managing gearman workers
http://brian.moonspot.net/GearmanManager
BSD 2-Clause "Simplified" License

Future Jobs and Merging Duplicate Tasks #126

Closed bartclarkson closed 9 years ago

bartclarkson commented 9 years ago

Hi Brian -

My requirement is a method to stagger creation of a Sum job. I'm struggling to determine a "native" Gearman approach to this, and, if it exists, how GearmanManager plays into it.

Way-simple example: a page gets hit, and an insert goes to a No-SQL db. That insert adds a gearman job to Sum all page views as of a parameterized Now, set to run X minutes in the future. Another insert comes in. This time it doesn't add a gearman job, because one is already scheduled.

Of course, this could generate way more activity against the Gearman job queue servers than is prudent, which leads me toward a more conventional CRON job. The CRON job would create X Gearman jobs to chew through the necessary aggregations in a memory-efficient manner, based on some kind of query against the No-SQL index in question.

If it's not leaving some cool Gearman functionality on the table, I'm going with some version of Option 2, the CRON approach.

Hope you're well. Thanks!

conradjones commented 9 years ago

Hi Bart

I'm trying to figure out what you actually want to do. Do you want to rate-limit the jobs?

bartclarkson commented 9 years ago

Hi @conradjones, thanks for the reply. It's not so much about limiting the rate of a type of job, at least not how I'm presently thinking about it. It's more about limiting the number of times a job is run based on the job_data being passed.

To take the above example a bit further, imagine a single cloud-based SaaS with a bunch of paid Accounts. And the Accounts have usage limits. Say the usage limit is based on total views of a page. You track these page views with an insert to some database. The only question is how you will perform a Sum of the page views, by Account, in an efficient, non-locking way.

My inquiry, then, was: can one create a GearmanJob called, say, "Net_Gearman_Job_UpdateAccountViews", pass it job_data like "{account_id:123}", set the "when_to_run" of this job to fifteen minutes in the future, and then NOT add another job of type "Net_Gearman_Job_UpdateAccountViews" if another one comes in with the same account_id?

Now that I've written it all out, I could almost certainly run a mysql query that asks all those questions before adding the job, since I'm using a mysql-based queue. But given the need to deserialize the object in the job data, I begin to see where this probably fails on the "efficiency" front. Perhaps memcache could perform this check better by avoiding the serialize/deserialize of the job data object.
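
A minimal sketch of that memcache guard, assuming the pecl/gearman and Memcached PHP extensions; the job name "UpdateAccountViews", the key scheme, and the 15-minute TTL are illustrative, not anything provided by GearmanManager:

```php
<?php
// Sketch: use Memcached::add() as a cheap dedup guard so that only the
// first page view per account actually enqueues the aggregation job.
// Names and TTLs here are hypothetical.

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

function schedulePageViewSum($accountId, Memcached $memcached, GearmanClient $client)
{
    $guardKey = 'pending_sum:' . $accountId;

    // add() only succeeds when the key does not already exist, so
    // concurrent requests for the same account will not enqueue
    // duplicate jobs. Set the TTL to roughly the delay you want
    // before the Sum runs.
    if ($memcached->add($guardKey, 1, 900)) {
        $workload = json_encode(array('account_id' => $accountId));

        // Using the account id in the unique id also lets gearmand
        // coalesce identical background jobs that are still queued.
        $client->doBackground('UpdateAccountViews', $workload, 'sum_' . $accountId);
    }
}
```

The nice part of this trade-off is that the hot path never touches the queue table at all; a lost memcache key just means one extra, ideally idempotent, Sum job.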

When I look at the mysql table, I note the "unique_key" value. If an additional field existed, say an optional "merge_key", one could perform my task efficiently by populating the gearman-queue with some mysql-flavor of an UPSERT query, wherein the merge_key in our example would be the account_id.
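
If the jobs really are consumed straight from that MySQL table, the write could be collapsed with an UPSERT along these lines. The table name, the column set, and the unique index across (function_name, unique_key) are assumptions based on the columns mentioned above, and this only makes sense if your workers poll the table rather than gearmand's in-memory queue:

```php
<?php
// Sketch: treat unique_key as the "merge_key" and let MySQL collapse
// duplicate inserts for the same account into a single pending row.
// Table and column names are assumptions, not the stock gearmand schema.

$accountId = 123; // example account

$pdo = new PDO('mysql:host=127.0.0.1;dbname=gearman', 'user', 'pass');

$sql = "INSERT INTO gearman_queue
            (unique_key, function_name, priority, data, when_to_run)
        VALUES
            (:uniq, :func, 0, :data, :run_at)
        ON DUPLICATE KEY UPDATE
            when_to_run = LEAST(when_to_run, VALUES(when_to_run))";

$stmt = $pdo->prepare($sql);
$stmt->execute(array(
    ':uniq'   => 'sum_' . $accountId,   // effectively the merge_key
    ':func'   => 'Net_Gearman_Job_UpdateAccountViews',
    ':data'   => json_encode(array('account_id' => $accountId)),
    ':run_at' => time() + 15 * 60,      // fifteen minutes from now
));
```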

In any event, I think I've pretty much established that one must presently make one's own arrangements to accomplish my specific goal. Please Close this out if I'm not mistaken.

Thanks!

conradjones commented 9 years ago

Certainly that would be outside the scope of Gearman. I think the idea is to keep it as simple as possible, and these kinds of challenges should be handled in the application.

The reason I asked about rate limiting is that it was a challenge I faced with Gearman. I was initially placing the jobs in a MySQL database and had a daemon dispatching them from the MySQL table into the gearman job server. I am now using SharQ as an initial queue to put the jobs in, and a daemon dispatches the jobs from the SharQ queue into the gearman queue (it is a little more complicated than that, as there are actually two SharQ queues due to the distributed nature of the system).

I decided against modifying gearmand and gearman-manager and instead added an additional layer (the dispatching daemons), because as soon as you start modifying things like Gearman it becomes much more difficult to update to the latest version: you have effectively forked it and need to backport the new changes into it, which could become unmanageable very quickly.

SharQ uses Redis; you may find you can modify SharQ to achieve your goal, or at least draw inspiration from it.
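
For reference, a stripped-down version of that dispatching layer might look like the following, assuming the phpredis and pecl/gearman extensions; the list name, the function name, and the missing rate-limit logic are placeholders rather than anything taken from SharQ itself:

```php
<?php
// Sketch of a dispatcher daemon: drain an intermediary Redis list and
// forward each entry to gearmand. Rate limiting or dedup logic would
// sit between the pop and the submit. Names here are illustrative only.

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

while (true) {
    // Block for up to 5 seconds waiting for the next queued payload.
    $item = $redis->brPop(array('pending_jobs'), 5);
    if ($item === false || empty($item)) {
        continue; // timed out; loop again
    }

    // brPop returns array(listName, payload).
    $payload = $item[1];

    // Here you could drop duplicates, throttle per account, etc.
    $client->doBackground('UpdateAccountViews', $payload);
}
```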

bartclarkson commented 9 years ago

Cool, @conradjones! So yeah, you nailed it. Thank you for describing an interesting way to add an intermediary queue. I'm not familiar with SharQ, but I am considering daemonizing a command on top of an Elasticsearch index, which sounds really similar in nature to what you're describing.

I've elected, for the present, to back off from an event-driven approach and use a crontab to spin off gearman jobs that divide up my pool of accounts and do a SUM starting at the billing cycle start date of each account. The way I figure it, such a job is necessary anyway to ensure the integrity of my counts for billing. But it would be nice to add a proper daemon later for faster, floating counts.
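
For what it's worth, that cron-driven fan-out can stay quite small. A sketch, assuming the pecl/gearman extension and an accounts table; every table, column, and function name here is purely illustrative:

```php
<?php
// Sketch of the cron script: split accounts into batches and hand each
// batch to a background worker, which would then SUM page views from
// each account's billing cycle start date.

$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'pass');

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

$accountIds = $pdo->query('SELECT id FROM accounts WHERE active = 1')
                  ->fetchAll(PDO::FETCH_COLUMN);

// One Gearman job per chunk keeps each worker's memory footprint small.
foreach (array_chunk($accountIds, 100) as $batch) {
    $client->doBackground('UpdateAccountViews', json_encode(array(
        'account_ids' => $batch,
    )));
}
```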

Have an awesome day.