Automattic / kue

Kue is a priority job queue backed by redis, built for node.js.
http://automattic.github.io/kue
MIT License

Jobs stuck in inactive state #130

Closed · mikemoser closed this 9 years ago

mikemoser commented 12 years ago

Jobs get stuck in the inactive state fairly often for us. We noticed that the length of q:[type]:jobs is zero, even when there are inactive jobs of that type, so when getJob calls blpop, there is nothing to process.

It looks like this list is populated when a job is saved and its state is set to inactive, via lpush q:[type]:jobs 1. We're wondering if this call is failing in some cases; once the count is off, jobs remain unprocessed.
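
For anyone who wants to check this on their own queue, here is a minimal sketch of that comparison using node_redis directly (the 'email' type is only an example; the key names follow kue's conventions above):

var redis = require('redis');
var client = redis.createClient();
var type = 'email'; // whichever job type appears stuck

// Length of the list that BLPOP blocks on...
client.llen('q:' + type + ':jobs', function (err, len) {
  if (err) throw err;
  // ...versus the number of jobs actually indexed as inactive.
  client.zcard('q:jobs:' + type + ':inactive', function (err, card) {
    if (err) throw err;
    console.log('signal list: %d, inactive index: %d', len, card);
    // If the second number is larger, BLPOP will never wake up for the difference.
    client.quit();
  });
});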

Has anyone else seen this issue?

sebicas commented 12 years ago

Yes, I am having the same problem... do you have a patch?

mikemoser commented 12 years ago

We do not have a patch yet - as we're not sure of the root cause. Right now we are wondering if there is an intermittent problem with the lpush call below.

Job.prototype.state = function(state){
  var client = this.client;
  this.removeState();
  this._state = state;
  this.set('state', state);
  client.zadd('q:jobs', this._priority, this.id);
  client.zadd('q:jobs:' + state, this._priority, this.id);
  client.zadd('q:jobs:' + this.type + ':' + state, this._priority, this.id);
  //increase available jobs, used by Worker#getJob()
  if ('inactive' == state) client.lpush('q:' + this.type + ':jobs', 1);
  return this;
};

We have added some diagnostics in the job.js constructor to log errors from the client and are waiting for a repro:

this.client.on('error', function (err) {
  console.log('redis job client error ' + err);
});

This may not be the cause; if anyone else has ideas or a patch, we would love to know. Jobs get stuck often for us, so I was surprised that more people have not run into this.

sebicas commented 12 years ago

It is very strange... I am trying to find the cause...

When I call jobs.inactive() I get the job IDs [ '147', '149', '144', '164', '168', '172', '176' ]

But for some reason jobs.process() doesn't see them or process them.

sebicas commented 12 years ago

I am able to reproduce the problem when stopping a worker... tasks "tacked" by the worker will be stuck in the inactive state.

if I run:

jobs.process 'email', 4, (job, done) ->

1 task remains unfinished in the active state and 3 more remain stuck in the inactive state forever.

If I restart the worker, all the other pending tasks are processed, but the ones I mentioned stay stuck forever.

spollack commented 12 years ago

We have just seen this issue as well (job stuck in the inactive state).

mikemoser commented 12 years ago

@sebicas do you have more details on your repro? Not sure what it means to be "tacked" by a worker while still inactive?

sebicas commented 12 years ago

@mikemoser not sure if "tacked" was the right word... what I tried to say is that for some reason the number of stuck tasks is somehow related to the number of simultaneous tasks indicated on the job.

For example if I do:

jobs.process 'email', 4, (job, done) ->

4 tasks will be stuck

jobs.process 'email', 6, (job, done) ->

6 tasks will be stuck

and so on...

dfoody commented 11 years ago

We use kue to manage somewhere between 1k-20k jobs per day and see the same problems. For us it's sometimes once a week, other times multiple times per day.

Unfortunately, the root cause of these issues is likely fundamental to the way kue is written - since changes are applied serially in kue, not as an atomic transaction, any little glitch or crash can cause the items in a job to get partially applied, leading to the need to manually repair "broken" jobs.

We're at the stage where we're deciding whether to rewrite the innards of kue to be more reliable, or whether to move to something else. Any thoughts would be appreciated.

sebicas commented 11 years ago

Unfortunately we are in the same situation as @dfoody :(

tj commented 11 years ago

Should be pretty trivial to make things atomic; I don't have time to look at it right now, but at worst we could use a little lua script. Though even if this portion is fully atomic there's always the chance of something being stuck, e.g. if the process is killed. Really I think the bigger problem is that we need to recover from half-processed jobs.
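
For illustration only, roughly what that could look like (a hypothetical sketch, not anything that exists in kue; it mirrors the zadd/lpush calls from Job.prototype.state above and requires redis >= 2.6 for EVAL):

var redis = require('redis');
var client = redis.createClient();

// All of the index updates for an 'inactive' job in one atomic script, so the
// sorted sets and the BLPOP signal list cannot drift apart.
var script =
  "redis.call('zadd', 'q:jobs', ARGV[1], ARGV[2]) " +
  "redis.call('zadd', 'q:jobs:inactive', ARGV[1], ARGV[2]) " +
  "redis.call('zadd', 'q:jobs:' .. KEYS[1] .. ':inactive', ARGV[1], ARGV[2]) " +
  "redis.call('lpush', 'q:' .. KEYS[1] .. ':jobs', 1) " +
  "return 1";

// KEYS[1] carries the job type, ARGV[1] the priority, ARGV[2] the job id
// (placeholder values below).
client.eval(script, 1, 'email', 0, 42, function (err) {
  if (err) console.error('atomic enqueue failed', err);
  client.quit();
});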

sebicas commented 11 years ago

I agree, besides making things atomic... if the process is killed in the middle of a job's execution, that causes the job to get stuck... @visionmedia any suggestions on how to solve that?

tj commented 11 years ago

Off the top of my head I can't think of any way to really differentiate between an active job and an active job whose proc died. We could possibly tie PIDs into the whole thing, or alternatively just "timeout" those jobs: if one has been active for N minutes and is not complete, kill it and retry.
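
A rough sketch of that "timeout" idea, run outside kue (the 10-minute threshold, the kue.Job.get accessor, and the updated_at field are assumptions about the setup, not something confirmed in this thread):

var kue = require('kue');
var redis = require('redis');
var client = redis.createClient();
var TIMEOUT = 10 * 60 * 1000; // treat jobs active longer than this as dead

function requeueStaleActiveJobs() {
  client.zrange('q:jobs:active', 0, -1, function (err, ids) {
    if (err) return console.error(err);
    ids.forEach(function (id) {
      kue.Job.get(id, function (err, job) {
        if (err || !job) return;
        // updated_at is assumed to be a millisecond timestamp
        if (Date.now() - parseInt(job.updated_at, 10) > TIMEOUT) {
          job.state('inactive'); // push it back so another worker retries it
        }
      });
    });
  });
}

setInterval(requeueStaleActiveJobs, 60 * 1000);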

spollack commented 11 years ago

FYI, there is a pull request here https://github.com/LearnBoost/kue/pull/105 (thanks @dfoody) that incorporates a watchdog to check/fail jobs that have run more than a configurable period of time. We have been using this code successfully in our project.

tj commented 11 years ago

I do think they could be separated a bit; it's a pretty huge patch. I don't think some of that belongs in core, and it takes more time to review really big pull requests that have a larger scope.

sebicas commented 11 years ago

@dfoody just mentioned he still has stuck jobs, so I guess his patch didn't solve the problem completely.

dfoody commented 11 years ago

There are really two separate issues here:

(1) What do you do with jobs that legitimately fail (this is where the watchdog enhancement I put in does work well - as long as you're sure that, when the watchdog fires, it's really failed and not just slow - so set your timeouts appropriately). The only alternative to really know if jobs have died or not is to use something like zookeeper under the covers (which has a nice feature that locks can automatically be released when a process dies).

(2) What happens when kue's data structures get corrupted. This is happening to us a lot right now, due to a combination of factors we believe: we're now doing calls across Amazon availability zones (Redis in a different AZ from the Kue servers - increasing the latency between a series of redis requests) and we're now running significantly more kue servers than we were before. We think it's this combination of factors causing us to see the corruptions much more often. This is where moving to atomic redis operations (with appropriate use of 'watch') will hopefully help.

mikemoser commented 11 years ago

@dfoody - thanks for clarifying. To be clear, this issue represents (2): we have a very basic setup and we see the redis indexes described above get out of sync before a job is ever processed, so jobs just stay in the inactive state. It happens a lot, however we cannot get a consistent repro. Does anyone have a repro?

dfoody commented 11 years ago

@mikemoser given what you describe - does the process that queues the job quit soon after queuing it? If so, that's almost certainly your issue. Not all redis operations have typically completed by the time kue invokes your callback, so you need to make sure the process doesn't die for at least a few seconds to give kue/redis enough time to finish all the operations involved in queuing a job.
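
For a short-lived producer script, a blunt workaround along those lines might look like this (a sketch only; the two-second delay is arbitrary and the job type/data are placeholders):

var kue = require('kue');
var jobs = kue.createQueue();

jobs.create('email', { to: 'user@example.com', subject: 'hi' }).save(function (err) {
  if (err) console.error('save failed', err);
  // Give the remaining redis commands a moment to flush before the process exits.
  setTimeout(function () {
    process.exit(err ? 1 : 0);
  }, 2000);
});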

behrad commented 11 years ago

We've also seen the same issue, and our queuing processes are long-lived ones. Any workarounds? Is this issue finally clarified?

mikemoser commented 11 years ago

@dfoody our worker process is always running, so kue should have all the time it needs to finish the operation of adding a job (e.g. the call at line 447 of kue/lib/jobs.js to increment the index by one for the new job). We are not able to get a consistent repro, so it's proving hard to fix, however we see it happen all the time. I want to reiterate that this issue is about "new" jobs never getting out of the inactive state, not jobs in process that get stuck. Those of you who said you've seen the same behavior: is it new jobs stuck in the inactive state and never getting processed?

dfoody commented 11 years ago

I've not seen the case on our side where jobs get stuck in inactive without something else happening around the same time (e.g. a crash that corrupts the data, AWS having "minor" issues like all of EBS going down, etc.). But when a job does get stuck, we have to manually correct things before it starts moving again.

That said, we're running off my fork, not the original (which has lots of QoS and reliability changes).

One other thing to try: Have you restarted Redis recently? We have seen that sometimes redis does need a restart and that fixes some things.

edwardmsmith commented 11 years ago

We're seeing similar behavior as well.

What we see is that new jobs are stuck in an inactive state.

We have concurrency set to 1, but have a cluster of 4 processes.

Looking at Redis, we currently have two 'inactive' jobs.

When a new job is created, the oldest of the two inactive jobs suddenly gets processed.

So, we have, essentially, the two newest jobs always stuck - until they're displaced by new jobs.

mikemoser commented 11 years ago

There seem to be two causes for new jobs never getting processed and remaining stuck in the inactive state.

  1. Indexes are out of sync (e.g. my original post)
  2. BLPOP not responding when new items are added

@edwardmsmith it sounds like your symptoms are related to cause 1. You can verify this by checking if llen q:[type]:jobs is less than zcard q:jobs:[type]:inactive. We added a watchdog for each job type that checks whether they are out of sync and corrects it by calling lpush q:[type]:jobs 1 for however many inactive jobs are missing from the list used to signal the worker (see the sketch at the end of this comment).

After correcting cause 1, we still noticed jobs stuck in the inactive state. It seems that BLPOP becomes unresponsive for certain job types and those jobs never process, even though the redis indexes look good. We don't have a high volume of jobs for these types, and our theory is that something goes wrong with the redis connection but fails silently, so BLPOP just remains blocking and doesn't process any more jobs of that type. We have to restart our worker process and it starts processing all the jobs properly. Has anyone seen BLPOP exhibit this behavior?

We're considering switching to lpop and adding a setTimeout to throttle the loop, however we'd prefer to keep BLPOP and not add what is essentially a high-frequency polling solution.
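
For reference, a minimal sketch of the index watchdog described above, run alongside the workers with node_redis (the job-type list and the one-minute interval are placeholders):

var redis = require('redis');
var client = redis.createClient();
var types = ['email']; // the job types to keep in sync

setInterval(function () {
  types.forEach(function (type) {
    client.llen('q:' + type + ':jobs', function (err, len) {
      if (err) return console.error(err);
      client.zcard('q:jobs:' + type + ':inactive', function (err, card) {
        if (err) return console.error(err);
        // Push one wake-up token per inactive job the signal list is missing.
        for (var i = len; i < card; i++) {
          client.lpush('q:' + type + ':jobs', 1);
        }
      });
    });
  });
}, 60 * 1000);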

dfoody commented 11 years ago

This might help you.

Here's the rough set of steps we typically follow to repair the various kue issues we see regularly. (Note that some of these apply only to our fork, which has the ability to do "locks" so that jobs for - in our case - a single user are serialized; users are differentiated based on their email address.)

Failed Jobs Not Showing

When there are failed jobs (the sidebar says non-zero) but none show in the list, follow this procedure to repair them:

redis-cli

zrange q:jobs:failed 0 -1

For each, run hget q:job:NUM type until you find one whose 'type' is null (or where no 'type' field shows up). Then run hgetall q:job:NUM to see its data values.

If there is no 'data' json blob, you can't recover - just delete the job as follows (where QUEUE is the specific queue the job was in; if you don't know which, do this for each one):

hset q:job:NUM type bad
zrem q:jobs:QUEUE:failed NUM

That should make the jobs now appear. Then go into the Kue UI and delete the 'bad' job.

If that doesn't work (e.g. it corrupts the failed queue again), here's how to manually delete a job:

zrem q:jobs NUM
zrem q:jobs:failed NUM
del q:job:NUM
del q:job:NUM:state
del q:job:NUM:log

Even if there is a 'data' json blob, other fields might be messed up. It's best to find out what type of job it is and who it applies to (by looking in the log files), do the above procedure, and then kick off a new job (via the admin UI) to replace the corrupt one.

Jobs Staying in Queued

Sometimes, jobs will stay queued and not be allocated to a worker even if one is available. But as soon as another job is queued, one will go out of queued and get processed (though one or more will still be "stuck" in queued).

First, find the queue that's misbehaving. The example below assumes QUEUE is its name.

Find out how many jobs are queued:

llen q:QUEUE:jobs
zrange q:jobs:QUEUE:inactive 0 -1

There are two possible problems here:

1) The numbers don't match between these two commands.
2) The numbers match and are 0 for both, but a job is still showing in the UI.

To solve these problems:

1) Execute the following command as many times as needed to make the numbers the same (e.g. if llen returns 0 and zrange returns 2 items, run it 2 times):

lpush q:QUEUE:jobs 1

2) In this case (they show up in the UI and when you do zrange q:jobs:inactive 0 -1), for each job showing up in the UI but not in the above commands, it could be that the job is actually in a different state, or the entries are invalid. Here's how to check:

hget q:job:NUM state

 If the state is inactive, do the following in this order:
      zadd q:jobs:QUEUE:inactive 0 NUM
      lpush q:QUEUE:jobs 1

 If the state is not inactive, then you should remove it from the inactive list:
      zrem q:jobs:inactive NUM

Jobs Staying in Staged

If jobs for a user stay in staged, and there are no other jobs for that user in inactive, active, or failed, this likely means that a previous job never released the lock correctly. Check if this is the case as follows (given the specific user's email):

get q:lockowners:EMAIL

Assuming this shows a job number, get that job's current state:

hget q:job:NUM state

If its current state is complete, you just need to delete the job and that should get the queue flowing. You may also need to repair the staged queue if it's corrupt after deleting the job:

zrem q:jobs:staged NUM

If you can't get to the specific job, try clearing the completed queue.

If the current state of the job that has the lock is 'staged', then you should move that job directly to 'inactive' manually in the UI (since it already has the lock it can go ahead and be moved to execute).

edwardmsmith commented 11 years ago

@mikemoser - Thanks for the reply - interestingly, I don't have a key (my job type is 'email') q:email:jobs at all:

> keys *email*
1) "q:jobs:email:active"
2) "q:jobs:email:failed"
3) "q:jobs:email:complete"
4) "q:jobs:email:inactive"

So I had two stuck jobs:

> zrange q:jobs:email:inactive 0 -1
1) "6063"
2) "6064"
> lpush q:email:jobs 1
(integer) 1
> lpush q:email:jobs 1
(integer) 1
> zrange q:jobs:email:inactive 0 -1
(empty list or set)
> 

So, that seems to have cleared out the stuck items for now.

@dfoody - Wow, thanks for that!

mikemoser commented 11 years ago

@edwardmsmith looks like your key was empty and it does seem that the indexes were out of sync. You can add a watchdog for each type to check this and correct it like we have.

@dfoody thanks for sharing - looks like y'all are having a lot of issues. We hope this is not a sign of things to come for us as we get more volume through kue. You state only 2 reasons for "Jobs Staying in Queued", however we have seen a third, where the numbers match on the indexes and they are greater than zero. This is where we just see the worker for that type sitting on the BLPOP command even though we are pushing new jobs to the key it's blocking on (e.g. lpush q:[type]:jobs 1). It really seems like BLPOP is just not responding when it should and never throwing an error. I'm not very experienced with redis in a high-volume scenario; is BLPOP prone to this type of behavior? We are using OpenRedis to host, and I'm not sure if the extra network layers would affect this.

dfoody commented 11 years ago

We've not seen issues with BLPOP. But, we're on redis 2.4.x and it looks like OpenRedis is on 2.6.x (and, coincidentally, the latest 2.6 release has a fix for a BLPOP issue…)

We host redis ourselves, and it's entirely possible - if you're not local to your redis server - that that could be the cause of issues (though I've not looked at the underlying redis protocol layer to see how they implement it to know more concretely if that type of thing could be an issue - e.g. does it heartbeat the connection to detect failures, etc.).

mikemoser commented 11 years ago

We're thinking about changing the kue/lib/queue/worker.js getJob() function to no longer use BLPOP and just use LPOP with a setTimeout. Here is a change that we've been testing locally. Any thoughts?


/**
 * Attempt to fetch the next job. 
 *
 * @param {Function} fn
 * @api private
 */

Worker.prototype.getJob = function(fn){
  var self = this;

  // alloc a client for this job type
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // BLPOP indicates we have a new inactive job to process
  // client.blpop('q:' + self.type + ':jobs', 0, function(err, result) {
  //   self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
  //     if (err) return fn(err);
  //     if (!id) return fn();
  //     Job.get(id, fn);
  //   });
  // });

  client.lpop('q:' + self.type + ':jobs', function(err, result) {
    setTimeout(function () {
      self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    }, result ?  0 : self.interval);
  });
};

Jellyfrog commented 11 years ago

Any news on this? Looking to run > 200k jobs/day and need something stable, since it will be pretty much impossible to handle errors/stuck jobs manually.

mikemoser commented 11 years ago

We have tracked down and fixed the cause of BLPOP not responding. There were a few factors in play:

  1. We use OpenRedis and they had some logic that would kill idle connections (not considering blocking operations).
  2. We use a database index greater than 0 (i.e. not the default for a connection), and the reconnect logic in kue did not consider the currently selected database.

So, BLPOP appeared unresponsive because the client had reconnected to the wrong database (i.e. back to index 0). We fixed this by:

  1. OpenRedis made a change to stop killing idle connections for blocking commands.
  2. We added some logic in Kue to ensure the correct database was selected before calling BLPOP again.

kue/lib/queue/worker.js getJob()

/**
 * Attempt to fetch the next job. 
 *
 * @param {Function} fn
 * @api private
 */

Worker.prototype.getJob = function(fn){
  var self = this;

  // alloc a client for this job type
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // FIX: Ensure the correct database is selected
  // Note: We added selected_db when originally connecting 
  client.select(client.selected_db, function () {
    // BLPOP indicates we have a new inactive job to process
    client.blpop('q:' + self.type + ':jobs', 0, function(err) {
      // FIX: BLPOP may return an error and further commands should not be attempted.
      if (err) 
        return fn(err);

      self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    });
  });
};

This is not the best place for this logic; I'm assuming we'd want to make the change in the core reconnect logic and ensure it does not execute the BLPOP until we're sure the database has been selected. However, we have had this fix in place for several weeks and things are looking much better for us.

We continue to run a watchdog to fix the indexes; we're watching to see whether that issue is also related to the selected-database issue on reconnect.

mfrobben commented 11 years ago

@mikemoser BLPOP has a timeout parameter - just set it to a nonzero value and create a long-polling loop rather than use setTimeout: http://redis.io/commands/blpop. How has the watchdog fix been working for you? We're just now running into this in our app and I'm debating whether to fight this bug or move to another library while we're small and have time.
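
For reference, that variant would look roughly like the following inside getJob (a sketch only, reusing the names from the code above; the 30-second timeout is arbitrary, and it assumes the surrounding worker loop calls getJob again whenever fn is invoked without a job):

// Same shape as the BLPOP call above, but with a finite timeout so the worker
// wakes up periodically even if a wake-up signal was lost.
client.blpop('q:' + self.type + ':jobs', 30, function (err, result) {
  if (err) return fn(err);
  if (!result) return fn(); // timed out with nothing to do; loop around
  self.zpop('q:jobs:' + self.type + ':inactive', function (err, id) {
    if (err) return fn(err);
    if (!id) return fn();
    Job.get(id, fn);
  });
});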

mikemoser commented 11 years ago

@mfrobben we're back to using blpop as a long poll (see the latest code above). We're still keeping an eye on the watchdog; we have some diagnostics that log when the index is off and needs correction, and we see that fire from time to time, however we still see jobs not processing every once in a while and we're not sure why just yet. We are using a non-default database on redis and that was the reason BLPOP seemed not to be working correctly; we are investigating whether that's why the indexes are sometimes out of whack as well. Our kue is working much better now, however we still can't sleep well at night knowing that some jobs could get stuck at any time.

Can you describe the problems you're seeing?

mfrobben commented 11 years ago

Same inactive-jobs problem this thread was first created about. A job will be listed as 'inactive' but the BLPOP either never fired or the server crashed, so the worker doesn't do anything until the next job comes in. Running lpush clears the queue, so I was thinking the watchdog makes enough sense as a solution.

We process shipments, email, push notifications, and payment transactions in background workers. Email and push are meh, but shipping and payment are ultra-critical codepaths, and having things invisibly 'stuck' creates unacceptable risk for our business IMO. I was under the impression that the persistence and replay functionality of redis guaranteed recoverability for jobs after a crash, but given this bug, kue seems to be causing more issues in those situations than it solves. :|

This is our 3rd worker queue implementation in node. The only one that has worked bug-free so far is atomic database collection polling.

spollack commented 11 years ago

FWIW, and i'm sad to say this, we are giving up on kue because of these problems and moving to a postgres-backed queue.

mfrobben commented 11 years ago

@spollack My gut tells me the same. Ahh, the heartburn of being bleeding edge :\

dfoody commented 11 years ago

We're taking a different route at the moment. We're gradually rewriting some of the kue internals to be transactional. But, it's not far enough along yet to be sharable.

mikemoser commented 11 years ago

Is anyone else using a non-default db on redis with their kue? I ask because if you are still seeing indexes get out of sync while using the default zero database, we can likely rule that out as the cause. My current thinking is that they are not related; I'm just interested to get some extra data from the group.

Jellyfrog commented 11 years ago

@spollack @mfrobben Me too. I'm trying to process >1 million jobs/day and unfortunately Kue is too unreliable. I really do like Kue, and I hope in the future it will be more bulletproof.

@spollack you don't happen to have some solution to share?

spollack commented 11 years ago

@Jellyfrog i'm finishing the implementation now -- i'm happy to share as soon as i get a bit more testing on it.

Jellyfrog commented 11 years ago

Great!

spollack commented 11 years ago

@Jellyfrog i've created gists with the key pieces of our postgres-backed queue, so you can get a sense of the direction we went.

https://gist.github.com/spollack/5000509 -- db schema
https://gist.github.com/spollack/5000516 -- queue implementation
https://gist.github.com/spollack/5000522 -- queue tests

The code was written relying on a bunch of other pieces of our environment (including streamline, the gesundheit/any-db stack, etc.) and so likely will not be directly applicable without some adaptation.

It's still quite new, but so far is looking good in testing, and we'll be rolling it into production soon. Feedback welcome on the design.

If there is sufficient interest, i can work to factor out some of the internal dependencies and set it up as a real project/package.

Jellyfrog commented 11 years ago

Looks promising! At least you should add it as a repo on your profile, feels bad to discuss it here!

spollack commented 11 years ago

@Jellyfrog good idea, i've set up a repo at https://github.com/spollack/pgqueue with the sample code (a bit cleaned up), and i've taken down the gists from before.

dfoody commented 11 years ago

For anyone that's interested, we've published our changes to make the internals of kue transactional (or at least as close as you can get with redis) at https://github.com/dfoody/kue/tree/QoS. A couple of notes on using it:

PerfectedApp commented 11 years ago

We thought that we had this problem, but in fact did not understand the limits of Redis. I'm sorry that I cannot give a reference to the actual documentation that states this, but if you have any other communication with Redis on a connection that is doing pub/sub, you may run into problems. Try creating a separate connection for pub/sub that is isolated from any other requests to redis.
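
For anyone wanting to try that, a minimal sketch of the separation with node_redis (the channel name is only an example, not something kue defines):

var redis = require('redis');

// One connection for ordinary commands...
var commands = redis.createClient();
// ...and a dedicated connection that only ever subscribes.
var subscriber = redis.createClient();

subscriber.subscribe('job:events');
subscriber.on('message', function (channel, message) {
  console.log('event on %s: %s', channel, message);
});

// Regular commands stay on their own connection, so they are never rejected
// because the client has been switched into subscriber mode.
commands.lpush('q:email:jobs', 1);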

We've been running with this approach for over a month now without a single problem.

Hope this message helps someone! Cheers

bulkan commented 11 years ago

Any updates further updates on this issue ?

felixchan commented 11 years ago

Any updates here? My indexes are out of sync for no reason. Why are they out of sync?

scriby commented 11 years ago

I also ran into this problem.

splurgy-michi commented 11 years ago

bump

felixchan commented 10 years ago

Anybody got a fix for this? Jobs are randomly stuck in queue.