Automattic / kue

Kue is a priority job queue backed by redis, built for node.js.
http://automattic.github.io/kue
MIT License

Jobs stuck in inactive state #130

Closed: mikemoser closed this issue 10 years ago

mikemoser commented 12 years ago

Jobs get stuck in the inactive state fairly often for us. We noticed that the length of q:[type]:jobs is zero, even when there are inactive jobs of that type, so when getJob calls blpop, there is nothing to process.

It looks like this list gets populated when a job is saved and its state is set to inactive, via lpush q:[type]:jobs 1. We're wondering if this lpush is failing in some cases; once the count is off, jobs remain unprocessed.
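
For illustration, here is a minimal sketch of the two-key pattern described above (not Kue's actual source; the node_redis client usage and the email job type are assumptions):

var redis = require('redis');
var client = redis.createClient();

function saveInactiveJob(type, id, cb) {
  // Index the job as inactive, then push a wake-up token for workers.
  client.zadd('q:jobs:inactive', 0, id, function (err) {
    if (err) return cb(err);
    // If this second command fails (e.g. a dropped connection), the job
    // is indexed as inactive but no worker is ever woken up for it.
    client.lpush('q:' + type + ':jobs', 1, cb);
  });
}

function getJob(type, cb) {
  // Workers block here; with the token list empty, blpop never returns,
  // even though inactive jobs of this type still exist.
  client.blpop('q:' + type + ':jobs', 0, cb);
}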

Has anyone else seen this issue?

dfoody commented 11 years ago

@felixchan Hi Felix, this is a fundamental issue in the basic kue architecture: the updates to redis are not transactional, so any failure in the middle of an update leaves a half-applied change in redis (the source of stuck jobs).

My branch of kue, https://github.com/dfoody/kue/tree/QoS, has a rewrite of the core of kue to be fully transactional. We've been running it in production (~50k jobs per day for ~6 months) with <5 total stuck jobs over that period (we used to get ~1-5 per day).

The downside is that, since it's so different from the baseline kue, it hasn't been merged upstream in a very long time.

rosskukulinski commented 11 years ago

@dfoody Which versions of Node have you run your QoS branch with? I'm entertaining the idea of putting together 'kue2', which should probably have some of the capabilities you've already developed.

dfoody commented 11 years ago

@rosskukulinski We still run node 0.6.21 in production. That said, I don't think there will be an issue using it with a newer version of node, other than the Kue UI (since express/connect have changed a lot). So you can probably take the back-end of Kue from our fork and the UI from the new version of kue, and they should fit together fairly easily.

rosskukulinski commented 11 years ago

@dfoody cool, sounds like something for me to look at in the next week or two. I also wonder how many people use the Kue UI; I feel it should really be a separate package, and I would probably split it out.

webjay commented 11 years ago

@dfoody @rosskukulinski Keep us updated on what happens. I'd like to stick with Kue.

felixchan commented 11 years ago

I realized that after installing redis on a different server AND disabling snapshots/the append-only file, everything works. No problems with this anymore.
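
For reference, the redis.conf settings corresponding to "disabling snapshots/append file" would be something like this (assuming an otherwise default config):

save ""
appendonly no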


rosskukulinski commented 11 years ago

Hi @webjay - After some soul-searching, we are going to be moving to a 'real' job system based on AMQP. We've fallen in love with RabbitMQ and will be moving over to it in production very soon. As such, we've dropped development on the Kue package. The lack of leadership and response from the maintainer is frustrating, and the poor code quality is simply too much for a small team like ours to take on.

Kue was great as a platform to get us going (especially because we use Redis elsewhere), but as you can see from the PR list, the project is dead.

cc: @dfoody

behrad commented 11 years ago

We currently have no problems in production (a million jobs/day) using kue on redis 2.6. I've merged some pulls in our fork (some bug fixes + some improvement pulls + no breaking API changes), and I'm gonna push it in a few days! I'd like to keep the Kue project up & running. If anybody can help with writing test cases to automate and facilitate pull requests, we can have an active, responsive fork of Kue!

@rosskukulinski Being on top of Redis, with no need for a RabbitMQ-like thing, is the awesomeness of Kue! However, one could write an AMQP backend adapter for Kue!? BUT: Kue is simple enough to be changed and modified, AND it should also remain simple!

behrad commented 11 years ago

@webjay @rosskukulinski @dfoody @felixchan @visionmedia @bulkan https://github.com/LearnBoost/kue/pull/256

manast commented 11 years ago

I wrote a new job manager that cares about atomicity while still being based on redis. The main feature missing is "priority", but for many this is not important: https://github.com/OptimalBits/bull

scriby commented 10 years ago

We were using mongodb anyway, so we switched to https://github.com/scttnlsn/monq last month and haven't had a problem since. It's not as full-featured as kue, but it seems to be pretty solid.

v4l3r10 commented 10 years ago

news? :)

behrad commented 10 years ago

@v4l3r10 have you tested the `0.7.x` version?

kfatehi commented 10 years ago

@behrad this is still happening for us as of 0.7.5. The most recent item in the queue gets stuck as inactive, and it requires another job to be queued before it runs; the newly queued job then takes over the role of the stuck job... Restarting the node.js app doesn't force the job to run.

behrad commented 10 years ago

> this is still happening for us as of 0.7.5

Can you describe a deterministic case or situation in which it happens for you, or share example code that reproduces it?

Have you also tested the QoS branch?

kfatehi commented 10 years ago

We have not tested the QoS branch.

Our usage is simple: email jobs created with jobs.create, and a worker with jobs.process. The bug seemed to go away and then came back; it's hard to track down. I've been watching this issue for months. I was using some of your earlier work on Kue (before your big PR got merged in), which seemed to resolve it, but the bug is back now.
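
For context, that usage is just the standard Kue pattern, roughly like the following (sendEmail is a hypothetical handler):

var kue = require('kue');
var jobs = kue.createQueue();

// producer
jobs.create('email', { to: 'user@example.com', subject: 'hi' }).save();

// worker
jobs.process('email', function (job, done) {
  sendEmail(job.data, done);
});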

My colleague and I have decided to remove Kue asap, considering that the fact it has no unit tests really makes it a wildcard compared to the rest of our system, which is tested.

behrad commented 10 years ago

Have you changed your deployment (redis version? is redis local or remote to kue? ...)? Is it related to your input traffic? Some see this happen when Redis crashes hard; did you get any errors when it happened? I am eager to resolve this issue. If you can help me find the point where it happens, we can put a workaround there to help stuck jobs out of the inactive state. How often does it happen?

And about tests: I haven't had enough time to write a complete suite, but I've created an issue to write them. Anyone interested can help improve Kue.

manast commented 10 years ago

@keyvanfatehi take a look at the Bull job manager, https://github.com/OptimalBits/bull; we have used it for months without any problems

kfatehi commented 10 years ago

@manast we are looking at bull later this week, thanks man, at first glance it looks great.

@behrad yes, I would really like to find out as well. I will do my best to triage Kue a bit before I rip it out of our stack, and I'll update here if I find anything.

brunocasado commented 10 years ago

Same here, guys. I'm testing my sent emails, but sometimes I get a stuck job.

PS: 0.7.5

behrad commented 10 years ago

I'm planning to refactor Kue's built-in zpop implementation to use redis's atomic brpoplpush. I'll need your help testing it when it's finished.
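
For reference, the "reliable queue" pattern behind brpoplpush looks roughly like this (a node_redis sketch, assuming the list holds job ids rather than Kue's wake-up tokens): the pop and the push onto a processing list happen as one atomic command, so a crash in between cannot lose the item.

client.brpoplpush('q:email:jobs', 'q:email:processing', 0, function (err, id) {
  // ... process job `id`, then clear it from the processing list:
  client.lrem('q:email:processing', 1, id, function () {});
});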

manast commented 10 years ago

@behrad that is not so easy, because kue provides a priority queue, and brpoplpush is for standard lists...

behrad commented 10 years ago

I am thinking of using redis SORT ... STORE to build sorted lists on the fly, @manast

behrad commented 10 years ago

Another workaround, on top of the current version, is to write a simple fixing-monitor process for the inactive job indexes. It could poll every 5 seconds, detect stuck jobs, and fix the indexes if necessary.
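
A rough sketch of such a monitor (the q:jobs:email:inactive and q:email:jobs key names follow the layout discussed in this thread, but treat them, and the fixed email type, as assumptions for your Kue version):

var redis = require('redis');
var client = redis.createClient();

setInterval(function () {
  client.zcard('q:jobs:email:inactive', function (err, total) {
    if (err) return;
    client.llen('q:email:jobs', function (err, len) {
      if (err) return;
      // Fewer wake-up tokens than inactive jobs means stuck jobs:
      // push the missing tokens so blocked workers wake up.
      for (var i = len; i < total; i++) client.lpush('q:email:jobs', 1);
    });
  });
}, 5000);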

behrad commented 10 years ago

Until we can get to the atomic brpoplpush + sorted-lists implementation, I've made minor changes to the sensitive core of worker#getJob and job.setState, available in the experimental branch. That code deliberately doesn't contain any watchdogs to fix stuck jobs on the fly, so we can determine whether they still happen with this patch.

Would anybody interested please test it and let me know the results?

git clone -b experimental https://github.com/LearnBoost/kue.git

behrad commented 10 years ago

@brunocasado @keyvanfatehi @v4l3r10 @webjay @scriby ... anybody wanna test that branch!?

ericz commented 10 years ago

I'm seeing exactly @keyvanfatehi's problem.

@behrad do I need to run the experimental branch on all kue clients, or can I get away with running it on just the ones creating jobs or processing jobs?

behrad commented 10 years ago

@ericz on both, though the workers processing jobs matter more. BUT please don't run against the experimental branch, since I found a bug in it today, which I've fixed and will commit to the master branch as 0.7.6 by tomorrow. Please be ready to test in a day or two and lemme know the results with 0.7.6 ;)

ericz commented 10 years ago

KK, will wait on 0.7.6


behrad commented 10 years ago

@ericz @keyvanfatehi I pushed the 0.7.6 pre-release to the develop branch. Please test via git clone -b develop https://github.com/LearnBoost/kue.git

ericsaboia commented 10 years ago

I'm using this workaround to expire my stuck jobs:

jobs.process(key, concurrency, handlerWithExpiration(handler));

function handlerWithExpiration (handler) {
  return function (job, done) {
    console.log("Job %s:%s started", job.id, job.type);
    var expired = false;

    // Fail the job if the handler hasn't called back within 180s.
    var expire = setTimeout(function () {
      expired = true;
      done('Automatically expired after 180s');
    }, 180000);

    handler(job, function (err) {
      // Guard so done() is never called twice (once by the timeout,
      // once by a late handler callback).
      if (expired) return;
      clearTimeout(expire);
      done(err);
    });
  };
}

behrad commented 10 years ago

@ericsaboia are you still having stuck jobs with kue 0.7.7?

brunocasado commented 10 years ago

@behrad I'll test again in a few days. Is the bug you found related to the stuck jobs?

EDIT: I'll try to do some tests today, maybe a simple stress test with setTimeout. Very sorry for the late answer.

dfoody commented 10 years ago

FYI, the QoS branch has a built-in "watchdog" capability that can auto-restart jobs that are stuck.

Of course, the QoS branch isn't really subject to kue causing stuck jobs (since the core was rebuilt for atomicity and consistency versus the kue master), but the watchdog itself can still help if your jobs themselves have internal issues that cause them to get stuck.

behrad commented 10 years ago

We are talking about jobs stuck in the inactive state, not active jobs. It seems you are talking about active jobs!?

I made a patch containing some small fixes to job state changes and the worker job poll. Those may reduce the probability of jobs getting stuck in the inactive state, which is reported to happen with unstable remote redis deployments/hostings. That will help until we change to brpoplpush + some server-side lua in later versions.
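
As a sketch of what "server-side lua" could look like here (hypothetical, reusing the key names from the monitor sketch above): a single EVAL moves a job id from the inactive index to the worker list, so no client failure can leave the two keys out of sync.

var script =
  "local id = redis.call('zrange', KEYS[1], 0, 0)[1] " +
  "if id then " +
  "  redis.call('zrem', KEYS[1], id) " +
  "  redis.call('lpush', KEYS[2], id) " +
  "end " +
  "return id";

client.eval(script, 2, 'q:jobs:email:inactive', 'q:email:jobs', function (err, id) {
  // id is the popped job id, or null when the queue is empty
});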

behrad commented 10 years ago

@ericsaboia and about the active jobs: this happens when your worker process encounters un-handled exceptions and exits abnormally. These, as @dfoody said, can be resolved and removed with a simple watchdog process, or we could add a job TTL implementation to Kue.

ericsaboia commented 10 years ago

@behrad I know this issue is about the inactive state, but sometimes, when my jobs get stuck in an active state, all other inactive jobs of the same type get stuck as well due to the concurrency limit. When that happens, the inactive jobs are still not processed even if I remove the active jobs that had been stuck.

ericsaboia commented 10 years ago

@behrad It would be awesome to see a TTL implementation in Kue! I have hundreds of thousands of jobs being processed every day, and a dozen developers creating new handlers all the time, so it is almost impossible to handle all exceptions and call the done callback.

behrad commented 10 years ago

@ericsaboia This problem is in your app design. You should give them a convention or boilerplate code for developing robust workers. You can use node.js domains, uncaughtException events, or the error event on EventEmitter to properly handle errors in node.js.
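
A minimal sketch of that convention using node.js domains (era-appropriate, though deprecated in later node versions; sendEmail is a hypothetical handler):

var kue = require('kue');
var domain = require('domain');
var jobs = kue.createQueue();

jobs.process('email', function (job, done) {
  var d = domain.create();
  d.on('error', function (err) {
    // An exception thrown anywhere in the handler lands here, so done()
    // is still called and the job doesn't sit in the active state forever.
    done(err);
  });
  d.run(function () {
    sendEmail(job.data, done);
  });
});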

shaunc commented 10 years ago

To cure jobs stuck without calling "done": if you are using promises (bluebird, at least; I haven't checked whether this is promise-spec compliant), you can wrap like:

jobs.constructor.prototype.processJob = (name, proc)->
  jobs.process(name, (job, done)->
    Promise.method(proc)(job).nodeify(done)
  )

ankitpatial commented 10 years ago

Any progress on this issue? I am using kue 0.8.1, and for me a few of the jobs just get stuck in the active state.

behrad commented 10 years ago

@ankitpatial

  1. This issue relates to jobs stuck in the inactive state, not active.
  2. Your worker's code is responsible for proper error handling to ensure done is called; otherwise jobs will stay in the active state.
  3. If you think your problem is something else, please create a new issue with details.

ankitpatial commented 10 years ago

@behrad thanks for the quick reply. I see a few tasks just freeze in the active queue; there are no exceptions in the app log, and all seems fine. It doesn't look like an issue in my code, but I will give it a try and debug my code for any possible hidden crash.

behrad commented 10 years ago

Ensure your done call is not swallowed by any errors or by your logical code path. If you find something, please create a new issue.

behrad commented 10 years ago

@shaunc that (promises) does not help with handling errors in the client's async operations.

Climax777 commented 10 years ago

Just a quick question. Could this be related to an underlying network problem? This may be completely unrelated, but we've recently discovered that EC2 discards some TCP traffic due to a Xen issue.

behrad commented 10 years ago

@Climax777 Yes, your redis connection may be interrupted in the middle of anything...

Climax777 commented 10 years ago

@behrad Take a look at my blog post concerning the dropped TCP packets; it may help some deployments with issues.

ganziganzi commented 10 years ago

I am using kue@0.8.3; my job maker and worker are in the same node process. The job maker generates a job every rand(0, 10) seconds, but I find that sometimes, if the delay time is too short (e.g. less than 1 sec), the worker is not invoked immediately. From the UI tool I can see that the stuck jobs are in the Queued column, in the inactive state. When I restart my node process, all the stuck jobs are processed at once.

By using the following method I can work around this problem:

  1. q.promote(1000); // check every second
  2. change my job delay time to rand(2, 10); after this, everything seems to be OK.

But still, I found another problem: if I start my kue process and then restart redis-server, any newly generated job is stuck until I restart my kue process.

behrad commented 10 years ago

@ganziganzi promote should have nothing to do with jobs stuck in the inactive state. Your workers' active jobs may be stuck! Please check whether you have stuck active jobs first; they won't allow other inactive jobs to reach the workers. I first need to understand what exactly your problem is; then please provide a code snippet that reproduces the results you describe, so that I can debug it.