@felixchan Hi Felix, this is a fundamental issue in the basic kue architecture: the updates to redis are not transactional, so any failure in the middle of an update leaves a half-applied change in redis (the source of stuck jobs).
My branch of kue https://github.com/dfoody/kue/tree/QoS has a rewrite of the core of kue to be fully transactional. We've been running it in production (~50k jobs per day for ~6 months) with <5 total stuck jobs over that period (we used to get ~1-5 per day).
The downside is that - since it's so different than the baseline kue - it hasn't been merged in a very long time.
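For illustration, here is a minimal sketch of the kind of atomic update described above (my own illustration, not kue's or the QoS branch's actual code; key names are illustrative, not kue's exact schema). With MULTI/EXEC the state change applies fully or not at all, so a crash cannot leave a half-applied change:

```js
var redis = require('redis');
var client = redis.createClient();

// Move a job between state indexes atomically: either every command
// runs or none do, so there is no window for a half-applied change.
function moveToActive(id, cb) {
  client.multi()
    .zrem('q:jobs:inactive', id)            // leave the inactive index
    .zadd('q:jobs:active', Date.now(), id)  // enter the active index
    .hset('q:job:' + id, 'state', 'active') // update the job hash
    .exec(cb);
}
```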
@dfoody Which versions of Node have you run your QoS branch with? I'm entertaining the idea of putting together 'kue2', which should probably have some of the capabilities you've already developed.
@rosskukulinski We still run node 0.6.21 in production. That said, I don't think there will be an issue using it with a newer version of node, other than the Kue UI (since express/connect have changed a lot). So, you can probably take the back-end of Kue from our fork, and the UI from the new version of kue and they should fit together fairly easily.
@dfoody cool, sounds like something for me in the next week or two. I also wonder how many people use the Kue UI. I feel it should really be a separate package and I would probably split it out.
@dfoody @rosskukulinski Keep us updated on what happens. I'd like to stick with Kue.
I realized that after installing redis on a different server AND disabling snapshots/append file, everything works. No problems with this anymore.
Hi @webjay - After some soul searching, we are going to move to a 'real' job system based on AMQP. For us, we've fallen in love with RabbitMQ and will be moving over to it in production very soon. As such, we've dropped development on the Kue package. The lack of leadership and response from the maintainer is frustrating, and the poor code quality is simply too much for a small team like us to take on.
Kue was great as a platform to get us going (especially because we use Redis elsewhere), but as you can see from the PR list -- the project is dead.
cc: @dfoody
We currently have no problems in production (a million jobs/day) using kue on redis 2.6. Some pulls are merged by me in our fork (some bug fixes + some improvement pulls + no breaking API changes), and I'm gonna push it in a few days! I'd like to keep the Kue project up & running. If anybody can help with writing test cases to automate and facilitate pull requests, we would have an active, responsive fork of Kue!
@rosskukulinski Being on top of Redis, with no need for a RabbitMQ-like thing, is the awesomeness of Kue! However, one could write an AMQP backend adapter for Kue!? BUT: Kue is simple enough to be changed and modified, AND it should also remain simple!
@webjay @rosskukulinski @dfoody @felixchan @visionmedia @bulkan https://github.com/LearnBoost/kue/pull/256
I wrote a new job manager that cares about atomicity while still being based on redis. The main feature missing is "priority", but for many this is not important: https://github.com/OptimalBits/bull
We were using mongodb anyway, so switched to https://github.com/scttnlsn/monq last month and haven't had a problem since. Not as full featured as kue, but seems to be pretty solid.
news? :)
@v4l3r10 have you tested the `0.7.x` version?
@behrad this is still happening for us as of 0.7.5. The most recent item in the queue gets stuck as inactive and requires another job to get queued before it runs; the new job then takes over the role of being the stuck job... Restarting the node.js app doesn't force the job to run.
this is still happening for us as of 0.7.5
Can you tell us about a deterministic case or situation in which it happens for you? Or share example code which reproduces it?
Have you also tested the QOS branch?
We have not tested the QOS branch.
Our usage is simple: email jobs created with `jobs.create`, and a worker with `jobs.process` -- the bug seemed to go away and then came back; it's hard to track down -- I've been watching this issue for months. I was using some of your earlier work on Kue (before your big PR got merged in), which seemed to resolve it, but the bug is back now.
My colleague and I have decided to remove Kue asap, considering that the fact that it has no unit tests really makes it a wildcard compared to the rest of our system, which is tested.
Have you changed your deployment? (redis version? is redis local or remote to kue? ...) Is it related to your input traffic!? Some see that this happens when Redis crashes hard; do you get any errors there when it happens!? I am eager to resolve this issue. If you can help me find the point where it happens, we can put a workaround there to help stuck jobs out of the inactive state. How often does it happen?
And about tests: I haven't had enough time to write a complete suite, but I've created an issue to write them. Anyone interested can help improve Kue.
@keyvanfatehi take a look at Bull job manager, https://github.com/OptimalBits/bull - we have used it for months without any problems
@manast we are looking at bull later this week, thanks man, at first glance it looks great.
@behrad yes, I would really like to find out as well. I will do my best to triage Kue a bit before I rip it out of our stack, and I'll update here if I find anything.
Same here guys. I'm testing my sent emails, but sometimes I get a stuck job.
PS: 0.7.5
I'm planning to refactor Kue's built-in `zpop` implementation to redis's atomic `brpoplpush`. I need your help testing it when it's finished.
@behrad that is not so easy, because kue provides a priority queue and `brpoplpush` is for standard lists...
I am thinking of using redis `SORT ... STORE` lists on the fly @manast
Another workaround on top of the current version is to write a simple fixing monitor process for `inactive` job indexes. It can poll every 5 secs, figure out stuck jobs, and fix indexes if necessary (see the sketch below).
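A hedged sketch of such a monitor (my own illustration, not behrad's patch), assuming the key layout described later in this thread: a `q:jobs:[type]:inactive` zset indexing inactive jobs, and a `q:[type]:jobs` placeholder list that workers `blpop` on:

```js
var redis = require('redis');
var client = redis.createClient();
var type = 'email'; // illustrative job type

// Every 5s, compare the inactive index with the placeholder list;
// if placeholders went missing, the workers' blpop never wakes up,
// so push the difference back to un-stick the queue.
setInterval(function () {
  client.zcard('q:jobs:' + type + ':inactive', function (err, indexed) {
    if (err) return console.error(err);
    client.llen('q:' + type + ':jobs', function (err, queued) {
      if (err) return console.error(err);
      for (var i = 0; i < indexed - queued; i++) {
        client.lpush('q:' + type + ':jobs', 1);
      }
    });
  });
}, 5000);
```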
Until we can get into the atomic brpoplpush
+ sorted lists
implementation, I've done minor changes to the sensitive core of worker#getJob
and job.setState
which is available in experimental branch. That code still doesn't contain any watchdogs to fix stuck jobs on the fly, to be able to determine if they still happen with this patch.
Would anybody interested please test it and let me know of results?
git clone -b experimental https://github.com/LearnBoost/kue.git
@brunocasado @keyvanfatehi @v4l3r10 @webjay @scriby ... anybody wanna test that branch!?
I'm seeing exactly @keyvanfatehi's problem.
@behrad do I need to run the experimental branch on all kue clients, or can I get away with doing it on just the ones creating jobs / processing jobs?
@ericz on both, however workers processing jobs are more important. BUT please don't run against the experimental branch, since I've found a bug in it today, which I fixed and will commit into the master branch as 0.7.6 by tomorrow. Please be ready to test in a day or two and lemme know the results with 0.7.6 ;)
KK, will wait on 0.7.6
@ericz @keyvanfatehi I pushed the 0.7.6 pre-release into the `develop` branch. Please test via
git clone -b develop https://github.com/LearnBoost/kue.git
I'm using this workaround to expire my stuck jobs:
```js
// Wrap every handler so a job whose handler never calls done() is
// failed automatically instead of occupying the worker forever.
jobs.process(key, concurrency, handlerWithExpiration(handler));

function handlerWithExpiration (handler) {
  return function (job, done) {
    console.log("Job %s:%s created", job.id, job.type);
    // Fail the job if the handler hasn't finished within 180s.
    var expire = setTimeout(function () {
      done('Automatically Expired after 180s');
    }, 180000);
    handler(job, function (err) {
      clearTimeout(expire); // finished in time: cancel the expiration
      done(err);
    });
  };
}
```
@ericsaboia Are you still having stuck jobs with kue 0.7.7!?
@behrad I'll test again in a few days. Is the bug you found related to the stuck issue?
EDIT: I'll try to do some tests today, maybe a simple stress test with setTimeout. Very sorry for the late answer.
FYI, the QoS branch has a built-in "watchdog" capability that can auto-restart jobs that are stuck.
Of course, the QoS branch isn't really subject to kue causing stuck jobs (since the core was rebuilt for atomicity and consistency versus the kue master) - but the watchdog itself can still help if your jobs themselves have internal issues that cause them to get stuck.
We are talking about jobs being stuck in the `inactive` state, not active jobs. It seems you are talking about active jobs!?
I did a patch containing some small fixes to job state changes and the worker job poll. Those may reduce the probability of jobs being stuck in `inactive`, which is reported to happen with unstable remote redis deployments/hostings.
That will help until we change to `brpoplpush` + some server-side Lua in later versions.
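For context, a minimal sketch of the `brpoplpush` pattern being referred to (illustrative key names; kue would still need the Lua part to respect priorities):

```js
var redis = require('redis');
var client = redis.createClient();

// Atomically move the next job id from the pending list to a
// worker-owned list; the id is never "in flight" outside redis,
// so a crash between pop and push can no longer lose it.
function getNextJob(type, cb) {
  client.brpoplpush('q:' + type + ':jobs', 'q:' + type + ':active', 5, cb);
}
```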
@ericsaboia and about the active jobs: this happens when your worker process encounters unhandled exceptions and exits abnormally. Those jobs, as @dfoody said, can be resolved with a simple watchdog process and removed, or we can add a job `TTL` implementation to Kue.
@behrad I know that the issue is about the inactive state, but sometimes, when my jobs get stuck in an active state, all other inactive jobs of the same type get stuck as well due to the concurrency limit. When it happens, the inactive jobs are still not processed even if I remove the active jobs that had been stuck.
@behrad It would be awesome to see the TTL implementation in Kue! I have hundreds of thousands of jobs being processed every day, and a dozen developers creating new handlers all the time, so it is almost impossible to handle all exceptions to call the `done` callback.
@ericsaboia This problem is down to your app design. You should give them a convention or boilerplate code for developing robust workers. You can use nodejs domains, `uncaughtException` events, or the `error` event in EventEmitter to properly handle errors in Node.js.
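A sketch of such boilerplate using node's domain module (my illustration; `safeProcess` is a hypothetical helper, not a kue API):

```js
var domain = require('domain');

// Wrap each handler in a domain so an uncaught exception still fails
// the job via done(err) instead of leaving it stuck in "active".
function safeProcess(queue, type, concurrency, handler) {
  queue.process(type, concurrency, function (job, done) {
    var d = domain.create();
    d.on('error', function (err) {
      done(err); // the job fails instead of hanging forever
    });
    d.run(function () {
      handler(job, done);
    });
  });
}
```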
To cure jobs stuck w/o calling `done` -- if you are using promises (or bluebird, at least -- I haven't checked if this is promise-spec compliant), you can wrap like:
```coffee
jobs.constructor.prototype.processJob = (name, proc) ->
  jobs.process(name, (job, done) ->
    Promise.method(proc)(job).nodeify(done)
  )
```
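For those not on CoffeeScript, an equivalent plain-JavaScript sketch (bluebird assumed, as above, and the same `jobs` queue object):

```js
var Promise = require('bluebird');

// Promise.method catches synchronous throws from proc, and nodeify
// routes fulfillment/rejection into kue's done callback.
jobs.constructor.prototype.processJob = function (name, proc) {
  jobs.process(name, function (job, done) {
    Promise.method(proc)(job).nodeify(done);
  });
};
```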
Any progress on this issue? I am using kue 0.8.1 and for me a few of the jobs just get stuck in active mode.
@ankitpatial this issue is about jobs stuck in `inactive` mode, not `active`. Check that your handlers always call `done`, otherwise jobs will stay in the `active` state.
@behrad thanks for the quick reply. I see a few tasks just freeze in the active queue; there is no exception in the app log, all seems fine. It doesn't seem to be my code's issue, but I will give it a try and debug my code for any possible hidden crash.
Ensure your `done` is not swallowed by any errors or by your logical code path. If you find something, please create a new issue.
@shaunc that (promises) does not help with handling client's async operation errors.
Just a quick question. Could this be related to an underlying network problem? This may be completely unrelated, but we've recently discovered that EC2 discards some TCP traffic due to a Xen issue.
@Climax777 Yes, your redis connection may be interrupted in the middle of anything...
@behrad Take a look at my blog post concerning the dropped TCP packets; it may help some deployments with issues.
I am using kue@0.8.3; my job maker and worker are the same node process. The job maker generates a job in rand(0, 10) seconds, but I find that sometimes, if the delay time is too short (such as less than 1 sec), the worker will not be invoked immediately. From the UI tool I can see that the stuck jobs are in the Queued column, in the inactive state. When I restart my node process, all the stuck jobs are processed at once.
By using the following method I can work around this problem:
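(The snippet originally posted here did not survive; judging by @behrad's reply below, it presumably involved kue's `promote()` poller, roughly like this guess of a sketch:)

```js
var kue = require('kue');
var jobs = kue.createQueue();

// Check for promotable (delayed) jobs every second
// (a guess at the elided workaround, not the poster's exact code).
jobs.promote(1000);
```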
But still, I found another problem: start my kue process and then restart redis-server; after that, any newly generated job will be stuck until I restart my kue process.
@ganziganzi `promote` should have nothing to do with being stuck in the `inactive` state. Your workers' active jobs may be stuck! Please check whether you have stuck active jobs first; they won't allow other inactive jobs to reach the workers. I should first understand what exactly your problem is, and then you should provide me a code snippet which reproduces the results you describe, so that I can debug it.
Jobs get stuck in the inactive state fairly often for us. We noticed that the length of `q:[type]:jobs` is zero even when there are inactive jobs of that type, so when `getJob` calls `blpop`, there is nothing to process. It looks like this gets set when a job is saved and the state is set to inactive using `lpush q:[type]:jobs 1`. We're wondering if this is failing in some cases, and once the count is off, jobs remain unprocessed. Has anyone else seen this issue?
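A quick way to confirm the mismatch described above (a hedged diagnostic, using the key names from this comment; verify against your kue version):

```js
var redis = require('redis');
var client = redis.createClient();
var type = 'email'; // illustrative job type

// If the inactive index is larger than the placeholder list,
// blpop has nothing to wake up on and jobs sit stuck.
client.zcard('q:jobs:' + type + ':inactive', function (err, indexed) {
  if (err) throw err;
  client.llen('q:' + type + ':jobs', function (err, queued) {
    if (err) throw err;
    console.log('indexed=%d queued=%d mismatch=%d', indexed, queued, indexed - queued);
    client.quit();
  });
});
```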