You cannot use stalledInterval 0. Btw, it is not recommended to modify the internal settings of the queue. It only makes sense to do so in very particular cases. What are you trying to achieve?
@manast I had some other issues which turned out to be unrelated to this library, and I was trying to fix them by setting stalledInterval to 0. The reference says:
use 0 for never checking
My application works without modifying the settings now, so you can close the issue or make the issue about updating the docs :) Thanks
ok. Then it is a bug :)
I tried to reproduce this issue without success. Furthermore, I double-checked the code and it is indeed checking that stalledInterval must be larger than zero: https://github.com/OptimalBits/bull/blob/master/lib/queue.js#L909-L917
Ok, I will see if I can reproduce it again in the coming week, otherwise we can probably just close this :)
FYI this error happened to me today as well. I tried setting 0 for the first time, and this was the result. It was 100% reproducible for me. I worked around it by setting my stalledInterval to a ridiculously high number (one year, in milliseconds). It must be a bug. You should update the docs to stop recommending 0 as a valid value. I am on "bull": "^3.10.0".
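For reference, the workaround looked roughly like this (a sketch only; the queue name and the one-year value are illustrative, not a recommendation):

const Bull = require('bull');

const queue = new Bull('my-queue', {
  settings: {
    // effectively "never" check for stalled jobs: one year in milliseconds
    stalledInterval: 365 * 24 * 60 * 60 * 1000,
  },
});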
UPDATE: Maybe it's because I have two processes--a producer and a consumer--each sharing a queue, and I am trying to set the stalledInterval to be 0 on the consumer, and 2 minutes on the producer.
UPDATE: I now realize this is not how Bull producers/consumers work. The stalledInterval value is stored in Redis as a key, stalled-check. There can only be one, no matter how many producers/consumers are sharing the queue. Only consumers can set this value; producers do not even check for stalled jobs. The first consumer to write the value wins.
For that matter, producers also do not check for or receive 'failed' or 'stalled' events; only consumers can.
A consumer can be defined as a process that has registered a Queue#process() callback/handler. A producer is the one calling Queue#add().
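To make the distinction concrete, here is a minimal sketch (the queue name and payload are made up); each role runs in its own process:

// producer.js -- only calls Queue#add(); it never registers a processor,
// so it neither performs the stalled check nor receives 'stalled'/'failed' events
const Bull = require('bull');
const queue = new Bull('ui-to-ua');
queue.add({ some: 'payload' });

// consumer.js -- registers Queue#process(); it is what writes the stalled-check
// key and receives 'stalled'/'failed' events
const Bull = require('bull');
const queue = new Bull('ui-to-ua');
queue.process(async (job) => {
  // handle job.data here
});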
After extensive testing, I'll share the observations I documented about Bull's quirky behavior, and corrections to the docs, in case it helps anyone else.
const queueOpts = { // QueueOptions
prefix: 'bull',
redis: { /* ... */ },
// don't enable RateLimiter until/unless you understand this:
// https://github.com/OptimalBits/bull/issues/1148
// limiter: { // RateLimiter
// // don't process more than 10 jobs a second;
// // that's ridiculous and likely an error
// // causing flooding of external i/o services
// max: 10,
// duration: 1000,
// bounceBack: true,
// },
settings: { // AdvancedSettings
// Key expiration time for job locks.
// This should be set to roughly 2x the interval at which you want the
// consumer to notify the queue that it is still alive and processing a job.
// Actually, the maximum time it will take for a job to be marked stalled/failed is based on:
// a) is stalled-check defined? if not, wait until next consumer checks and restores it, which begins the stalledInterval countdown
// b) is the job marked as stalled? if not, wait until lockDuration for lock to expire, and up to the maximum stalledInterval for a consumer to notice
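// In other words, once the stalled-check key exists, the worst case is roughly lockDuration + stalledInterval
// (plus however long it takes for a consumer to come online, if none is running at the time).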
// lockDuration: 2*60*1000, // 2m in ms (default: 30sec)
// Interval in ms to renew job lock. This single mechanism is what tracks/prevents stalling.
// If the process is hung/frozen or no longer alive to perform this renewal (done via setInterval), the lock EXPIREs,
// and the job moves to stalled, then to failed (if no retries are left), on the next stalledInterval.
// lockRenewTime: // (default: lockDuration / 2)
// How often check for stalled jobs.
// Where "stalled" is defined as: jobs in active list whose lock has EXPIREd--consumer failed to renew.
// NOTICE: Only the queue .process() thread will perform/handle the stalled check events.
// Meaning at least one other consumer needs to be alive to snitch on and handle for another failed consumer.
// WARNING: The docs say "Use 0 to disable checking" but this is incorrect and will result in an error.
// Instead, if you want to disable, just make the interval a ridiculously long time like 999 years.
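// e.g. (illustrative, per the "999 years" suggestion above -- not a verified recommendation):
// stalledInterval: 999 * 365 * 24 * 60 * 60 * 1000,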
// stalledInterval: 2*60*1000, // (default: 60sec)
// The maximum number of times a job can be restarted before it will be moved to 'failed' state.
// NOTICE: If you set this to 0, it will NOT trigger the 'stalled' event and will NOT retry; it moves the job to the 'failed' state
// and triggers the 'failed' event--the error will say "job stalled more than allowable limit".
// If you set this to 1, it WILL trigger the 'stalled' event and retry once; if it stalls a second time,
// it will behave like 0 above from there.
// NOTICE: If consumer1 dies, consumer2 will find and handle the stalled or failed events.
// It's actually a random lottery among the consumers alive when stalled-check expires as to which one receives the event.
// This also means only one consumer receives a copy of each event--not all consumers.
// NOTICE: If stall check times out but no consumer is alive,
// job will sit in queue until next consumer comes online,
// then that consumer will get the event.
// NOTICE: Only a consumer having registered .process() can receive any events.
// A producer therefore receives no events.
// NOTICE: when job has .timeout: option set, and it times-out, it becomes failed, not stalled.
// NOTICE: the only way the ui producer can learn about job failures is if a surviving ua consumer catches
// and forwards a separate jobFailed message back to the ui.
maxStalledCount: 0, // WARNING: job will be retried if > 0
// Poll interval for delayed jobs and added jobs.
// spread between 5-10 sec (since UA server runs in large cluster)
guardInterval: (5*1000) + Math.ceil(Math.random()*5*1000),
// delay before processing next job in case of internal error.
retryProcessDelay: 60*1000, // one minute
// A timeout for when the queue is in drained state (empty waiting for jobs).
drainDelay: 5,
},
// see also: https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#queueadd
defaultJobOptions: { // JobOpts
// the manual discourages use of this feature,
// because it comes with performance penalties,
// so we won't be using it.
// priority:
// Milliseconds to wait until this job can be processed by a consumer.
// By default we want all messages delivered immediately.
// delay: // (optional; no default)
// Number of times to try until a job completes.
// By default we want only one try, followed by zero retries.
attempts: 1,
// this should be determined by each job, or not used
//repeat:
backoff: {
type: 'exponential',
delay: 60*1000, // one minute
},
// process youngest requests first
lifo: true,
// we allow jobs that are continuing to renew their lock to run forever
// unless a timeout is specifically configured by the end-user, on a per-job basis
// timeout: // (optional; no default)
// this should be determined by each job
// jobId:
// this is convenient but it also means you can't query which job from 'completed' event.
// since we don't even listen for the event, it's fine to let this be automatic.
removeOnComplete: true,
// this would be convenient but it also means you can't query which job from 'failed' event.
// also, when stalls happen, the job gets moved to 'failed' state but does NOT honor removeOnFail in that case.
// so it is easier to let our event handler remove the job.
removeOnFail: false,
stackTraceLimit: 10, // reasonable default
},
};
Decoder ring: UA = consumer, one of many in our system. UI = producer, only one in our system. Each is its own process. This is for the queue for sending from UI->UA.
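For completeness, both processes construct the queue with the options above; roughly (the queue name is illustrative):

const Bull = require('bull');
const queue = new Bull('ui-to-ua', queueOpts);

// consumer (UA) side: since removeOnFail is false, our 'failed' handler
// reports the failure back to the UI and then removes the job itself
queue.on('failed', async (job, err) => {
  // ...forward a jobFailed message to the UI here...
  await job.remove();
});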
I get the same error:
{
"stack": "ReplyError: ERR Error running script (call to f_d4d1d1b62afa6d576d81cfbecc482ad7e516d172): @user_script:30: ERR invalid expire time in set
at Object.moveUnlockedJobsToWait (/path/to/script/node_modules/bull/lib/scripts.js:380:25)
at Queue.moveUnlockedJobsToWait (/path/to/script/node_modules/bull/lib/queue.js:862:6)
at isReady.then (/path/to/script/node_modules/bull/lib/queue.js:789:19)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)",
"message": "ERR Error running script (call to f_d4d1d1b62afa6d576d81cfbecc482ad7e516d172): @user_script:30: ERR invalid expire time in set ",
"command": {}
}
I don't even need to create a producer script to recreate the error message; I have just run the following code:
const Bull = require('bull');
const queue = new Bull("html_pusher", {
  settings: {
    stalledInterval: 0
  },
  redis: { "port": 6379, "host": "10.8.0.12", "db": 3, "showFriendlyErrorStack": true, "password": "xxx" }
});

queue.process(1, (job, done) => {
  console.log(process.pid + ` master executing ${job.id} `, new Date());
  setTimeout(done, 3000);
});

queue.on('error', (err) => {
  console.log(process.pid + ` master error ${err.message}`)
});

queue.on('failed', function (job, err) {
  console.log(process.pid + ` master failed ${job.id} ${err.message}`)
});

queue.on('stalled', function (job) {
  console.log(process.pid + ` master stalled ${job.id}`)
});

queue.on('completed', async (job, result) => {
  console.log(process.pid + ` master completed ${job.id}`)
});
bull version : 3.10.0
Same problem...
ERROR { ReplyError: ERR Error running script (call to f_d4d1d1b62afa6d576d81cfbecc482ad7e516d172): @user_script:80: user_script:80: too many results to unpack
    at parseError (/root/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/root/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:302:14)
  command:
   { name: 'evalsha',
     args: [ 'd4d1d1b62afa6d576d81cfbecc482ad7e516d172', '7',
       'bull:docheap:stalled', 'bull:docheap:wait', 'bull:docheap:active',
       'bull:docheap:failed', 'bull:docheap:stalled-check',
       'bull:docheap:meta-paused', 'bull:docheap:paused',
       '1', 'bull:docheap:', '1563758306417', '30000' ] } }
We've seen this in a few environments, but only on queues with a limiter applied.
@gmcnaught It seems like there is an issue when there are a lot of rate-limited jobs; I will take a look.
Also facing the same issue; our prod server went down. I have migrated our queues to agendajs because of this. Would love to hear back.
Description
Pretty-printed error:
Test code to reproduce
Bull version
3.3.10
Additional information
This came seemingly out of nowhere and happens as soon as I start my application.