OptimalBits / bull

Premium Queue package for handling distributed jobs and messages in NodeJS.

ERR Error running script (call to f_...): @user_script:30: ERR invalid expire time in set #875

Closed denizdogan closed 6 years ago

denizdogan commented 6 years ago

Description

Pretty-printed error:

{
  "name": "ReplyError",
  "message":
    "ERR Error running script (call to f_d4d1d1b62afa6d576d81cfbecc482ad7e516d172): @user_script:30: ERR invalid expire time in set ",
  "command": {
    "name": "evalsha",
    "args": [
      "d4d1d1b62afa6d576d81cfbecc482ad7e516d172",
      "7",
      "bull:deniz:stalled",
      "bull:deniz:wait",
      "bull:deniz:active",
      "bull:deniz:failed",
      "bull:deniz:stalled-check",
      "bull:deniz:meta-paused",
      "bull:deniz:paused",
      "1",
      "bull:deniz:",
      "1519203661760",
      "0"
    ]
  }
}

Test code to reproduce

const Queue = require('bull');

const queue = new Queue('deniz', process.env.REDIS_URL, {
  settings: {
    stalledInterval: 0
  }
})

Bull version

3.3.10

Additional information

This came seemingly out of nowhere and happens as soon as I start my application.

manast commented 6 years ago

You cannot use stalledInterval 0. Btw, it is not recommended to modify the internal settings of the queue; it only makes sense in very particular cases. What are you trying to achieve?
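
For context, the stalled-check Lua script appears to SET the stalled-check key with a PX expiry equal to stalledInterval (note the final "0" in the evalsha args above), and Redis rejects a non-positive expire time. A minimal sketch of that underlying behaviour, assuming ioredis (the client Bull uses, per the stack traces later in this thread):

const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

// SET with a PX expiry of 0 is rejected by Redis, which is effectively what
// the stalled-check script attempts when stalledInterval is 0
redis
  .set('bull:deniz:stalled-check', Date.now(), 'PX', 0)
  .catch((err) => console.error(err.message)); // ERR invalid expire time in set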

denizdogan commented 6 years ago

@manast I had some other issues which turned out to be unrelated to this library, and I was trying to fix them by setting stalledInterval to 0. The reference says:

use 0 for never checking

My application works without modifying the settings now, so you can close the issue or make the issue about updating the docs :) Thanks

manast commented 6 years ago

ok. Then it is a bug :)

manast commented 6 years ago

I tried to reproduce this issue without success. Furthermore, I double-checked the code, and it is indeed checking that stalledInterval must be larger than zero: https://github.com/OptimalBits/bull/blob/master/lib/queue.js#L909-L917
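
A sketch of the kind of guard those lines perform (assumed shape, not the verbatim bull source):

// assumed shape of the validation referenced above, not verbatim bull source
if (!(this.settings.stalledInterval > 0)) {
  throw new Error('stalledInterval must be greater than zero');
}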

denizdogan commented 6 years ago

Ok, I will see if I can reproduce it again in the coming week, otherwise we can probably just close this :)


ancms2600 commented 5 years ago

FYI, this error happened to me today as well. I tried setting 0 for the first time, and this was the result; it was 100% reproducible for me. I worked around it by setting stalledInterval to a ridiculously high number (one year, in milliseconds). It must be a bug. The docs should stop recommending 0 as a valid value. I am on "bull": "^3.10.0".
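
A minimal sketch of that workaround (queue name and interval are illustrative):

const Queue = require('bull');

// effectively disable the stalled check by using a very large interval
// instead of the documented (but broken) value of 0
const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000;

const queue = new Queue('my-queue', process.env.REDIS_URL, {
  settings: {
    stalledInterval: ONE_YEAR_MS
  }
});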

UPDATE: Maybe it's because I have two processes, a producer and a consumer, each sharing a queue, and I am trying to set the stalledInterval to 0 on the consumer and 2 minutes on the producer.

UPDATE: I now realize this is not how Bull producers/consumers work. The stalledInterval value is stored in Redis under the stalled-check key. There can only be one, no matter how many producers/consumers share the queue. Only consumers set this value; producers do not even check for stalled jobs. The first consumer to write the value wins.

For that matter, producers also do not check for or receive 'failed' or 'stalled' events; only consumers do. A consumer can be defined as a process that has registered a Queue#process() callback/handler; a producer is the one calling Queue#add().
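
In code, the distinction looks something like this (a minimal sketch; queue and job names are hypothetical):

const Queue = require('bull');

// producer process: only adds jobs, and receives no job events
const producer = new Queue('ui-to-ua', process.env.REDIS_URL);
producer.add({ html: '<p>hello</p>' });

// consumer process: registering .process() is what makes it a consumer;
// only consumers participate in stalled checks and receive job events
const consumer = new Queue('ui-to-ua', process.env.REDIS_URL);
consumer.process(async (job) => {
  // ... do the work ...
});
consumer.on('stalled', (job) => console.log(`job ${job.id} stalled`));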

After extensive testing, here are the observations I documented about Bull's quirky behavior, along with corrections to the docs, in case it helps anyone else.

    const queueOpts = { // QueueOptions
        prefix: 'bull',

        redis: { /* ... */ },

        // don't enable RateLimiter until/unless you understand this:
        // https://github.com/OptimalBits/bull/issues/1148
        // limiter: { // RateLimiter
        //     // don't process more than 10 jobs a second;
        //     // that's ridiculous and likely an error
        //     // causing flooding of external i/o services
        //     max: 10,
        //     duration: 1000,
        //     bounceBack: true,
        // },

        settings: { // AdvancedSettings
            // Key expiration time for job locks.
            // This should be set to 2x the interval at which you would like the
            // consumer to notify the queue that it is still alive and processing a job.
            // Actually, the maximum time it will take for a job to be marked stalled/failed is based on:
            // a) is stalled-check defined? if not, wait until next consumer checks and restores it, which begins the stalledInterval countdown
            // b) is the job marked as stalled? if not, wait until lockDuration for lock to expire, and up to the maximum stalledInterval for a consumer to notice
            // lockDuration: 2*60*1000, // 2m in ms (default: 30sec)

            // Interval in ms at which to renew the job lock. This single mechanism is what tracks/prevents stalling.
            // If the process is hung/frozen or no longer alive to perform this setInterval-based lock renewal
            // (so the lock key EXPIREs), then the job moves to stalled, and then to failed (if no retries are left)
            // on the next stalledInterval.
            // lockRenewTime: // (default: lockDuration / 2)

            // How often to check for stalled jobs.
            // Where "stalled" is defined as: jobs in the active list whose lock has EXPIREd (the consumer failed to renew it).
            // NOTICE: Only the queue's .process() thread will perform/handle the stalled-check events.
            //    Meaning at least one other consumer needs to be alive to snitch on, and handle jobs for, a failed consumer.
            // WARNING: The docs say "Use 0 to disable checking" but this is incorrect and will result in an error.
            //     Instead, if you want to disable it, just make the interval ridiculously long, like 999 years.
            // stalledInterval: 2*60*1000, // (default: 60sec)

            // The maximum number of times a job can be restarted before it will be moved to 'failed' state.
            // NOTICE: If you set this to 0, a stall will NOT trigger a 'stalled' event or a retry; the job
            //     moves to the 'failed' state and triggers a 'failed' event, with an error saying
            //     "job stalled more than allowable limit".
            //     If you set this to 1, a stall WILL trigger a 'stalled' event and a retry; if it stalls a
            //     second time, it behaves like 0 above from there.
            // NOTICE: If consumer1 dies, consumer2 will find and handle the stalled or failed events.
            //     It's actually a random lottery among the consumers alive when stalled-check expires
            //     as to which one receives the event. This also means only one consumer receives a copy
            //     of each event, not all consumers.
            // NOTICE: If stall check times out but no consumer is alive,
            //     job will sit in queue until next consumer comes online,
            //     then that consumer will get the event.
            // NOTICE: Only a consumer having registered .process() can receive any events.
            //     A producer therefore receives no events.
            // NOTICE: when job has .timeout: option set, and it times-out, it becomes failed, not stalled.
            // NOTICE: the only way the ui producer can learn about job failures is if a surviving ua consumer catches
            //     and forwards a separate jobFailed message back to the ui.
            maxStalledCount: 0, // WARNING: job will be retried if > 0

            // Poll interval for delayed jobs and added jobs.
            // spread between 5-10 sec (since UA server runs in large cluster)
            guardInterval: (5*1000) + Math.ceil(Math.random()*5*1000),

            // delay before processing next job in case of internal error.
            retryProcessDelay: 60*1000, // one minute

            // A timeout for when the queue is in drained state (empty waiting for jobs).
            drainDelay: 5,
        },

        // see also: https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#queueadd
        defaultJobOptions: { // JobOpts
            // the manual discourages use of this feature,
            // because it comes with performance penalties,
            // so we won't be using it.
            // priority:

            // Milliseconds to wait until this job can be processed by a consumer.
            // By default we want all messages delivered immediately.
            // delay: // (optional; no default)

            // Number of times to try until a job completes.
            // By default we want only one try, followed by zero retries.
            attempts: 1,

            // this should be determined by each job, or not used
            //repeat:

            backoff: {
                type: 'exponential',
                delay: 60*1000, // one minute
            },

            // process youngest requests first
            lifo: true,

            // we allow jobs that are continuing to renew their lock to run forever
            // unless a timeout is specifically configured by the end-user, on a per-job basis
            // timeout: // (optional; no default)

            // this should be determined by each job
            // jobId:

            // this is convenient, but it also means you can't query the job from the 'completed' event.
            // since we don't even listen for that event, it's fine to let this be automatic.
            removeOnComplete: true,

            // this would be convenient, but it also means you can't query the job from the 'failed' event.
            // also, when stalls happen, the job gets moved to the 'failed' state but does NOT honor removeOnFail in that case.
            // so it is easier to let our event handler remove the job.
            removeOnFail: false,

            stackTraceLimit: 10, // reasonable default
        },
    };

Decoder ring: UA = consumer, one of many in our system. UI = producer, only one in our system. Each is its own process. This is the queue used for sending from UI->UA.
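
A sketch of how these options would be applied on both sides (queue name is hypothetical):

    const Queue = require('bull');

    // both the UI (producer) and each UA (consumer) process construct the
    // queue with the same name and the queueOpts defined above
    const uiToUa = new Queue('ui-to-ua', queueOpts);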

Gappa88 commented 5 years ago

I get the same error:

{
  "stack": "ReplyError: ERR Error running script (call to f_d4d1d1b62afa6d576d81cfbecc482ad7e516d172): @user_script:30: ERR invalid expire time in set 
    at Object.moveUnlockedJobsToWait (/path/to/script/node_modules/bull/lib/scripts.js:380:25)
    at Queue.moveUnlockedJobsToWait (/path/to/script/node_modules/bull/lib/queue.js:862:6)
    at isReady.then (/path/to/script/node_modules/bull/lib/queue.js:789:19)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)",
  "message": "ERR Error running script (call to f_d4d1d1b62afa6d576d81cfbecc482ad7e516d172): @user_script:30: ERR invalid expire time in set ",
  "command": {}
}

I don't even need to create a producer script to reproduce the error; I just ran the following code:


const Bull = require('bull');
const queue = new Bull("html_pusher", {
  settings: {
    stalledInterval: 0
  },
  redis:
    { "port": 6379, "host": "10.8.0.12", "db": 3, "showFriendlyErrorStack": true, "password": "xxx" }
});

queue.process(1, (job, done) => {
  console.log(process.pid + ` master executing ${job.id} `, new Date());
  setTimeout(done, 3000);
});

queue.on('error', (err) => {
  console.log(process.pid + ` master error ${err.message}`)
});

queue.on('failed', function (job, err) {
  console.log(process.pid + ` master failed ${job.id} ${err.message}`)
});

queue.on('stalled', function (job) {
  console.log(process.pid + ` master stalled ${job.id}`)
});

queue.on('completed', async (job, result) => {
  console.log(process.pid + ` master completed ${job.id}`)
});

Bull version: 3.10.0

dave7280 commented 5 years ago

Same problem...

ERROR { ReplyError: ERR Error running script (call to f_d4d1d1b62afa6d576d81cfbecc482ad7e516d172): @user_script:80: user_script:80: too many results to unpack
    at parseError (/root/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/root/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:302:14)
  command: {
    name: 'evalsha',
    args: [
      'd4d1d1b62afa6d576d81cfbecc482ad7e516d172',
      '7',
      'bull:docheap:stalled',
      'bull:docheap:wait',
      'bull:docheap:active',
      'bull:docheap:failed',
      'bull:docheap:stalled-check',
      'bull:docheap:meta-paused',
      'bull:docheap:paused',
      '1',
      'bull:docheap:',
      '1563758306417',
      '30000'
    ]
  }
}

gmcnaught commented 4 years ago

We've seen this in a few environments, but only on queues with a limiter applied.
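
For reference, a minimal sketch of a queue with a limiter applied (name and rates are hypothetical):

const Queue = require('bull');

const limited = new Queue('limited-queue', process.env.REDIS_URL, {
  limiter: {
    max: 10,        // at most 10 jobs...
    duration: 1000  // ...per 1000 ms
  }
});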

manast commented 4 years ago

@gmcnaught seems like there is an issue when there are a lot of rate-limited jobs; will take a look.

Robokishan commented 2 years ago

Also facing the same issue; our prod server went down. I have migrated our queues to agendajs because of this. Would love to hear back.