OptimalBits / bull

Premium Queue package for handling distributed jobs and messages in NodeJS.

REDIS Timeout Issue #1602

Open elucidsoft opened 4 years ago

elucidsoft commented 4 years ago

I have no idea why this happens, but it does. When the Redis server gets rebooted and your Queue loses its connection, it doesn't reconnect. We have discussed this before; you believe it's an issue with ioredis. But I'm not sure now: after spending several hours playing with this, I can consistently reproduce the issue.

Some observations:

I created the following rather ugly, horribly ugly code for my healthcheck as a temporary workaround until I can find the root cause of this issue:

const job = await this.queueService.queue.add(null, { removeOnComplete: true, delay: 9999 });
await job.remove();

This code will ALWAYS throw an exception in this scenario, so it's a pretty good check to see...
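A slightly more defensive version of that probe can wrap the add() call in a timeout, so the healthcheck itself cannot hang if the connection is dead. This is only a sketch: `withTimeout`, `ProbeQueue`, and `redisHealthy` are hypothetical names, and the `queue` parameter stands in for `this.queueService.queue`.

```typescript
// Hypothetical helper: reject if a promise does not settle within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Minimal shape of the queue the probe needs; a Bull queue satisfies it.
interface ProbeQueue {
  add(data: unknown, opts: object): Promise<{ remove(): Promise<void> }>;
}

// Sketch of the healthcheck: true if a throwaway delayed job can be added
// (and removed) before the timeout, false otherwise.
async function redisHealthy(queue: ProbeQueue, timeoutMs = 2000): Promise<boolean> {
  try {
    const job = await withTimeout(
      queue.add(null, { removeOnComplete: true, delay: 9999 }),
      timeoutMs,
    );
    await job.remove();
    return true;
  } catch {
    return false;
  }
}
```

The remove() call could be wrapped in the same timeout if it also hangs in this scenario.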

Environment:

Kubernetes cluster, I have tried the following environments and it seems to happen on each with different errors:

Single Redis instance: same behavior, but you get an ECONNREFUSED error from ioredis.

Sentinel Redis with 3 instances (master/slave): if you kill all of them simultaneously you get an "All sentinels are down" error from ioredis.

Both appear to behave exactly the same. If you reboot your app, everything works again.

elucidsoft commented 4 years ago

https://github.com/taskforcesh/bullmq/issues/83

stansv commented 4 years ago

Hi @elucidsoft, thanks for your investigation!

I believe you got these results, but I'm confused by the fact that this doesn't exactly match what I observed before, when I did my own research.

I cannot explain why only the add() method stops working. I suppose this could be related to the Lua script executed on Redis inside add(); any other Bull API call that uses Lua internally would fail as well.

The only issue I know of in Bull is related to the internal initializing promise. This promise is used to indicate that the internal ioredis instance is created and initialized, which happens when all Lua scripts are registered with the defineCommand API method provided by ioredis. The problem is that if this promise becomes rejected, the Queue instance methods won't work anymore, since internally they always make sure that the initializing promise is resolved. However, if you execute a Redis command manually via queue.client (for example, 'SADD'), it will resolve once Redis is up.
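To make that failure mode concrete, here is a self-contained illustration of the caching pattern described above (this is not Bull's actual code, just the shape of the problem): a client that caches its initializing promise keeps returning the same rejection forever, while a variant that drops the cached promise on rejection recovers on the next call.

```typescript
// A client that caches its init promise and never clears it: a single
// rejected init poisons every later command.
class BrokenClient {
  private initializing: Promise<void> | null = null;
  constructor(private connect: () => Promise<void>) {}

  private init(): Promise<void> {
    if (!this.initializing) this.initializing = this.connect();
    return this.initializing; // a cached rejection is returned forever
  }

  async command(): Promise<string> {
    await this.init();
    return 'OK';
  }
}

// Same idea, but the cached promise is dropped on rejection, so the
// next command() retries the connection from scratch.
class RecoveringClient {
  private initializing: Promise<void> | null = null;
  constructor(private connect: () => Promise<void>) {}

  private init(): Promise<void> {
    if (!this.initializing) {
      this.initializing = this.connect().catch((err) => {
        this.initializing = null; // forget the poisoned promise
        throw err;
      });
    }
    return this.initializing;
  }

  async command(): Promise<string> {
    await this.init();
    return 'OK';
  }
}
```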

What I can suggest is to implement some retry logic when creating the queue instance; it may look like this:

const createQueue = async (name: string): Promise<Bull.Queue> => {
    let attempts = 100; // can be Infinity, but I'm not sure about memory leaks
    let queue: Bull.Queue | undefined;

    while (attempts-- > 0) {
        try {
            queue = new Bull(name);
            await queue.isReady();
            return queue;
        } catch (error) {
            // log error if you want
            if (queue) {
                try {
                    await queue.close(true);
                } catch (_) {
                    // ignore errors while closing a half-initialized queue
                }
                queue = undefined;
            }
            await new Promise((resolve) => { setTimeout(resolve, 1000); });
        }
    }
    throw new Error(`failed to create queue "${name}" after 100 attempts`);
};
stansv commented 4 years ago

Also, can you please share a more complete set of steps to reproduce the behavior you observed? I didn't try k8s; my Redis is running in Docker, but the node process is executed natively on the host.

manast commented 4 years ago

So maybe the concept of initializing must be rethought in BullMQ, since when a Redis instance is completely rebooted, all loaded commands must be reloaded. A better approach would be lazy loading of the commands, but then we may need to use our own code for handling commands rather than the one provided by ioredis.

elucidsoft commented 4 years ago

Ahh, I didn't think about recreating the Bull instance. This is a good solution. Also, is there an easier way to detect this scenario rather than calling add()?

elucidsoft commented 4 years ago

@manast This behavior is not fixed by using Sentinel either. When a Redis instance goes down, ioredis switches to a new master via Sentinel, which seems to work OK until the original master comes back up and gets voted back to master; then we start seeing the same issue.

faller commented 4 years ago

I have encountered the timeout issue twice when the master crashed in Sentinel mode.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.