Automattic / kue

Kue is a priority job queue backed by redis, built for node.js.
http://automattic.github.io/kue
MIT License

How do you guys scale Kue? #1047

Open adalyz opened 7 years ago

adalyz commented 7 years ago

Hi everyone, we have been using Kue for over 6 months now, and with our growing needs we need to distribute the workers across multiple machines, all connecting to a central Redis.

We have tried all the options, but none seem to work, so I guess it would be nice to understand how you are distributing the load of Kue workers across multiple machines.

So far the biggest challenge is that when a worker from another machine connects to the central Redis, any problem with that connection makes the whole Kue setup unstable, and processing just stops.

We have tried making the Redis connections reconnect (roughly the sketch at the end of this comment), but it does not seem to work. We also tried twemproxy, but I guess that does not work with Kue either.

I can see open issues about the Redis connection, but we are unable to find a simple way to horizontally scale Kue. Any help here will be much appreciated.
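
For reference, the kind of reconnect configuration we have been experimenting with looks roughly like this (a sketch: `createClientFactory` is Kue's documented hook for supplying your own client, `retry_strategy` is node_redis's reconnection option, and the host is a placeholder):

```js
var kue = require('kue');
var redis = require('redis');

var queue = kue.createQueue({
  redis: {
    // Supply our own node_redis client so we control reconnection.
    createClientFactory: function () {
      return redis.createClient({
        host: 'central-redis.example.com', // placeholder for the central host
        port: 6379,
        retry_strategy: function (options) {
          if (options.attempt > 10) return undefined;   // give up after 10 tries
          return Math.min(options.attempt * 100, 3000); // back off up to 3s
        }
      });
    }
  }
});
```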

behrad commented 7 years ago

We are adding as many machines as we want to a single Redis! You should track down your Redis connection issues; they may be VM-, network-, ... specific.
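
(For reference, there is nothing special in our setup; each machine just runs the same bootstrap and points Kue at the one central Redis via the standard connection options. A minimal sketch, with placeholder host and password:)

```js
var kue = require('kue');

// Identical bootstrap on every worker machine, all pointing at one Redis.
var queue = kue.createQueue({
  prefix: 'q',
  redis: {
    host: 'central-redis.example.com', // placeholder
    port: 6379,
    auth: 'secret'                     // placeholder
  }
});

queue.process('work', function (job, done) {
  // ... handle one job ...
  done();
});
```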

What is your scale?

adalyz commented 7 years ago

Currently we have 4 machines, but every machine is using its own local Redis; as soon as we try to connect one more machine to an existing Redis, we start getting problems.

We run on Azure, and I have tried both Azure Cache and a dedicated self-hosted VM for Redis. We are processing > 1 million jobs on each machine every day. Every machine has 4 cores and 8 GB RAM, which we use for the workers + Redis.

The error we get is a read timeout from Redis. Is there any way to debug this to find the actual cause?

We also tried using twemproxy, but that did not work out either.

behrad commented 7 years ago

> The error we get is a read timeout from Redis

Never faced that. You should inspect more to spot the problem: is it a memory issue, or a network one, with your Redis machine? Maybe you'd better add the second machine's load incrementally to see what is actually happening...
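
One cheap way to separate the two, if it helps: a sketch (plain node_redis, not a Kue API; the host is a placeholder) that logs the PING round-trip time against the same server your workers use. Sustained spikes point at the network, or a blocked Redis, rather than at Kue itself:

```js
var redis = require('redis');

var client = redis.createClient({
  host: 'central-redis.example.com', // placeholder: your central Redis
  port: 6379
});

// Log the PING round-trip once a second.
setInterval(function () {
  var start = Date.now();
  client.ping(function (err) {
    console.log('redis ping:', err ? err.message : (Date.now() - start) + 'ms');
  });
}, 1000);
```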

MichaelTurbe commented 7 years ago

We are running several Kue instances against the same Redis instance (on Heroku) with no issues.

osher commented 7 years ago

How do multiple instances behave with the worker limit? Do they share the limit across all instances, or does each instance apply its own worker limit?

The first would be very cool; the second, not so much...

MichaelTurbe commented 7 years ago

You mean the concurrency of each job type? It's the second. For example, if I have 3 workers that can process a particular job type, with the concurrency set to 2 for that job type, I will end up with 6 jobs of that type running at the same time. Is that what you meant?
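
In code, that limit is just the second argument to `queue.process`, and it applies per process; a sketch (the 'email' job type is made up):

```js
var kue = require('kue');
var queue = kue.createQueue();

// Concurrency 2 is local to THIS process: with 3 worker processes
// running the same code, up to 6 'email' jobs run at the same time.
queue.process('email', 2, function (job, done) {
  // ... do the work for one job ...
  done();
});
```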

daniellevinson commented 6 years ago

Hey guys,

Is there a code example somewhere that's tested and works flawlessly when scaling on multiple machines? I've tried to simulate this by running multiple Docker containers on my local machine. I've implemented the recommended graceful shutdown (sketched below); however, when I stop one of the containers (the other ones keep running), the job that's running on the stopped container gets stuck in the active state. Adding more jobs or containers makes no difference. I do see the Kue shutdown message in the logs, so I know queue.shutdown does get called. Is there a workaround? I couldn't find one in other threads.
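
Roughly, my wiring follows the README pattern (the 5000 ms grace period is arbitrary):

```js
var kue = require('kue');
var queue = kue.createQueue();

process.once('SIGTERM', function () {
  // Give active jobs up to 5s to finish; anything still active after
  // that is failed with a shutdown error.
  queue.shutdown(5000, function (err) {
    console.log('Kue shutdown:', err || 'OK');
    process.exit(0);
  });
});
```

One thing I'm double-checking on my side: `docker stop` sends SIGTERM to PID 1, so if the container starts node through a shell (CMD in shell form), the signal may never reach the node process at all.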

Thanks!

MichaelTurbe commented 6 years ago

When I have to shut down or restart any of the worker servers, the active jobs always get stuck in active; I just have to restart them manually. Not optimal, but good enough for now, and I'm not sure how you'd get around that.
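
The restart step can at least be scripted with the recovery recipe from the README; a sketch, with the big caveat that queue.active() lists every active job in the queue, so blindly re-queuing is only safe when no other worker can be mid-job:

```js
var kue = require('kue');
var queue = kue.createQueue();

// On boot, push everything left in the active state back to queued.
queue.active(function (err, ids) {
  if (err) throw err;
  ids.forEach(function (id) {
    kue.Job.get(id, function (err, job) {
      if (err) return;
      job.inactive(); // back to the inactive (queued) state
    });
  });
});

// Newer Kue versions also ship a watchdog for stuck jobs:
queue.watchStuckJobs(6000); // check every 6 seconds
```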

daniellevinson commented 6 years ago

Thanks for the comment @MichaelTurbe!

"Not optimal" is a huge understatement for manually restarting jobs in an automated job queue. This is core functionality, I'd switch to ampqlib or something, give up the nice abstraction and UI but at least make sure jobs are never stuck. But that's my opinion.

osher commented 6 years ago

@MichaelTurbe - yes, that's what I meant. Sorry to hear that. I understand that to accomplish this we'd need a change in architecture.

Maybe it's time to start gathering requirements for a version that can satisfy needs like these.