CJWorkbench / channels_rabbitmq

A Django Channels channel layer that uses RabbitMQ as its backing store

scalability #36

Closed aryaniyaps closed 3 years ago

aryaniyaps commented 3 years ago

How much can this project scale? I am curious because the redis channel layer provided by django is really slow when it comes to group sending, and I am looking for alternatives.

https://github.com/django/channels_redis/issues/83

thanks a lot!

adamhooper commented 3 years ago

I have no hard numbers, but I boldly promise: "better than channels_redis".

At scale, you want to maximize:

A. Single-node message throughput -- the number of messages routed through a single Channels client
B. Cluster message throughput -- the total number of messages sent through the entire system

When it comes to single-node throughput, there should be no contest: channels_redis blocks every WebSocket connection for every .receive(). channels_rabbitmq does not. So if a message gets sent to 100 channels on a single server, channels_redis will request 100 messages one-after-the-other, and channels_rabbitmq will receive all 100 messages near-simultaneously.

I haven't measured how big a difference that makes. I'd expect channels_rabbitmq to be maybe 10x faster, if network requests cost ~10x as much wall time as Channels' innards and your consumer.
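The serial-vs-concurrent difference described above can be sketched with a toy timing model (the delay is made up for illustration, not a measurement of either layer):

```python
import asyncio
import time

async def fake_network_receive(delay: float = 0.005) -> str:
    # Stand-in for one round trip to the broker (hypothetical 5 ms).
    await asyncio.sleep(delay)
    return "message"

async def serial(n: int) -> float:
    # channels_redis-style: each receive blocks the next, one after another.
    start = time.monotonic()
    for _ in range(n):
        await fake_network_receive()
    return time.monotonic() - start

async def concurrent(n: int) -> float:
    # channels_rabbitmq-style: the broker pushes; all receives overlap.
    start = time.monotonic()
    await asyncio.gather(*(fake_network_receive() for _ in range(n)))
    return time.monotonic() - start

serial_t = asyncio.run(serial(20))
concurrent_t = asyncio.run(concurrent(20))
print(f"serial: {serial_t:.3f}s, concurrent: {concurrent_t:.3f}s")
```

With 20 messages, the serial version pays 20 round trips back-to-back while the concurrent version pays roughly one.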

But that 10x speedup is just single-node performance.

The real scaling problem in channels_redis is actually B -- the whole cluster. group_send() is ... well ... absurd. To group_send() on Redis, the layer will:

  1. Tell the Redis server holding the "group" to expire old WebSocket connections (this is a bug; we'll get to it later)
  2. Read the entire data structure -- that is, read the list of all WebSocket connections listening to that group
  3. Connect to each Redis server that has a subscriber, in serial
  4. On each Redis server, append the message to each message queue

By my count, channels_redis group_send() grows O(n^2) with respect to the number of connections if all connections join the same group: each group_send() reads the full list of n subscribed connections and appends the message to n queues, so n events sent to the group cost O(n^2) work in total.
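That back-of-envelope count can be written down as a cost model (my arithmetic, not a benchmark of channels_redis):

```python
def group_send_cost(n_connections: int) -> int:
    # Per the steps above: one group_send reads the full subscriber list
    # and appends the message to every subscriber's queue -- O(n) work.
    return n_connections

def total_cost(n_connections: int) -> int:
    # If each of the n connections triggers one group_send to the same
    # group, total work is n * O(n) = O(n^2) queue appends.
    return sum(group_send_cost(n_connections) for _ in range(n_connections))

print(total_cost(10))   # 100 appends
print(total_cost(100))  # 10000 appends
```

Doubling the connections quadruples the work, which is what "O(n^2)" predicts.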

This is my theorizing, anyway. I haven't tested.

I haven't tested, because why would anybody need to test? Redis is not a message broker. It's incorrect on a single node; why would anybody want to scale something faulty?

We at Workbench left Redis when we discovered the bug in step 1: if you don't group_expire healthy WebSocket connections, the whole cluster stalls without notification; and if you do group_expire healthy WebSocket connections, then you aren't doing your job.

I get riled up about this. Projects like channels_redis are fundamentally flawed. Yet users and developers all double down and double down and double down, trying to accomplish the impossible. There's a perfectly free, handy, sound system out there in RabbitMQ and nobody loves me because I berate people for wasting years of their own time and other people's time instead of spending the minutes it would take to test out docker run rabbitmq:3.

aryaniyaps commented 3 years ago

Thanks a lot for the information, @adamhooper! I have a question regarding my data model, though.

In my data model, there are objects called boxes. Many people can join a box -- 100k or more, with no hardcoded limit. Whenever a user puts a file in a box, I need to send a FILE_CREATE event to everyone else present in the box.

And there are many more events, like when users leave a box or join it. This is why I worry. Will channels_rabbitmq be able to handle this? If not, can you please suggest some other solutions?
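The box scenario maps naturally onto a channel layer's groups: one group per box, one group_send per event. A stdlib-only toy of that fan-out pattern (the names and functions here are illustrative, not the real channel-layer API):

```python
import asyncio
from collections import defaultdict

# Toy in-memory stand-in for a channel layer's group fan-out.
# groups maps a group name (one per "box") to member message queues.
groups: dict[str, set] = defaultdict(set)

def group_add(group: str, queue: asyncio.Queue) -> None:
    # A member "joins the box" by subscribing its queue to the group.
    groups[group].add(queue)

async def group_send(group: str, message: dict) -> None:
    # One event fans out to every member's queue.
    for queue in groups[group]:
        await queue.put(message)

async def main() -> list:
    box = "box.42"
    members = [asyncio.Queue() for _ in range(3)]
    for q in members:
        group_add(box, q)
    await group_send(box, {"type": "FILE_CREATE", "file": "report.pdf"})
    return [q.get_nowait() for q in members]

received = asyncio.run(main())
print(received)
```

A real channel layer implements this same pattern for you across servers; the scaling question is how much each group_send costs the backing store.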

aryaniyaps commented 3 years ago

> To scale enormously, this layer only creates one RabbitMQ queue per instance. That means one web server gets one RabbitMQ queue, no matter how many websocket connections are open. For each message being sent, the client-side layer determines the RabbitMQ queue name and uses it as the routing key.

Have you tested against RabbitMQ to see how many concurrent connections one queue can handle? As per the quote above, in order to have more queues we would need more web server instances, am I right?

adamhooper commented 3 years ago

RabbitMQ is designed to handle huge loads -- tens of thousands of messages per second per CPU, easy. channels_rabbitmq is designed to stay out of the way.

I have not benchmarked. In Workbench, with hundreds of concurrent connections, channels_rabbitmq costs essentially zero overhead, so I haven't needed a round of optimizations. I look forward to someone benchmarking and suggesting optimizations based on that evidence.
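(For anyone landing here later: wiring the layer into Django settings looks roughly like this. The BACKEND path and CONFIG keys below are my recollection of the channels_rabbitmq README, so verify them there before use.)

```python
# settings.py -- sketch, not copied from the docs; check the
# channels_rabbitmq README for the exact backend path and options.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_rabbitmq.core.RabbitmqChannelLayer",
        "CONFIG": {
            # One web-server instance <=> one RabbitMQ queue, regardless
            # of how many websocket connections that instance holds.
            "host": "amqp://guest:guest@localhost/",
        },
    },
}
```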

If you're serious about handling 100k+ concurrent connections, abandon Python now. I cannot imagine a more expensive language for handling 100k concurrent connections.

aryaniyaps commented 3 years ago

@adamhooper I've settled on Elixir for handling the concurrent connections after learning it. Thanks for your reply!

adamhooper commented 3 years ago

Probably a good idea.

For posterity: I suggest -- based on intuition, not evidence! -- staying away from Django if you want to serve 500-1,000 active connections per web server. Node, Go and Elixir should make it easier to write efficient software and harder to introduce monstrous bottlenecks.