centrifugal / centrifugo

Scalable real-time messaging server in a language-agnostic way. Self-hosted alternative to Pubnub, Pusher, Ably. Set up once and forever.
https://centrifugal.dev
Apache License 2.0
8.46k stars 597 forks

[question] Sending channel name to subscribe proxy #701

Closed babakbay closed 1 year ago

babakbay commented 1 year ago

I have subscribe requests proxied to a backend script; however, this script needs to know the name of the channel the client is subscribing to. Since there will be many channels, each processing data from a different external source, I only want to ingest and process data for a channel if it has active subscribers.

The docs state a payload will be sent to the backend. e.g.

{
  "client":"9336a229-2400-4ebc-8c50-0a643d22e8a0",
  "transport":"websocket",
  "protocol": "json",
  "encoding":"json",
  "user":"56",
  "channel": "chat:index"
}

I checked and the backend seems to be receiving an empty POST body. What am I missing? Is there a special config parameter to enable passing the payload to the backend?

Also, I need to stop the data feed for a channel when it has no more subscribers. I guess the best way to do this would be to use the "presence: true" config and periodically check for any subscribers using the API at /api/presence_stats?

My config:

{
  "token_hmac_secret_key": "...",
  "admin_password": "...",
  "admin_secret": "...",
  "api_key": "...",

  "allowed_origins": ["*"],
  "admin": true,
  "admin_insecure": true,

  "redis_address": "127.0.0.1:6379",
  "redis_password": "",
  "redis_prefix": "centrifugo",

  "history_size": 10,
  "history_ttl": "30s",

  "presence": true,

  "allow_anonymous_connect_without_token": true,

  "proxy_subscribe": true,
  "proxy_subscribe_endpoint": "http://localhost/backend.php",
  "proxy_subscribe_timeout": "2s",
  "proxy_http_headers": [
    "Cookie"
  ],

  "namespaces": [
    {
      "name": "public",
      "proxy_subscribe": true,
      "presence": true
    },
    {
      "name": "members",
      "allow_subscribe_for_client": true,
      "proxy_subscribe": true,
      "presence": true
    }
  ]
}

Thank you.

babakbay commented 1 year ago

Aha, found the problem with not receiving the payload. It has nothing to do with Centrifugo; it was a PHP thing. In case it helps anyone using PHP for backend scripting: use $jsonData = file_get_contents('php://input'); instead of $_POST to capture the payload, since the content type is 'application/json'.

However maybe I can get your response on this:

Also, I need to stop the data feed for a channel when it has no more subscribers. I guess the best way to do this would be to use the "presence: true" config and periodically check for any subscribers using the API at /api/presence_stats?

FZambia commented 1 year ago

Hey, @babakbay

Also, I need to stop the data feed for a channel when it has no more subscribers. I guess the best way to do this would be to use the "presence: true" config and periodically check for any subscribers using the API at /api/presence_stats?

I suppose at this point yes - polling presence is required. It would be great to know more details about your use case, because I'm still thinking about how Centrifugo could simplify such scenarios - what is the task you want to do while there are active subscribers?
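The polling approach confirmed above could look roughly like the following. This is only a sketch: the API address, the api_key placeholder, and the helper names fetch_num_clients and channels_to_stop are assumptions made for illustration, and it presumes the presence_stats server API method with the X-API-Key header as used in Centrifugo v5.

```python
import json
import urllib.request

CENTRIFUGO_API = "http://localhost:8000/api"  # assumed address of the server API
API_KEY = "<api_key>"  # placeholder for the "api_key" value from the config


def fetch_num_clients(channel):
    """Ask Centrifugo how many clients are currently present in a channel."""
    req = urllib.request.Request(
        f"{CENTRIFUGO_API}/presence_stats",
        data=json.dumps({"channel": channel}).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.load(resp)["result"]["num_clients"]


def channels_to_stop(active_channels, num_clients=fetch_num_clients):
    """Return the channels whose data feed can be stopped (no clients present)."""
    return [ch for ch in active_channels if num_clients(ch) == 0]
```

A scheduler would call channels_to_stop with the list of currently running feeds every few seconds and stop whatever it returns. The lookup function is injectable so the decision logic can be tested without a running server.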

babakbay commented 1 year ago

The project I'm working on involves consuming raw data from financial markets (both stock markets and crypto exchanges). It processes the raw data (doing calculations, aggregations, etc.) and publishes the resulting msgs via websocket channels to clients. We want to do this only for equities that have at least one subscribed client. This is where the subscription proxy of centrifugo comes into play. This is the design I'm implementing, e.g.

  1. User subscribes to websocket channel market:crypto:binance:btc-usdt
  2. Centrifugo subscription proxy sends subscription request to backend script
  3. Backend script uses channel name to determine what data is being requested and invokes a script to consume and process raw data (i.e. btc data from binance websocket api)
  4. Processed data is sent to active subscribers for channel market:crypto:binance:btc-usdt
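Steps 2 and 3 above might be sketched as below. The four-part channel layout, the function names, and the start_feed callback are hypothetical; the request fields follow the payload shown earlier in this thread. The {"result": {}} and {"error": {...}} response shapes are how I understand the subscribe proxy contract, and the error code is an arbitrary application-defined placeholder.

```python
import json


def parse_market_channel(channel):
    """Split 'market:crypto:binance:btc-usdt' into its parts (assumed layout)."""
    parts = channel.split(":")
    if len(parts) != 4 or parts[0] != "market":
        raise ValueError(f"unexpected channel name: {channel!r}")
    _, asset_class, exchange, symbol = parts
    return {"asset_class": asset_class, "exchange": exchange, "symbol": symbol}


def handle_subscribe(raw_body, start_feed):
    """Handle one subscribe proxy request and build the JSON-serializable reply."""
    request = json.loads(raw_body)  # the proxy sends a JSON body, not form data
    try:
        feed = parse_market_channel(request["channel"])
    except (KeyError, ValueError):
        # Reject subscriptions to channels this backend does not know about.
        return {"error": {"code": 1000, "message": "unknown channel"}}
    start_feed(feed)  # hypothetical hook that starts (or reuses) the data feed
    return {"result": {}}  # an empty result allows the subscription
```

The backend would serialize the returned dict as the HTTP response body. start_feed should be idempotent, since every new subscriber to an already active channel triggers the same call.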

The amount of data that needs to be consumed, processed, and published when there are numerous active channels can quickly become massive, especially if clients request to receive order book updates: tens of thousands per second. Currently I have the processed data getting stored into redis timeseries as well as a copy sent to a redis PubSub channel.

Now this is where I had a similar concern to open ticket #673.

The overhead involved in sending each msg via an http request to the admin api concerns me. I've read your response where you state the additional overhead would be minimal. Have you tested it with tens of thousands of msgs per second via admin api vs directly into PubSub channel? Not saying it'll be too slow I just haven't done any benchmarks. I tried to publish directly into the corresponding centrifugo PubSub channel which admin api sends to but noted the data needs to be encoded first (I think to protobuf?)
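For anyone wanting to benchmark this, one way to reduce per-message HTTP overhead is to send several publications in a single request. The sketch below only builds such a request; it assumes the v5 HTTP API exposes a batch method at /api/batch that accepts a "commands" array, and the address, the key placeholder, and the function name are made up for illustration.

```python
import json
import urllib.request


def build_batch_request(messages, api_url="http://localhost:8000/api",
                        api_key="<api_key>"):
    """Build one HTTP request that publishes many (channel, data) pairs."""
    commands = [{"publish": {"channel": ch, "data": data}} for ch, data in messages]
    return urllib.request.Request(
        f"{api_url}/batch",
        data=json.dumps({"commands": commands}).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
```

Sending it is a matter of passing the request to urllib.request.urlopen. Timing batches of different sizes against single publish calls would give actual numbers for the overhead in question.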

Now I understand publishing directly to pubsub channel vs admin api probably loses some functionality such as msg history and breaks centrifugo protocol. I also noted that in previous versions it seemed possible to publish directly via redis using LPUSH. So I'm wondering:

  1. What logic does the admin api perform?
  2. Perhaps I can perform the same logic within redis via JS using its new redis triggers and functions feature (available in v7.2):

If the LPUSH way of publishing was available in centrifugo v5 then perhaps the following solution will be ideal for my purpose:

LPUSH msg and use redis v7.2 to trigger a JS function on each LPUSH. That JS function does whatever admin api does, then encodes the msg using protobuf and publishes it to the same PubSub channel admin api does.

What are your thoughts on this? I'm not sure what logic admin api performs.

I'm also wondering if sharding admin api publish requests across multiple centrifugo admin APIs would reduce latency. Is this possible when clustering centrifugo or do all msgs need to be published to a single admin api instance?

Finally, perhaps if admin api can receive these requests by some other means than plain old http (redis PubSub, SSE, HTTP 3,..) that might help too.

FZambia commented 1 year ago

Thanks for the details!

The amount of data that needs to be consumed, processed, and published when there are numerous active channels can quickly become massive, especially if clients request to receive order book updates.

Well, at least it seems there is no scalability bottleneck in this scheme; all the parts may be scaled up. BTW, check out the https://centrifugal.dev/blog/2023/08/29/using-centrifugo-in-rabbitx blog post if you haven't seen it yet. Seems relevant.

The overhead involved in sending each msg via an http request to the admin api concerns me. I've read your response where you state the additional overhead would be minimal. Have you tested it with tens of thousands of msgs per second via admin api vs directly into PubSub channel?

I don't remember saying that the overhead would be minimal in this case; I may have said it in some other context. Anyway, my concern about publishing over a queue or via a Redis PUB/SUB channel still stands - it breaks current API boundaries. Publishing over Redis PUB/SUB may be more resource efficient for large volumes since there is no HTTP machinery involved. In terms of latency, it's an RTT/2 overhead between Centrifugo and Redis.

I tried to publish directly into the corresponding centrifugo PubSub channel which admin api sends to but noted the data needs to be encoded first (I think to protobuf?)

Yes, it's Protobuf encoded. So to publish directly to Redis, the message must be properly encoded - but it's an internal part of Centrifugo, with all the backwards compatibility consequences.

Now I understand publishing directly to pubsub channel vs admin api probably loses some functionality such as msg history and breaks centrifugo protocol.

Yes, it only provides at-most-once semantics, which may be enough if you design your messages to have some app-specific identifiers. But Centrifugo's built-in history features won't work in this case, because when publishing over the server API we execute a Lua script which also adds the publication to a Redis STREAM data structure.
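To illustrate the point about app-specific identifiers: if every message carries a per-channel sequence number assigned by the publisher, a consumer can notice lost or repeated messages even with at-most-once delivery. This is a hypothetical sketch; the seq field and the GapDetector class are not part of Centrifugo.

```python
from dataclasses import dataclass, field


@dataclass
class GapDetector:
    """Track the last sequence number seen per channel."""
    last_seq: dict = field(default_factory=dict)

    def observe(self, channel, seq):
        """Classify an incoming message as 'ok', 'gap' or 'duplicate'."""
        prev = self.last_seq.get(channel)
        if prev is not None and seq <= prev:
            return "duplicate"  # already seen, safe to drop
        self.last_seq[channel] = seq
        if prev is not None and seq > prev + 1:
            return "gap"  # at least one message was lost; the app may resync
        return "ok"
```

On a "gap" result the application could, for example, reload a fresh order book snapshot from its own storage instead of relying on Centrifugo history.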

LPUSH msg and use redis v7.2 to trigger a JS function on each LPUSH. That JS function does whatever admin api does, then encodes the msg using protobuf and publishes it to the same PubSub channel admin api does.

Well, I suppose it's possible; you can just use the same Lua we are executing when publishing. But again - it's a Centrifugo implementation detail. And while there are no plans to refactor it at this point, we may do so in the future.

Though using functions from Redis 7.2 could potentially help to standardize things, since Lua seems a bit more involved to operate with.

I'm also wondering if sharding admin api publish requests across multiple centrifugo admin APIs would reduce latency. Is this possible when clustering centrifugo or do all msgs need to be published to a single admin api instance?

If I understood the question correctly - yes, you can publish to different Centrifugo nodes. Regarding latency - Centrifugo processes requests concurrently, so most probably you will only notice a latency difference if one node can't keep up with the load.

Finally, perhaps if admin api can receive these requests by some other means than plain old http (redis PubSub, SSE, HTTP 3,..) that might help too.

I'd say that the verdict here should be based on real data and a real problem, showing how avoiding Centrifugo HTTP/GRPC interfaces helps to solve it. I think publishing over Redis directly may be beneficial – no doubt – we just can't really proceed with standardizing such things without numbers justifying breaking API boundaries in this way. That's why I am keeping #673 open.

But regarding latency - note that the RTT/2 latency overhead between Centrifugo and Redis may be perfectly acceptable for your use case, so there is no need to introduce extra complexity. The CPU overhead of the HTTP/GRPC machinery may be acceptable too - it scales, and Centrifugo knows how to scale Redis as well, hiding this behind its API. So you don't need to think about scaling Redis or about publishing data to the correct Redis node; this all works out of the box with the current server APIs.

FZambia commented 1 year ago

One more thing that just came to mind - probably for your use case you don't really need Redis at all, and can somehow load balance users over Centrifugo nodes with the Memory engine. It may be very efficient, and there won't be additional latency for Centrifugo-Redis communication. It will require some layer on top for proper routing of connections and publishing.
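One possible shape for that routing layer, purely as an illustration: map each channel to a node with a stable hash, so that subscribers and the publisher of a channel always pick the same node. The function name and node addresses are hypothetical, and this is not a built-in Centrifugo feature.

```python
import hashlib


def node_for_channel(channel, nodes):
    """Pick a node for a channel with a stable hash (same input, same node)."""
    digest = hashlib.sha256(channel.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

The trade-offs are that a client interested in channels living on different nodes needs one connection per node, and that changing the node list remaps most channels unless a consistent hashing scheme is used instead of a plain modulo.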

FZambia commented 1 year ago

Closing; I'm not sure there is anything actionable left here. Overall, some good ideas were suggested, but we need more data to proceed with them.