realms: implement shared runtime state

om26er commented 2 years ago

Serialization is probably the most CPU intense thing that Crossbar does, apart from dealing with websocket framing. As Crossbar is a single process, this limits the max number of messages the router can process.

This is probably a wild and misplaced idea but would it be possible to offload the serialization to multiple worker processes or that against the overall crossbar architecture ?

EDIT As a memory optimization and better integration is would make sense to optionally support dumping realm runtime state on storage, this enables N workers running on a host to have the shared state. This is very effective as far as memory consumption is concerned.

om26er commented 2 years ago

The reason why running multiple router workers to distribute the "load" is not feasible is because, when using dynamic realms, both router workers would have to start all the realms and this gets problematic very quickly in terms of RAM usage. For example starting 100k dynamic realms takes like 13GB RAM when using PyPy.

om26er commented 2 years ago

One "optimization" of course is to enforce a single serializer like MsgPack or CBOR, so that the actual CPU usage is reduced. It is quite clear that when caller and callee are using non-matching serializers, the router has to do plenty of work.

oberstet commented 2 years ago

Serialization is probably the most CPU intense thing that Crossbar does, apart from dealing with websocket framing.

yes, true!

other than that: if you activate websocket compression, if you run TLS. those (and the 2 you list) are the big cpu eaters. websocket: small msgs, its the framing. large msgs, its the masking/utf8 and actual byte copying. for the masking/utf8, there is https://github.com/crossbario/autobahn-python/tree/master/autobahn/nvx - that works, but currently isn't integrated into AB. the code is for utf8. for masking: I somewhere should have some intrinsics based code. it's relatively easy.

This is probably a wild and misplaced idea but would it be possible to offload the serialization to multiple worker processes or that against the overall crossbar architecture ?

both yes: it would work kinda against architecture, but more importantly, it would speed up things (well, my bet, not that I measured it;)

btw, here is a baseline: https://crossbario.com/docs/crossbarfx/benchmarks.html#wamp-message-serialization

One "optimization" of course is to enforce a single serializer like MsgPack or CBOR, so that the actual CPU usage is reduced.

yes, not having to re-serialize is very effective - in WAMP, this is called "payload transparency" - and XBR is using payload transparency (but more because of e2e)

https://github.com/crossbario/crossbar-examples/tree/master/payloadcodec

This is probably a wild and misplaced idea but would it be possible to offload the serialization to multiple worker processes or that against the overall crossbar architecture ?

running multiple router workers is the solution and already implemented - I use 50k calls/sec as max perf / process for fully routed wamp msgs with cbor
not touching the app payload is also effective (that comes for free with xbr)

oberstet commented 2 years ago

For example starting 100k dynamic realms takes like 13GB RAM when using PyPy.

this sounds bogus ... sth must be wrong .. that's 140k/realm?

The reason why running multiple router workers to distribute the "load" is not feasible is because, when using dynamic realms, both router workers would have to start all the realms and this gets problematic very quickly in terms of RAM usage.

I disagree with that.

the overhead per worker for running a new pypy process (a worker) - is one thing. but non critical. same per listening socket
the overhead per realm within a worker: this is critical. there might be issue (above?). if so, we should be able to fix. the other means to improve: use a zLMDB database for a realm, shared between workers. a python process can mmap a 1TB db without increasing its own mem footprint

I've added an "needs-investigation" label to this issue ... because ^.

om26er commented 2 years ago

My initial 13GB for 100k realms was wrong, it was for 2 router workers, each running 100k realms, sorry. So it's 7+ GB for 100k realms. Is that "reasonable" ?

I have created a quick project to show that https://github.com/om26er/cb-dynamic -- only need to run crossbar start from root directory. Each realm has two roles and 1 router component. Starting this does take 4-5 minutes.

Note: CPython is like 2-3X slow when starting 100k realms as compared to PyPy.

om26er commented 2 years ago

2. the overhead per realm within a worker: this is critical. there might be issue (above?). if so, we should be able to fix. the other means to improve: use a zLMDB database for a realm, shared between workers. a python process can mmap a 1TB db without increasing its own mem footprint

Interesting, that kind of solves it all IMO i.e. if we can reduce the memory footprint like that, then running 8 workers on a 8 core system could potentially enable 200k calls/second on a single node (more below)

running multiple router workers is the solution and already implemented - I use 50k calls/sec as max perf / process for fully routed wamp msgs with cbor

I only get those numbers (50k calls/sec) when both the caller and callee are using RawSocket+CBOR. I get 25k calls/sec when the caller is using WebSocket+CBOR (which is generally the case, in production)

oberstet commented 2 years ago

So it's 7+ GB for 100k realms. Is that "reasonable" ?

actually not. I don't have an explanation for this mem consumption, because a realm is nothing more than a bunch of python objects. I would have expected <1kB/realm. it is definitely worth tracking down and understanding ...

rgd test project: perfect! this allowed me to quickly check what is actually started, and how. eg

Starting this does take 4-5 minutes.

one optimization is to start realm in parallel - not use co-routine await in the mgmt api calls, but collect the deferreds and wait on the list.

the other optimization would come with use of zLMDB, to share (and optionally persist) the run-time state, incl. realms, and then not having to create all objects on the heap from scratch, but simply access the data when needed, and via zLMDB driven mmap'ing the data

Interesting, that kind of solves it all IMO i.e. if we can reduce the memory footprint like that, then running 8 workers on a 8 core system could potentially enable 200k calls/second on a single node (more below)

exactly! we would need to: refactor router (broker/dealer) to extract/collect all run-time state into plain Python and zLMDB database classes with same interface, to plug&play depending on whether router state is confined or shared

I only get those numbers (50k calls/sec) when both the caller and callee are using RawSocket+CBOR. I get 25k calls/sec when the caller is using WebSocket+CBOR (which is generally the case, in production)

yes, this is expected. also note: 50k routed wamp calls per sec equals to 200k wamp messages sent/received (4 messages through the router per wamp call .. non-progressive).

om26er commented 2 years ago

Ultimately, I'd like to make realm startup fast so that when Crossbar is restarted, it would automatically start hundreds of thousands of realms without taking too long. Might require some architectural changes, we'll see.

oberstet commented 2 years ago

fwiw, I would expect the startup speed of many realms be limited by 3 things:

speed of a single router process is limited by pypy and cpu clock
number of router workers (and rlinks) should be >= cpu cores
making the master code issue parallel realm start commands

the read speed of a mmap'ed lmdb is virtually ram speed. no bottlenecks there.

the startup speed of many concurrent python processes read mmap'ing one lmdb might also be limited by system mem alloc speed. well, the mem allocator in use will be relevant. anyways, this is esoteric -- all of this will likely don't matter in practice .. my bet

om26er commented 2 years ago

One other factor that I just found is the impact of logging on start up performance of realms. With loglevel set to warn, realms start the fastest. I was able to start 100k realms in just 135 seconds

loglevel warn: 100k realms start in 135 seconds
loglevel info: 100k realms start in 260+ seconds
log to file with log level info: 200 seconds

oberstet commented 2 years ago

I was able to start 100k realms in just 135 seconds

alright, that's already quite reasonable I'd say. like new 1k realms / sec

how did you start the realms? via master node and node management API ("start_router_realm" or sth)? did you start all realms in 1 router worker?

given N CPU core, fastest would be N router workers, and then concurrently start realm via master node API ..

if one would want to optimize beyond that, you could enable vmprof in the router workers and record/analyze where CPU is spent while spinning up realms ...

With loglevel set to warn, realms start the fastest.

ok, the logging of realm startup (and in general) is likely too noisy at the info (=default) level then I guess.

om26er commented 2 years ago

how did you start the realms? via master node and node management API ("start_router_realm" or sth)? did you start all realms in 1 router worker?

I call the start_router_realm API from an in-router component running inside the worker. And yes I currently have a single worker. To be clear, I am an not using the master node and have a custom authenticator that basically authenticates different crossbar nodes to connect to each other.

given N CPU core, fastest would be N router workers, and then concurrently start realm via master node API ..

This is interesting, even if I have let's say 8 workers, would all the router realms need to be started individually on all workers i.e. if we have 1000 application realms, we'd need to start those 1000 realms on each worker i.e. 8 thousand times

if one would want to optimize beyond that, you could enable vmprof in the router workers and record/analyze where CPU is spent while spinning up realms ...

Thanks for the hint, will look into that now

ok, the logging of realm startup (and in general) is likely too noisy at the info (=default) level then I guess.

Yes, the default logging while it is useful in some cases, can become very difficult to analyse as the number of connections and realms increase.

om26er commented 2 years ago

On the contrary if all the workers have a shared state based on zlmdb, then real parallelization could come to play i.e. all workers can try to start the realms in parallel like 125 realms for each worker.

I will try to run some numbers on the impact on memory consumption when hundred of thousands or even a million WAMP sessions are active. So we may want to expand the use of zlmdb in that case as well

oberstet commented 2 years ago

I call the start_router_realm API from an in-router component running inside the worker.

ah, cool! guess you enabled the bridging of the management API to the app realm for your component. yes, actually, this is a supported mode - with the caveats that come with it.

running a component in the node controller of the local node (which is not supported via config or such currently) would be yet another way

and finally, running a component connected to the master node holding uplink management connections from nodes

would all the router realms need to be started individually on all workers

there are 2 cases: a single node consumes less or more CPU cycles for routing than a single CPU core can provide. in the latter case, you want to run that realm in multiple router workers - and yes, in that case, each of the router workers will need to have the realm started, and also have rlinks to the other router workers operating for the same realm - this group of processes is the "router worker cluster" (how it's called in the master node ..)

Thanks for the hint, will look into that now

vmprof can be started at worker process boot time. there is a worker process CLI flag to start it. that flag can also be set via the controller worker when it starts a new worker.

there also should be an node config option .. and a run-time option in the node management API when starting workers remotely from the master

oberstet commented 2 years ago

all the workers have a shared state based on zlmdb

yes, that would make things probably both easier and more efficient.

fwiw, the group of processes that need to share run-time state (persistent or not, but via zlmdb in any case) is the "router worker cluster" started on a single node for a given realm.

for such a realm, as there might be multiple nodes hosting all the worker processes of the cluster group, we are nevertheless talking about "persistent caches" in a way ... there is no DB abstraction spanning all the zlmdb slices ..

the latter is by design: a shared backing DB for a realm is not what we want .. we want a single controlling instance (the master code) orchestrating stuff via WAMP on the nodes, in the form of cluster workers. plus the WAMP meta API & router-to-router link messaging based integration

given that last aspect, I'd really only plan for and design for persistent caches (zlmdb based) on a per-router-worker basis for the run-time state. the realm configuration and realm metadata is sth different ...

guess "it depends" ;)

om26er commented 2 years ago

there are 2 cases: a single node consumes less or more CPU cycles for routing than a single CPU core can provide. in the latter case, you want to run that realm in multiple router workers - and yes, in that case, each of the router workers will need to have the realm started, and also have rlinks to the other router workers operating for the same realm - this group of processes is the "router worker cluster" (how it's called in the master node ..)

The most dumb scenario that I can think of is the case where there are

N crossbar nodes (each with 1 router worker)
M realms (one realm per "organization")

In that case I expect all the router workers to have all the realms running and have rlinks to each other. This is because each "organization" can have any number of clients (lets say many members with mobile app etc). Since the load balancer decides which specific crossbar instance a new client is going to connect to, we'd need to make sure all instances have the same realm started.

One alternative could be to start router realms on demand i.e. when a new client connects to a router, the realm should be started (by authenticator) and at the same time rlink should be established to only those crossbar nodes that already have the given realm started/has clients of the same org connected. Of course to do that a "discovery" service needs to be created but I don't think that's too much of an effort.

oberstet commented 2 years ago

the association between the elements involved can be seen here https://github.com/crossbario/crossbar/blob/master/docs-cfx/_static/cfx-clustering-elements.pdf

this allows to cover all scenarios efficiently

The most dumb scenario

why is that interesting?

Since the load balancer decides which specific crossbar instance a new client is going to connect to, we'd need to make sure all instances have the same realm started.

of course not, that would be dumb;) the web clusters terminating incoming wamp connections of course know the relation between hosted realm and node (that is, the code in master has that info for the whole network of nodes ..)

oberstet commented 2 years ago

put differently, given

a list of realms, and
a "performance required" per realm in number of logical CPU cores, and
a list of nodes with number of CPU cores per node

one can compute optimal mappings .. and configure the nodes accordingly. actually, this should be part of the master node code ...

oberstet commented 2 years ago

M realms (one realm per "organization")

if each realm carries <50k routed RPCs/sec app traffic, and your nodes have N cores, then start N router workers (in N router cluster groups of size 1), N per node, each with own listeing transport, and
start the realms in ^ so that the summed RPCs/sec per router worker remains <50k
configure the web cluster to direct incoming connections to the worker listening and hosting the respective realm
if a realm carries >50k, but <50*N, then co-located multiple router workers in a cluster group on the same host
otherwise, you have to spread router workers out over multiple nodes, which then will also incur network traffic of course

put yet in another way, in terms of "optimal configurations":

think of the hardware resources of your set of nodes as one big pool of CPU cores each maxing out at 50k RPC/s
you can put processes (== workers) onto those CPU cores as long as the traffic they push together is less than 50k RPC/s
if "optimal" means "no overload" AND "everything used up", then you have a combinatorical optimization problem

if you plan out your cluster at this detail (and the workload stays stationary, otherwise it is all dynamic ..), then you might want to pin the exact worker-to-core association

crossbar workers support to set "cpu affinity" for this (and you could expand master code to make use of that)

crossbario / crossbar

realms: implement shared runtime state #1899