High Availability and Scaling setup

matrixbot commented 3 weeks ago

This issue was originally created by @genofire at https://github.com/matrix-org/dendrite/issues/2975.

Now that polylith mode is gone, is it still possible to scale Dendrite and run it in a high-availability setup?

Is it possible to run multiple Dendrite instances connected to the same database?

matrixbot commented 3 weeks ago

This comment was originally posted by @S7evinK at https://github.com/matrix-org/dendrite/issues/2975#issuecomment-1435163844.

Dendrite wasn't able to run in "HA" mode to begin with, as in you couldn't have more than one component of the same type running at the same time. Dendrite will still be more performant than Synapse for small user deployments. (While we have anonymous usage stats, we currently don't know if there are huge, e.g. >1k users, Dendrite deployments in the wild and how they are performing)

Going to quote @kegsay for the reason behind this change here:

hey folks, after much discussion we've finally decided on a direction for dendrite, instead of being constantly tugged between embedded/p2p and massively-scalable deployments. We've ultimately decided to go down the embedded/p2p route, and will be making changes over the next few months to reflect this new reality. This is a significant change: probably the biggest one since we moved from Kafka to NATS. The ramifications include:

polylith mode will be removed from the project, including internal HTTP calls.

component databases will change to take advantage of monolith mode: for example we currently store every event twice, once in roomserver and once in syncapi. This will be optimised so we only store it once.

we want to make dendrite more modular: running in embedded mode should not attach appservice code for example, ideally not even build it to keep binary sizes low.

we will be adding runtime/trace support, which makes it significantly easier to debug performance bottlenecks in single processes. We're aware performance is an issue, and having tracing support here will be a game changer in allowing us to see if it's due to bad SQL (it usually is..), GC, or poor O(n^2)+ algorithms in code.

"internal API"-like functions are now as cheap as regular function calls, so code which previously assumed this was expensive and tried to minimise these calls will be re-evaluated. This is a big change for dendrite devs, as it's a huge change in thinking.

the directory structure of the project will be revised, given we no longer have components. E.g where should a shared events database live when both roomserver/syncapi need it.

Our use of NATS will be re-evaluated, and it may be removed from the project. Many of its benefits came from being able to seamlessly provide a message queue for both mono and polylith modes: but now it's increasingly a liability and source of bugs (jetstream directory bloat, random "timed out sending to NATS" when used in P2P, etc).
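To illustrate the "store every event only once" point in the list above, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`events`, `sync_stream`) and the helper functions are hypothetical and are not Dendrite's actual schema or code; the sketch only shows the general idea of a single shared event table that a sync-side index references by event ID instead of keeping its own copy.

```python
import sqlite3


def create_schema(conn):
    """Create one shared event table plus a sync index that only references it."""
    conn.executescript(
        """
        -- Hypothetical shared table: the event JSON is stored exactly once.
        CREATE TABLE events (
            event_id TEXT PRIMARY KEY,
            room_id  TEXT NOT NULL,
            json     TEXT NOT NULL
        );
        -- Hypothetical sync-side table: a stream position and a reference,
        -- with no second copy of the event body.
        CREATE TABLE sync_stream (
            stream_pos INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id   TEXT NOT NULL UNIQUE REFERENCES events(event_id)
        );
        """
    )


def store_event(conn, event_id, room_id, event_json):
    """Store the event once and index it for sync in a single transaction.

    Returns the new stream position, or None if the event was already stored.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (event_id, room_id, json) VALUES (?, ?, ?)",
            (event_id, room_id, event_json),
        )
        if cur.rowcount == 0:
            return None  # duplicate: nothing new to index
        cur = conn.execute(
            "INSERT INTO sync_stream (event_id) VALUES (?)", (event_id,)
        )
        return cur.lastrowid


def events_since(conn, since_pos):
    """Return (stream_pos, event_id, json) rows after the given stream position."""
    rows = conn.execute(
        """
        SELECT s.stream_pos, e.event_id, e.json
        FROM sync_stream AS s
        JOIN events AS e ON e.event_id = s.event_id
        WHERE s.stream_pos > ?
        ORDER BY s.stream_pos
        """,
        (since_pos,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    store_event(conn, "$first", "!room:example.org", '{"type": "m.room.message"}')
    print(events_since(conn, 0))
```

Because both writes happen in one transaction in one process, the event table and the sync index cannot drift apart, which is the kind of simplification a monolith-only design allows.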

We will also be taking this time to land a few unrelated but also likely breaking changes, some of which came from folks here who I met at FOSDEM, including but not limited to:

Config YAML: All secrets will now be referred to by file path, and not hard-coded in the YAML. This will make systemd and k8s deployments easier, as secrets can be mounted to files and the YAML needn't be secret. The YAML will also go through a major version bump to strip out polylith sections and rejig sections entirely given components like roomserver don't really have any meaning anymore. We'll likely have a migration script which can automatically map from v2 to v3 format wise.

Registration/Login: we are looking into adding native OIDC support - this keeps the Dendrite codebase small and maintainable in this area, whilst providing more options for server admins to add things like SSO which we are aware is a sore point currently. Basic password login/registration will remain as a simple way to provide accounts for users, and keeps sytest/complement happy, but anything more complex than that and we'll be looking to OIDC for answers.

We will be keeping both Postgres/SQLite support, even though we won't use Postgres in embedded scenarios. The maintenance burden for us here is significant, but postgres' performance, coupled with the fact that we've basically told people to run using postgres and actively pushed people to do so, mean we will maintain support for it as a first class citizen.

We're aiming to land as much of this as possible over the next few months, incrementally. Breaking changes from a server admin's pov will be kept to a minimum and will be associated with a version bump.
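As an illustration of the "Config YAML" item above (secrets referred to by file path rather than hard-coded), here is a small Python sketch. It is not Dendrite's implementation or config format: the `_file` key suffix, the key names, and both helper functions are assumptions made up for this example. The config is represented as an already-parsed dictionary so that no YAML library is needed.

```python
from pathlib import Path


def read_secret(path):
    """Return the secret stored in the file at `path`, without trailing newlines."""
    secret = Path(path).read_text(encoding="utf-8").rstrip("\r\n")
    if not secret:
        raise ValueError(f"secret file {path} is empty")
    return secret


def resolve_secrets(config, suffix="_file"):
    """Return a copy of `config` with file-path keys replaced by file contents.

    A key such as `shared_secret_file` (hypothetical) becomes `shared_secret`,
    holding the contents of the referenced file. Nested dicts are handled
    recursively; the input dict is not modified.
    """
    resolved = {}
    for key, value in config.items():
        if isinstance(value, dict):
            resolved[key] = resolve_secrets(value, suffix)
        elif key.endswith(suffix) and isinstance(value, str):
            resolved[key[: -len(suffix)]] = read_secret(value)
        else:
            resolved[key] = value
    return resolved
```

With this shape, the config file itself contains only paths, so a Kubernetes Secret or a systemd credential can be mounted as a file and the config no longer needs to be treated as sensitive.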

matrixbot commented 3 weeks ago

This comment was originally posted by @genofire at https://github.com/matrix-org/dendrite/issues/2975#issuecomment-1439573082.

If you count deactivated accounts (created after registration without captcha ...), here is a server with over 1k accounts:

dendrite_sum7=# select count(*) from userapi_accounts;
 count
-------
  3374
(1 row)

dendrite_sum7=# select count(*) from userapi_daily_visits ;
 count
-------
  1630
(1 row)

dendrite_sum7=# SELECT
    pg_database.datname,
    pg_size_pretty(pg_database_size(pg_database.datname)) AS size
    FROM pg_database ORDER BY pg_database_size(pg_database.datname) DESC;
      datname       |  size
--------------------+---------
 dendrite_sum7      | 30 GB

(see #2464)

Until you release v1.0.0, you will not find any huge servers ....

For a zero-downtime update process in Kubernetes, it would be nice to be able to run two Dendrite instances at the same time (on the same database) ...
