digdir / dialogporten

Dialogporten - common API and metadata state store for digital dialogs
https://docs.altinn.studio/dialogporten
MIT License

Ensure we avoid reaching max connections towards postgresql #1471

Open arealmaas opened 2 days ago

arealmaas commented 2 days ago

Introduction

After running some load tests, we seem to be hitting the maximum number of connections. https://digdir.slack.com/archives/C07R312076E/p1731587548451379

An exception has been raised that is likely due to a transient failure. 53300: remaining connection slots are reserved for azure replication users

We need a solution to ensure that we avoid hitting this limit.

Description

Each runtime (and each of its replicas) using Npgsql will by default maintain an application-level connection pool of up to 100 connections. This means that the total number of pooled connections can grow unforeseeably high, depending on the number of active replicas. There is no mechanism to avoid exceeding the max_connections setting on PostgreSQL.
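
As an illustration, here is a minimal sketch of capping the per-replica Npgsql pool via the connection string. Capping alone does not remove the problem, but it makes the worst case predictable. The host name, user, pool sizes and replica counts are hypothetical:

```csharp
using Npgsql;

// Hypothetical sizing: keep (apps) x (replicas per app) x (Max Pool Size)
// below PostgreSQL's max_connections, e.g. 3 apps x 5 replicas x 20 = 300.
var connectionString = new NpgsqlConnectionStringBuilder
{
    Host = "dialogporten-db.example.com", // hypothetical host
    Database = "dialogporten",
    Username = "app_user",
    Pooling = true,     // Npgsql pools connections by default
    MaxPoolSize = 20,   // Npgsql's default is 100 per pool, i.e. per replica
    MinPoolSize = 0
}.ConnectionString;

await using var dataSource = NpgsqlDataSource.Create(connectionString);
await using var connection = await dataSource.OpenConnectionAsync();
```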

Just bumping max_connections is not a viable solution, as PostgreSQL uses a process-per-connection model which incurs significant overhead in both CPU and RAM, due to the forking and memory footprint associated with each process (each connection consumes several MB of RAM, which PostgreSQL should rather spend on query caching etc.).

A possible solution to this problem is using a connection pooler such as PgBouncer, PgCat or Supavisor, which allows for thousands of light-weight client connections that get multiplexed over a relatively low number of "real" PostgreSQL connections. A comparison can be found here: https://tembo.io/blog/postgres-connection-poolers

It also seems that the most performant approach is to combine connection pooling at the application layer with a connection pooler proxy, to avoid incurring a network hop for each scoped DbContext. Ref: https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-pgbouncer#using-pgbouncer-with-other-connection-pools
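
A rough sketch of what that combination could look like, assuming the built-in PgBouncer on Azure Database for PostgreSQL flexible server (which listens on port 6432, while 5432 goes directly to the server); the server name and pool sizes are hypothetical:

```csharp
using Npgsql;

// Hypothetical sketch: keep a small Npgsql application-level pool per replica,
// but connect through the managed PgBouncer endpoint instead of directly to
// PostgreSQL, letting PgBouncer multiplex onto a few "real" connections.
var builder = new NpgsqlConnectionStringBuilder
{
    Host = "dialogporten-db.postgres.database.azure.com", // hypothetical server
    Port = 6432,        // built-in PgBouncer endpoint; 5432 bypasses the pooler
    Database = "dialogporten",
    Username = "app_user",
    MaxPoolSize = 20,   // small local pool; avoids a network hop per scoped DbContext
    MaxAutoPrepare = 0  // prepared statements can be problematic in transaction pooling mode
};

await using var dataSource = NpgsqlDataSource.Create(builder.ConnectionString);
```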

Another solution here could be to add a connection pooler proxy as a separate deployment per type of application. That way we reduce the bottleneck and ensure that we can scale the services horizontally without hitting the limit.

Implementation

TODO: Decide on architecture and the connection pooler solution.

What should the connection pool limit be for the applications? Does each container app job need a separate connection pool proxy, or is one for all jobs enough? Should it be possible to scale the connection pooler horizontally? How do we monitor and act on a high connection count?

### Tasks
- [ ] Implementation tasks are added here
- [ ] Prepare documentation (if relevant - either update working document, or add a new file in `docs`)
- [ ] Add e2e-test (if relevant)
### Threat modelling
- [ ] Does this change introduce any potential security issues?

Acceptance criteria

GIVEN ... WHEN .... THEN ...

GIVEN ... WHEN .... THEN ...

elsand commented 1 day ago

It appears it would be fairly easy to transition from one to the other, so I would suggest we start with PgBouncer, as this is fully supported and managed in Azure and will most likely deliver a sufficient level of scaling for now. Running any of the other non-managed options may deliver additional scaling potential, but at the cost of complexity and - according to the tests linked - higher latency. So opting for the low-complexity, low-latency solution, even though it has limitations in scale (which we don't yet know whether we need), seems like the best course of action.