dimagi / commcare-cloud

Tools for standing up and managing a CommCare HQ server environment
https://dimagi.github.io/commcare-cloud/
BSD 3-Clause "New" or "Revised" License

Lower max connections in postgres from 300 to 125 on India #6383

Open gherceg opened 1 month ago

gherceg commented 1 month ago

https://dimagi.atlassian.net/browse/SAAS-15954

We saw a spike in db connections on India that almost resulted in the instance running out of memory. We typically sit around 50 connections, and the spike reached 156, so we certainly couldn't handle 300 connections at the current machine size. I think we should decrease the max connections in this env to match what the RDS instance size can actually support.

I think because we run pgbouncer in front of the postgres instance, we are already set up so that this change won't have an impact on users. If we do reach 125 connections, a client might need to wait a little longer for a request to complete, since pgbouncer will wait for a connection to be freed up, but given we aren't hitting 125 connections regularly I don't anticipate this being an issue. We can monitor this graph to see if there is an increase in wait times.
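For anyone who wants to check wait behavior outside that graph, pgbouncer's admin console exposes per-pool counters. A minimal sketch, assuming pgbouncer listens on the default port 6432 and the connecting user is in `admin_users`/`stats_users` (host and user here are placeholders, not values from the India env):

```sh
# SHOW POOLS: cl_waiting = clients currently queued for a server connection,
# maxwait = how long (seconds) the oldest queued client has been waiting.
psql -h <pgbouncer-host> -p 6432 -U <admin-or-stats-user> pgbouncer -c "SHOW POOLS;"

# On recent pgbouncer versions, SHOW STATS also reports avg_wait_time
# (microseconds), a rough proxy for average time clients spend in the queue.
psql -h <pgbouncer-host> -p 6432 -U <admin-or-stats-user> pgbouncer -c "SHOW STATS;"
```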

Environments Affected

India

gherceg commented 4 weeks ago

Will need to run `cchq --control india ap deploy_postgres.yml --tags=pgbouncer --limit=pgbouncer` to apply this change.
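After the playbook run, a quick way to confirm what is actually in effect on both sides, sketched with placeholder hosts and users (and assuming the admin console access described above):

```sh
# Postgres side: current max_connections and whether a restart is still
# pending before the new value takes effect.
psql -h <postgres-host> -U <user> -d postgres \
  -c "SELECT setting, pending_restart FROM pg_settings WHERE name = 'max_connections';"

# pgbouncer side: per-database pool sizes it will enforce toward the backend.
psql -h <pgbouncer-host> -p 6432 -U <admin-user> pgbouncer -c "SHOW DATABASES;"
```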

millerdev commented 3 weeks ago

Do we have any way of measuring whether memory usage per connection is reasonably consistent, or whether it varies widely? If it is consistent then this seems reasonable, but if not, maybe there are circumstances where the previous limit of 300 would still be useful and safe? Or possibly the inverse, where even a lower number of connections could cause an OOM?

Seems like the ideal would be a setting that limits based on free memory (reserving enough to establish a few emergency connections) and otherwise allows as many connections as memory permits. But I'm guessing that's not possible.
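For what it's worth, per-backend memory can at least be spot-checked, though only under assumptions: the instance would need to be on PostgreSQL 14 or newer, and the connecting role would need permission to call the function (on stock postgres it is superuser-only unless explicitly granted; whether the RDS role here can run it is unverified). A hedged sketch:

```sh
# Ask each client backend to dump its memory-context usage into the server
# log (readable via the RDS log console), one entry per connection.
psql -h <postgres-host> -U <user> -d postgres -c \
  "SELECT pid, pg_log_backend_memory_contexts(pid)
     FROM pg_stat_activity
    WHERE backend_type = 'client backend';"
```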

gherceg commented 3 weeks ago

Yeah, that is a good point, and a good reminder that this change won't necessarily prevent problems we may encounter in the future. I can't say how much memory each connection held on average when we saw the spike to 156, but given the limited memory on this machine, it seems prudent to lower the limit at the very least. Admittedly that is a lot of hand waving. Based on historical trends, I don't think we need a limit higher than 125 connections, so I'm content with this change, while acknowledging that it does not guarantee we will avoid OOMs on this RDS instance in the future.
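As a rough way to compare the next spike against the usual ~50-connection baseline, a snapshot of who is holding connections can be taken directly from pg_stat_activity (host and user are placeholders):

```sh
# Count backend connections grouped by application and state, largest first.
psql -h <postgres-host> -U <user> -d postgres -c \
  "SELECT application_name, state, count(*)
     FROM pg_stat_activity
    GROUP BY application_name, state
    ORDER BY count(*) DESC;"
```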