fly-apps / postgres-flex

Postgres HA setup using repmgr
67 stars 32 forks source link

FLY_SCALE_TO_ZERO doesn't seem to work? #217

Closed lewisl9029 closed 3 months ago

lewisl9029 commented 6 months ago

I created a new single-instance postgres dev instance, downloaded the fly.toml, and added FLY_SCALE_TO_ZERO = "1h" to the env section, but it still seems to be running 24x7.

Looking at the logs, I see a bunch of lines like this one:

2023-12-21T00:49:16Z app[91857591c67e98] sjc [info]Current connection count is 1

That seem to be coming from this line, indicating the reason is there is always at least 1 connection to the db, even though this is the only app actively running in my org at the time.

I suspect this is some monitoring/replication tool making the connection? Any ideas what it is and how I could disable it for my dev instance so the scale to 0 feature actually works?

davissp14 commented 6 months ago

Can you let us know what image tag/version you're using?

fly image show
lewisl9029 commented 6 months ago

Ah oops forgot to include this. I'm using the latest tag (15.3) available on dockerhub at time of writing. Here's the fly image show output:

postgres % fly image show 
Image Details
MACHINE ID      REGISTRY                REPOSITORY              TAG     VERSION DIGEST                                                                    LABELS                                                                                               
91857591c67e98  registry-1.docker.io    flyio/postgres-flex     15.3    v0.0.46 sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6   fly.app_role=postgres_clusterfly.pg-manager=repmgrfly.pg-version=15.3-1.pgdg120+1fly.version=v0.0.46
acburdine commented 5 months ago

I was running into the same issue, dug into it a bit more and I think I figured out what's happening.

in this commit the logic was changed from "current connections > 1" to "current connections >= 1". I'm unsure as to why that was done - but either way I think I know what the fix is.

Currently the monitor is running the query:

select count(*) from pg_stat_activity where usename != 'repmgr' and usename != 'flypgadmin' and backend_type = 'client backend';

this filters out most queries, but what is specifically does not filter out is itself - the query it runs to get the count of the active pg_stats also includes itself, which means the count is always going to be at least 1. I verified this by forking this repo, tweaking that select query to output the query field from pg_stat_activity, and it showed its own query in the returned rows.

I'm happy to submit a fix for this, but I'm not entirely sure what the best fix should be. The easiest solution (imo) would be to revert the commit I linked above (changing current >= 1 back to current > 1) or to add return current - 1 in the getCurrentConnCount function, but without understanding the decision behind the earlier change I'm not sure if that will have other effects.

Another solution, albeit more complicated, could be to set a fixed application_name connection parameter for the local pg connections that we can then use in said query to filter itself out of the results.

mwcampbell commented 3 months ago

I wonder if we actually still need FLY_SCALE_TO_ZERO now that the platform itself can auto-stop machines with no network activity. Does that method of stopping the machine, presumably by sending the (configurable) kill signal, do a graceful shutdown of Postgres?

swilson-fly commented 3 months ago

Hey folks! A fix for this has been released as of image version flyio/postgres-flex:15.6 (v0.0.51). Running fly image update and updating your cluster to use the new image should have scale_to_zero working as expected again.

For postgres apps, we really recommend using the baked in scale to zero functionality over the proxy autostop. The FLY_SCALE_TO_ZERO method cleanly shuts down your postgres instance. With the proxy autostop we've seen cases where it doesn't play as nicely with postgres, and clusters end up in a weird state.