canonical / postgresql-operator

A Charmed Operator for running PostgreSQL on machines
https://charmhub.io/postgresql
Apache License 2.0

changing a few config options breaks the charm #360

Closed nishant-dash closed 6 months ago

nishant-dash commented 7 months ago

Steps to reproduce

  1. Deploy the attached Landscape bundle
  2. Add a few hundred Landscape clients
  3. Wait for things to settle
  4. Change the charm config:
    juju config postgresql memory_shared_buffers=3128 memory_max_prepared_transactions=500

Here's the bundle I used (the YAML is attached as a txt file): landscape-bundle.txt

Expected behavior

For things to work

Actual behavior

The charm seems to break down completely.

Units are stuck executing hooks, like so (longer trace: landscape-trace.txt):

15 Feb 2024 17:06:38Z  workload   active       
15 Feb 2024 17:06:39Z  juju-unit  executing    running commands
15 Feb 2024 17:06:40Z  juju-unit  executing    running restart-relation-changed hook
15 Feb 2024 17:06:41Z  juju-unit  executing    running restart-relation-changed hook for postgresql/1
15 Feb 2024 17:06:42Z  juju-unit  executing    running database-peers-relation-changed hook for postgresql/1
15 Feb 2024 17:06:53Z  workload   waiting      Awaiting restart operation
15 Feb 2024 17:06:54Z  workload   active      

Versions

Operating system: DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"

Juju CLI: 2.9.46

Juju agent: 2.9.46

Charm revision: 351

postgresql        14.9     maintenance    0/3  postgresql        14/stable      351  no       Beginning rolling restart

LXD:

Log output

Juju debug log:

unit-postgresql-0: 17:00:47 ERROR unit.postgresql/0.juju-log database-peers:4: Failed to get PostgreSQL version: connection to server at "10.130.13.84", port 5432 failed: FATAL:  the database system is shutting down

unit-postgresql-0: 17:00:47 ERROR unit.postgresql/0.juju-log database-peers:4: 
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 393, in get_postgresql_version
    with self._connect_to_database() as connection, connection.cursor() as cursor:
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 115, in _connect_to_database
    connection = psycopg2.connect(
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "<REDACTED>", port 5432 failed: FATAL:  the database system is shutting down

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-0/charm/src/relations/db.py", line 291, in update_endpoints
    postgresql_version = self.charm.postgresql.get_postgresql_version()
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 399, in get_postgresql_version
    raise PostgreSQLGetPostgreSQLVersionError()
charms.postgresql_k8s.v0.postgresql.PostgreSQLGetPostgreSQLVersionError
unit-postgresql-0: 17:00:47 INFO juju.worker.uniter.operation ran "database-peers-relation-changed" hook (via hook dispatching script: dispatch)
unit-postgresql-0: 17:00:48 INFO unit.postgresql/0.juju-log Cluster topology changed
unit-postgresql-0: 17:00:48 ERROR unit.postgresql/0.juju-log Failed to get PostgreSQL version: connection to server at "10.130.13.84", port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

unit-postgresql-0: 17:00:48 ERROR unit.postgresql/0.juju-log 
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 393, in get_postgresql_version
    with self._connect_to_database() as connection, connection.cursor() as cursor:
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 115, in _connect_to_database
    connection = psycopg2.connect(
  File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "<REDACTED>", port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-0/charm/src/relations/db.py", line 291, in update_endpoints
    postgresql_version = self.charm.postgresql.get_postgresql_version()
  File "/var/lib/juju/agents/unit-postgresql-0/charm/lib/charms/postgresql_k8s/v0/postgresql.py", line 399, in get_postgresql_version
    raise PostgreSQLGetPostgreSQLVersionError()
charms.postgresql_k8s.v0.postgresql.PostgreSQLGetPostgreSQLVersionError
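
The paired tracebacks above are consistent with Python's implicit exception chaining: raising a new exception inside an `except` block records the original one as `__context__`, which is what prints "During handling of the above exception, another exception occurred". A minimal sketch of that pattern follows. The classes and functions are stand-ins named after what appears in the log, not the real psycopg2 or charm library code.

```python
class OperationalError(Exception):
    """Stand-in for psycopg2.OperationalError (assumption for illustration)."""


class PostgreSQLGetPostgreSQLVersionError(Exception):
    """Stand-in for the charm library exception named in the traceback."""


def connect_to_database():
    # Simulates the server refusing connections while it restarts.
    raise OperationalError("the database system is shutting down")


def get_postgresql_version():
    try:
        connect_to_database()
    except OperationalError:
        # Raising inside `except` chains the original error as __context__.
        raise PostgreSQLGetPostgreSQLVersionError()


def describe_failure():
    """Return (outer exception name, underlying exception name)."""
    try:
        get_postgresql_version()
    except PostgreSQLGetPostgreSQLVersionError as err:
        return type(err).__name__, type(err.__context__).__name__
```

In this sketch the outer error hides the fact that the underlying failure is a connection error during the restart, so a caller that does not catch the outer exception would log it as a hook error even though the condition is temporary.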

Additional context

github-actions[bot] commented 7 months ago

https://warthogs.atlassian.net/browse/DPE-3593

nishant-dash commented 7 months ago

Wanted to add that at this point Landscape (which uses pgsql) is down with server errors.

nishant-dash commented 7 months ago

unsetting the config options gets the postgresql charm into a "working" state again, but landscape is still down after this fiasco

after a while landscape seems to be ok

dragomirp commented 7 months ago

I'm unable to reproduce locally with latest 14/edge (rev. 378) on Juju 2.9.46. Please recheck.

nishant-dash commented 7 months ago

hi @dragomirp what about revision 351?

dragomirp commented 7 months ago

Hi, @nishant-dash. No, I'm unable to replicate locally for either 351 or 363 (latest 14/stable). What I'm deploying is the landscape-scalable bundle against a 3-unit deployment of the charm with the config options set (memory_shared_buffers=3128 memory_max_prepared_transactions=500).

dragomirp commented 7 months ago

I think the error happens transiently, when the charm is checking configuration values against the database (request_time_zone and instance_default_text_search_config) and cannot connect to the DB at that moment. We should most probably skip the check when the values are the defaults (currently the charm checks for None, but the defaults would be set) and defer if the database is not available.

dragomirp commented 6 months ago

Latest 14/edge (Rev. 380) should check configs against the database only on config change and defer if the database is not available.
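
The behaviour described in this thread (skip options left at their defaults, and defer when the database cannot be reached) could be sketched roughly as below. This is not the charm's actual code: `FakeEvent`, `DatabaseUnavailableError`, `handle_config_changed` and the `fetch_valid_values` callable are hypothetical names introduced for illustration, and the event class only imitates the `defer()` method of a real charm event.

```python
class DatabaseUnavailableError(Exception):
    """Hypothetical error raised when the database cannot be reached."""


class FakeEvent:
    """Stand-in for a charm event; only tracks whether defer() was called."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def handle_config_changed(event, config, defaults, fetch_valid_values):
    """Validate user-set options against the database.

    `config` and `defaults` map option names to values.
    `fetch_valid_values(option)` is a hypothetical callable that queries the
    database for allowed values, or raises DatabaseUnavailableError.
    Returns the list of invalid option names, or None if the event was deferred.
    """
    invalid = []
    for option, value in config.items():
        # Unset or default values need no database round trip.
        if value is None or value == defaults.get(option):
            continue
        try:
            allowed = fetch_valid_values(option)
        except DatabaseUnavailableError:
            # The database is restarting or down: retry later instead of failing.
            event.defer()
            return None
        if value not in allowed:
            invalid.append(option)
    return invalid
```

With this shape, a restart triggered by changing unrelated options such as memory_shared_buffers would not touch the database at all, as long as the validated options are still at their defaults.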