canonical / postgresql-operator

A Charmed Operator for running PostgreSQL on machines
https://charmhub.io/postgresql
Apache License 2.0

Clustered postgresql charm does not update its relations when the Primary unit changes. #363

Closed: claudiubelu closed this issue 6 months ago

claudiubelu commented 7 months ago

Steps to reproduce

Currently, we're using the postgresql charm (number of units: 3, channel: 14/stable, rev: 351) in relation with the landscape-server charm. Everything works fine until the postgresql Primary unit changes. From then on, the related application crashes, as it is now connected to a read-only PostgreSQL database. Note the error storm.database.ReadOnlySqlTransaction: cannot execute UPDATE in a read-only transaction below:

Feb 20 08:16:08 landscape-2 message-server-1[1914]: Got error while handling message
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 47, in work
    task()
  File "/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 182, in doWork
    task()
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 244, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 260, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
  File "/usr/lib/python3/dist-packages/storm/twisted/transact.py", line 78, in _wrap
    result = function(*args, **kwargs)
  File "/opt/canonical/landscape/canonical/landscape/message/resource.py", line 136, in _handle_request
    result = self._message_system.dispatch(request)
  File "/opt/canonical/landscape/canonical/message/system.py", line 73, in dispatch
    return message_api.run()
  File "/opt/canonical/landscape/canonical/landscape/message/apis.py", line 265, in run
    self.transaction.commit()
  File "/usr/lib/python3/dist-packages/transaction/_manager.py", line 257, in commit
    return self.manager.commit()
  File "/usr/lib/python3/dist-packages/transaction/_manager.py", line 135, in commit
    return self.get().commit()
  File "/usr/lib/python3/dist-packages/transaction/_transaction.py", line 282, in commit
    reraise(t, v, tb)
  File "/usr/lib/python3/dist-packages/transaction/_compat.py", line 45, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/transaction/_transaction.py", line 273, in commit
    self._commitResources()
  File "/usr/lib/python3/dist-packages/transaction/_transaction.py", line 465, in _commitResources
    reraise(t, v, tb)
  File "/usr/lib/python3/dist-packages/transaction/_compat.py", line 45, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/transaction/_transaction.py", line 437, in _commitResources
    rm.tpc_begin(self)
  File "/usr/lib/python3/dist-packages/storm/zope/zstorm.py", line 298, in tpc_begin
    self._store.flush()
  File "/usr/lib/python3/dist-packages/storm/store.py", line 511, in flush
    self._flush_one(obj_info)
  File "/usr/lib/python3/dist-packages/storm/store.py", line 568, in _flush_one
    self._connection.execute(expr, noresult=True)
  File "/usr/lib/python3/dist-packages/storm/databases/postgres.py", line 306, in execute
    return Connection.execute(self, statement, params, noresult)
  File "/usr/lib/python3/dist-packages/storm/database.py", line 323, in execute
    raw_cursor = self.raw_execute(statement, params)
  File "/usr/lib/python3/dist-packages/storm/database.py", line 457, in raw_execute
    self._run_execution(raw_cursor, args, params, statement)
  File "/usr/lib/python3/dist-packages/storm/database.py", line 471, in _run_execution
    self._check_disconnect(raw_cursor.execute, *args)
  File "/usr/lib/python3/dist-packages/storm/database.py", line 537, in _check_disconnect
    return function(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/storm/database.py", line 186, in wrapper
    with wrap_exceptions(self._database):
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/lib/python3/dist-packages/storm/exceptions.py", line 184, in wrap_exceptions
    six.raise_from(wrapped.with_traceback(tb), e)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/storm/exceptions.py", line 165, in wrap_exceptions
    yield
  File "/usr/lib/python3/dist-packages/storm/database.py", line 187, in wrapper
    return attr(*args, **kwargs)
storm.database.ReadOnlySqlTransaction: cannot execute UPDATE in a read-only transaction

Feb 20 08:16:08 landscape-2 message-server-1[1914]: "10.252.20.68" - - [20/Feb/2024:08:16:08 +0000] "POST /message-system HTTP/1.1" 500 - "-" "landscape-client/18.01-0ubuntu13"

Or in one of its other services, the Package Search daemon:

-search 2024/02/20 08:28:36 Starting up, revision
-search 2024/02/20 08:28:36 pq: cannot set transaction read-write mode during recovery
search.service: Main process exited, code=exited, status=1/FAILURE
search.service: Failed with result 'exit-code'.
search.service: Consumed 1.150s CPU time.
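
For context, any write issued against a PostgreSQL hot standby fails exactly this way, so the symptom is consistent with the application being pointed at a replica. A minimal reproduction sketch using psycopg2 (the host is the stale "master" from the relation data below; the table name is hypothetical and the password is redacted):

import psycopg2
from psycopg2 import errors  # requires psycopg2 >= 2.8

# Connect to what the relation data still advertises as the master,
# which after the switchover is actually a hot standby.
conn = psycopg2.connect(
    host="10.252.20.58",
    port=5432,
    dbname="landscape-server",
    user="relation-10",
    password="<redacted>",
)
try:
    with conn.cursor() as cur:
        # Any write statement is rejected on a standby, mirroring the
        # storm/pq errors in the logs above. The table is made up.
        cur.execute("UPDATE some_table SET some_column = 1")
    conn.commit()
except errors.ReadOnlySqlTransaction as exc:
    print(f"write rejected by standby: {exc}")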

After some investigation, we've seen the following landscape-server relation data:

juju show-unit -m landscape-maas landscape-server/0
landscape-server/0:
  ...
  - relation-id: 10
    endpoint: db
    related-endpoint: db-admin
    application-data: {}
    related-units:
      landscape-postgresql/0:
        in-scope: true
        data:
          allowed-subnets: 10.252.20.109/32,10.252.20.27/32,10.252.20.66/32
          allowed-units: landscape-server/0 landscape-server/1 landscape-server/2
          database: landscape-server
          egress-subnets: 10.252.20.58/32
          host: 10.252.20.58
          ingress-address: 10.252.20.58
          master: dbname=landscape-server host=10.252.20.58 password=removed
            port=5432 user=relation-10
          password: removed
          port: "5432"
          private-address: 10.252.20.58
          schema_password: removed
          schema_user: relation-10
          standbys: dbname=landscape-server host=10.252.20.54 password=removed
            port=5432 user=relation-10,dbname=landscape-server host=10.252.20.92 password=removed
            port=5432 user=relation-10
          state: master
          user: relation-10
          version: "14.9"
      landscape-postgresql/1:
        in-scope: true
        data:
          egress-subnets: 10.252.20.92/32
          ingress-address: 10.252.20.92
          private-address: 10.252.20.92
      landscape-postgresql/2:
        in-scope: true
        data:
          egress-subnets: 10.252.20.54/32
          ingress-address: 10.252.20.54
          private-address: 10.252.20.54
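
As an aside, the master and standbys values are libpq key/value connection strings (standbys is a comma-separated list of them). A simplified sketch of how a requirer might parse them, purely for illustration (it assumes no quoted or comma-containing values, and is not the landscape-server charm's actual code):

def parse_conninfo(conninfo: str) -> dict:
    """Parse a libpq 'key=value key=value ...' string into a dict.

    Simplified: assumes values contain no spaces, quotes, or commas.
    """
    return dict(item.split("=", 1) for item in conninfo.split())

# Values copied from the relation data above (password omitted).
master = parse_conninfo(
    "dbname=landscape-server host=10.252.20.58 port=5432 user=relation-10"
)
standbys = [
    parse_conninfo(s)
    for s in (
        "dbname=landscape-server host=10.252.20.54 port=5432 user=relation-10,"
        "dbname=landscape-server host=10.252.20.92 port=5432 user=relation-10"
    ).split(",")
]
print(master["host"])                  # 10.252.20.58 (stale, see below)
print([s["host"] for s in standbys])   # ['10.252.20.54', '10.252.20.92']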

As can be seen from the relation data, the advertised master is "master: dbname=landscape-server host=10.252.20.58". However, that host is no longer the Primary. If we check the model, we see this:

juju status -m landscape-maas landscape-postgresql
Model           Controller            Cloud/Region  Version  SLA          Timestamp
landscape-maas  juju_maas_controller  maas_cloud    3.3.1    unsupported  08:40:49Z

App                   Version  Status  Scale  Charm             Channel    Rev  Exposed  Message
landscape-client               active      3  landscape-client  stable      69  no       Client registered!
landscape-postgresql  14.9     active      3  postgresql        14/stable  351  no
ubuntu-advantage               active      3  ubuntu-advantage  stable      71  no       Attached (esm-infra,livepatch)

Unit                     Workload  Agent  Machine  Public address  Ports     Message
landscape-postgresql/0*  active    idle   5        10.252.20.58    5432/tcp
  landscape-client/1     active    idle            10.252.20.58              Client registered!
  ubuntu-advantage/2     active    idle            10.252.20.58              Attached (esm-infra,livepatch)
landscape-postgresql/1   active    idle   7        10.252.20.92    5432/tcp  Primary
  landscape-client/7     active    idle            10.252.20.92              Client registered!
  ubuntu-advantage/7     active    idle            10.252.20.92              Attached (esm-infra,livepatch)
landscape-postgresql/2   active    idle   2        10.252.20.54    5432/tcp
  landscape-client/0*    active    idle            10.252.20.54              Client registered!
  ubuntu-advantage/0*    active    idle            10.252.20.54              Attached (esm-infra,livepatch)

Machine  State    Address       Inst id         Base          AZ     Message
2        started  10.252.20.54  landscapesql-1  ubuntu@22.04  zone1  Deployed
5        started  10.252.20.58  landscapesql-2  ubuntu@22.04  zone2  Deployed
7        started  10.252.20.92  landscapesql-3  ubuntu@22.04  zone3  Deployed

Offer             Application       Charm             Rev  Connected  Endpoint               Interface             Role
landscape-server  landscape-server  landscape-server  107  1/1        application-dashboard  register-application  requirer
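
PostgreSQL itself can also confirm which host is really the primary, independently of the relation data: pg_is_in_recovery() returns true on hot standbys and false on the primary. A minimal check, sketched with psycopg2 against the hosts above (password redacted):

import psycopg2

# The three cluster members from the juju status output above.
for host in ("10.252.20.58", "10.252.20.92", "10.252.20.54"):
    conn = psycopg2.connect(
        host=host,
        port=5432,
        dbname="landscape-server",
        user="relation-10",
        password="<redacted>",
    )
    with conn.cursor() as cur:
        # True on hot standbys, False on the primary.
        cur.execute("SELECT pg_is_in_recovery()")
        in_recovery = cur.fetchone()[0]
    conn.close()
    print(f"{host}: {'standby' if in_recovery else 'PRIMARY'}")

# In this deployment 10.252.20.92 reports PRIMARY, while the relation
# data still advertises 10.252.20.58 as the master.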

Checking the landscape-server units, we can see that they are still configured to use what is now a Secondary PostgreSQL server:

cat /etc/landscape/service.conf
[stores]
user = landscape
password = removed
host = 10.252.20.58:5432

This causes the Landscape server to crash, resulting in "System errors" when accessing its dashboard and in clients failing to register.

Expected behavior

When a new Primary is elected, the postgresql charm should update the "master" field in its relation data, so that the requirer charms can reconfigure themselves accordingly.
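
For illustration only, not the charm's actual implementation: with the ops framework, the provider side could republish the master and standbys keys of every legacy db relation whenever it observes a primary switch. In the sketch below, the update-status trigger and the _current_primary_conninfo()/_standby_conninfos() helpers are hypothetical stand-ins for the charm's Patroni integration:

import ops

class PostgreSQLCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Hypothetical trigger: re-check the primary on every update-status.
        # The real charm would react to whatever event signals a Patroni
        # primary switch.
        self.framework.observe(self.on.update_status, self._update_db_relations)

    def _update_db_relations(self, event: ops.EventBase) -> None:
        master = self._current_primary_conninfo()
        standbys = self._standby_conninfos()
        for relation in self.model.relations["db"]:
            data = relation.data[self.unit]
            if data.get("master") != master:
                # The primary moved: republish the connection strings so
                # requirers (e.g. landscape-server) can reconfigure.
                data["master"] = master
                data["standbys"] = ",".join(standbys)

    def _current_primary_conninfo(self) -> str:
        # Hypothetical: in reality this would come from Patroni cluster state.
        return "dbname=landscape-server host=10.252.20.92 port=5432 user=relation-10"

    def _standby_conninfos(self) -> list[str]:
        # Hypothetical: likewise derived from Patroni cluster state.
        return [
            "dbname=landscape-server host=10.252.20.58 port=5432 user=relation-10",
            "dbname=landscape-server host=10.252.20.54 port=5432 user=relation-10",
        ]

The essential point is simply that the unit databag is rewritten whenever the primary changes, which fires relation-changed on the requirer side and lets it pick up the new master.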

Actual behavior

Relation data remains the same, as seen above.

Versions

Operating system: Ubuntu 22.04.4 LTS

Juju CLI: 3.3.1-genericlinux-amd64

Juju agent: 3.3.1

Charm revision: channel 14/stable, revision 351

LXD: 5.0.3

Log output

Juju debug log:

Additional context

github-actions[bot] commented 7 months ago

https://warthogs.atlassian.net/browse/DPE-3616

dragomirp commented 7 months ago

Hi @claudiubelu, I was able to replicate this locally on revision 351, but not on the latest 14/stable (rev. 363). There were issues with Juju secrets on 351 that I believe are the root cause of this. Can you retry with rev. 363?

taurus-forever commented 6 months ago

Dear @claudiubelu, revision 363 is now in 14/stable. I am closing this issue due to the lack of a reply to the message above.

Feel free to reopen if you are still experiencing this issue. Thank you for the bug report!