canonical / postgresql-operator

A Charmed Operator for running PostgreSQL on machines
https://charmhub.io/postgresql
Apache License 2.0

Charm installation fails with hook failed: "local-monitors-relation-changed" #102

Closed by Sponge-Bas 1 year ago

Sponge-Bas commented 1 year ago

In test run 28fcfc8b-9180-48ed-9d51-7613538d11e9, we installed a Landscape bundle, but the postgresql charm failed with the following status:

landscape-postgresql/0*       error        idle   2         10.246.166.55                                            hook failed: "local-monitors-relation-changed"
  canonical-livepatch/9       active       idle             10.246.166.55                                            Running kernel 5.15.0-69.76-generic, patchState: nothing-to-apply (source version/commit f1e83ae)
  filebeat/9                  active       idle             10.246.166.55                                            Filebeat ready.
  landscape-client/9          maintenance  idle             10.246.166.55                                            Need computer-title and juju-info to proceed
  logrotated/9                active       idle             10.246.166.55                                            Unit is ready.
  nrpe/9                      active       idle             10.246.166.55   icmp,5666/tcp                            Ready
  ntp/9                       active       idle             10.246.166.55   123/udp                                  chrony: Ready, OK: offset is 0.000054
  telegraf/9                  active       idle             10.246.166.55   9103/tcp                                 Monitoring landscape-postgresql/0 (source version/commit 23.01-4-...)
landscape-rabbitmq-server/0*  waiting      idle   7         10.246.166.155  5672/tcp,15672/tcp                       Not reached target cluster-partition-handling mode
  canonical-livepatch/3       active       idle             10.246.166.155                                           Running kernel 5.15.0-69.76-generic, patchState: nothing-to-apply (source version/commit f1e83ae)
  filebeat/3                  active       idle             10.246.166.155                                           Filebeat ready.
  landscape-client/3          maintenance  idle             10.246.166.155                                           Need computer-title and juju-info to proceed
  logrotated/3                active       idle             10.246.166.155                                           Unit is ready.
  nrpe/3                      active       idle             10.246.166.155  icmp,5666/tcp                            Ready
  ntp/4                       active       idle             10.246.166.155  123/udp                                  chrony: Ready, OK: offset is 0.000053
  telegraf/3                  active       idle             10.246.166.155  9103/tcp                                 Monitoring landscape-rabbitmq-server/0 (source version/commit 23.01-4-...)
landscape-server-haproxy/0*   active       idle   5         10.246.165.90   80/tcp,443/tcp                           Unit is ready
  canonical-livepatch/1       active       idle             10.246.165.90                                            Running kernel 5.4.0-146.163-generic, patchState: nothing-to-apply (source version/commit f1e83ae)
  filebeat/1                  active       idle             10.246.165.90                                            Filebeat ready.
  landscape-client/1          maintenance  idle             10.246.165.90                                            Need computer-title and juju-info to proceed
  logrotated/1                active       idle             10.246.165.90                                            Unit is ready.
  nrpe/1                      active       idle             10.246.165.90   icmp,5666/tcp                            Ready
  ntp/1                       active       idle             10.246.165.90   123/udp                                  chrony: Ready, OK: offset is 0.000184
  telegraf/1                  active       idle             10.246.165.90   9103/tcp                                 Monitoring landscape-server-haproxy/0 (source version/commit 23.01-4-...)
landscape-server/0*           waiting      idle   6         10.246.167.157                                           Waiting on relations: db
  canonical-livepatch/11      active       idle             10.246.167.157                                           Running kernel 5.15.0-69.76-generic, patchState: nothing-to-apply (source version/commit f1e83ae)
  filebeat/11                 active       idle             10.246.167.157                                           Filebeat ready.
  landscape-client/11         maintenance  idle             10.246.167.157                                           Need computer-title and juju-info to proceed
  logrotated/11               active       idle             10.246.167.157                                           Unit is ready.
  nrpe/11                     active       idle             10.246.167.157  icmp,5666/tcp                            Ready
  ntp/11                      active       idle             10.246.167.157  123/udp                                  chrony: Ready, OK: offset is 0.000163
  telegraf/11                 active       idle             10.246.167.157  9103/tcp                                 Monitoring landscape-server/0 (source version/commit 23.01-4-...)

In the debug-log we see:

unit-landscape-postgresql-0: 09:34:36 INFO unit.landscape-postgresql/0.juju-log local-monitors:63: Setting charm primary status True
unit-landscape-postgresql-0: 09:34:36 ERROR unit.landscape-postgresql/0.juju-log local-monitors:63: Hook error:
Traceback (most recent call last):
  File "/usr/lib/python3.10/shutil.py", line 815, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpx8jh2efa' -> '/var/lib/postgresql/scripts/find_latest_ready_wal.py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-landscape-postgresql-0/.venv/lib/python3.10/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-landscape-postgresql-0/.venv/lib/python3.10/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-landscape-postgresql-0/.venv/lib/python3.10/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-landscape-postgresql-0/.venv/lib/python3.10/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-landscape-postgresql-0/charm/reactive/postgresql/nagios.py", line 117, in update_nrpe_config
    helpers.write(check_script_path, check_script, mode=0o755)
  File "/var/lib/juju/agents/unit-landscape-postgresql-0/charm/reactive/postgresql/helpers.py", line 75, in write
    shutil.move(f.name, path)
  File "/usr/lib/python3.10/shutil.py", line 835, in move
    copy_function(src, real_dst)
  File "/usr/lib/python3.10/shutil.py", line 434, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.10/shutil.py", line 256, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/postgresql/scripts/find_latest_ready_wal.py'

Maybe the /var/lib/postgresql/scripts/ directory needs to be created first? I'm not sure why the install succeeds in some cases.
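To illustrate the suggestion, here is a minimal sketch of a write helper that creates the destination directory before moving the temporary file into place. It is not the charm's actual helpers.write, whose full body is not shown in the traceback. The function name, the default mode, and the choice to create the temporary file next to the destination are all assumptions made for illustration.

```python
import os
import shutil
import tempfile


def write(path, content, mode=0o640):
    """Hypothetical stand-in for the charm's helpers.write (an assumption)."""
    parent = os.path.dirname(os.path.abspath(path))
    # Create the destination directory if it is missing. Without this,
    # shutil.move fails with FileNotFoundError, as in the traceback above.
    os.makedirs(parent, exist_ok=True)

    if isinstance(content, str):
        content = content.encode("utf-8")

    # Write to a temporary file in the same directory so the final move
    # is a rename on the same filesystem rather than a cross-device copy.
    with tempfile.NamedTemporaryFile(mode="wb", dir=parent, delete=False) as f:
        f.write(content)
        tmp_name = f.name

    os.chmod(tmp_name, mode)
    shutil.move(tmp_name, path)
```

With a helper like this, a call such as write("/var/lib/postgresql/scripts/find_latest_ready_wal.py", check_script, mode=0o755) would no longer depend on another hook having created the scripts directory earlier. Whether the directory also needs specific ownership is not clear from the log, so that is left out.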

Crashdumps and configs can be found here: https://oil-jenkins.canonical.com/artifacts/28fcfc8b-9180-48ed-9d51-7613538d11e9/index.html

Sponge-Bas commented 1 year ago

Judging by the test runs that hit this bug, I think it's only a Jammy issue. Focal seems to work fine.

marceloneppel commented 1 year ago

Hi @Basdbruijne! Thanks for the report. Could you confirm the channel you're using for the PostgreSQL charm? I think you're using stable, right? If so, you're using the legacy PostgreSQL charm.

We're currently fixing some issues that we saw when relating this new charm with Landscape in https://github.com/canonical/postgresql-operator/pull/93.

Sponge-Bas commented 1 year ago

We are indeed using stable (rev 282), so I will move this bug there.

When is the data platform team expected to take over this charm?

Sponge-Bas commented 1 year ago

Bug moved to LP: #2015344