canonical / postgresql-k8s-operator

A Charmed Operator for running PostgreSQL on Kubernetes
https://charmhub.io/postgresql-k8s
Apache License 2.0

Charm stuck on JAAS #309

Closed marceloneppel closed 1 month ago

marceloneppel commented 11 months ago

Steps to reproduce

Deploy PostgreSQL K8S from stable in jimm.operatorinc.org

Expected behavior

Charm starts correctly.

Actual behavior

The charm is stuck with the "awaiting for cluster to start" status message.

Versions

Juju CLI: 3.1.6

Juju agent: 3.1.5

Charm revision: 158

Pebble logs:

root@pge-0:/# pebble logs
2023-11-01T09:31:09.897Z [postgresql] The files belonging to this database system will be owned by user "postgres".
2023-11-01T09:31:09.897Z [postgresql] This user must also own the server process.
2023-11-01T09:31:09.897Z [postgresql] 
2023-11-01T09:31:09.897Z [postgresql] The database cluster will be initialized with locales
2023-11-01T09:31:09.897Z [postgresql]   COLLATE:  C
2023-11-01T09:31:09.897Z [postgresql]   CTYPE:    C.UTF-8
2023-11-01T09:31:09.897Z [postgresql]   MESSAGES: C
2023-11-01T09:31:09.897Z [postgresql]   MONETARY: C
2023-11-01T09:31:09.897Z [postgresql]   NUMERIC:  C
2023-11-01T09:31:09.897Z [postgresql]   TIME:     C
2023-11-01T09:31:09.897Z [postgresql] The default text search configuration will be set to "english".
2023-11-01T09:31:09.897Z [postgresql] 
2023-11-01T09:31:09.897Z [postgresql] Data page checksums are enabled.
2023-11-01T09:31:09.897Z [postgresql] 
2023-11-01T09:31:09.897Z [postgresql] creating directory /var/lib/postgresql/data/pgdata ... ok
2023-11-01T09:31:09.898Z [postgresql] creating subdirectories ... ok
2023-11-01T09:31:09.899Z [postgresql] selecting dynamic shared memory implementation ... posix
2023-11-01T09:31:09.899Z [postgresql] selecting default max_connections ... 20
2023-11-01T09:31:10.739Z [postgresql] selecting default shared_buffers ... 400kB
2023-11-01T09:31:14.325Z [postgresql] selecting default time zone ... Etc/UTC
2023-11-01T09:31:14.344Z [postgresql] creating configuration files ... ok
2023-11-01T09:31:14.345Z [postgresql] running bootstrap script ... Bus error (core dumped)
2023-11-01T09:31:14.509Z [postgresql] child process exited with exit code 135
2023-11-01T09:31:14.509Z [postgresql] initdb: removing data directory "/var/lib/postgresql/data/pgdata"
2023-11-01T09:31:14.511Z [postgresql] pg_ctl: database system initialization failed
2023-11-01T09:31:15.084Z [postgresql] Traceback (most recent call last):
2023-11-01T09:31:15.084Z [postgresql]   File "/usr/bin/patroni", line 33, in <module>
2023-11-01T09:31:15.084Z [postgresql]     sys.exit(load_entry_point('patroni==3.0.2', 'console_scripts', 'patroni')())
2023-11-01T09:31:15.084Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 144, in main
2023-11-01T09:31:15.084Z [postgresql]     return patroni_main()
2023-11-01T09:31:15.085Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 136, in patroni_main
2023-11-01T09:31:15.085Z [postgresql]     abstract_main(Patroni, schema)
2023-11-01T09:31:15.085Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 181, in abstract_main
2023-11-01T09:31:15.085Z [postgresql]     controller.run()
2023-11-01T09:31:15.085Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 106, in run
2023-11-01T09:31:15.085Z [postgresql]     super(Patroni, self).run()
2023-11-01T09:31:15.086Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 126, in run
2023-11-01T09:31:15.086Z [postgresql]     self._run_cycle()
2023-11-01T09:31:15.086Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 109, in _run_cycle
2023-11-01T09:31:15.086Z [postgresql]     logger.info(self.ha.run_cycle())
2023-11-01T09:31:15.086Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1770, in run_cycle
2023-11-01T09:31:15.086Z [postgresql]     info = self._run_cycle()
2023-11-01T09:31:15.087Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1592, in _run_cycle
2023-11-01T09:31:15.087Z [postgresql]     return self.post_bootstrap()
2023-11-01T09:31:15.087Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1483, in post_bootstrap
2023-11-01T09:31:15.087Z [postgresql]     self.cancel_initialization()
2023-11-01T09:31:15.087Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1476, in cancel_initialization
2023-11-01T09:31:15.088Z [postgresql]     raise PatroniFatalException('Failed to bootstrap cluster')
2023-11-01T09:31:15.088Z [postgresql] patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
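For readers decoding the log above: the `child process exited with exit code 135` line is consistent with the `Bus error (core dumped)` message, because shell-style exit statuses above 128 conventionally mean "128 + signal number", and on Linux signal 7 is SIGBUS. A minimal sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of the charm, Patroni, or Pebble.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    By convention, a status above 128 means the process was
    terminated by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            # Signal numbering is platform-specific; on Linux 7 is SIGBUS.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


# The status reported by initdb in the Pebble logs above.
print(describe_exit_code(135))
```

This only explains how the process died (a bus error), not why; the thread itself does not state the root cause.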

Additional context

Huge pages setting in the unit:

root@pge-0:/# sysctl -a | grep hugepage
sysctl: reading key "kernel.apparmor_display_secid_mode"
sysctl: reading key "kernel.unprivileged_userns_apparmor_policy"
vm.nr_hugepages = 1024
vm.nr_hugepages_mempolicy = 1024
vm.nr_overcommit_hugepages = 0
root@pge-0:/#

MM discussion

github-actions[bot] commented 11 months ago

https://warthogs.atlassian.net/browse/DPE-2880

taurus-forever commented 11 months ago

From https://chat.charmhub.io/charmhub/pl/yg6asxao6f8hpn93xpq83efrte Marcelo wrote:

I created a new revision for the PostgreSQL charm on the 14/edge/test channel. It should have the fix to start PostgreSQL (and avoid it being stuck with the awaiting for cluster to start message). I couldn't reproduce the Juju secrets issue that you showed me in the logs, but now the PostgreSQL charm should be able to start.
....
If that revision works, I should create a pull request to later publish the revision correctly to the 14/edge channel.

We are waiting for confirmation that the fix works before merging it into edge...

taurus-forever commented 9 months ago

@marceloneppel what is our plan here?

marceloneppel commented 9 months ago

We'll need some help from Alex Kilroy to bootstrap the environment again. He has been busy, so we couldn't make progress on this task.

taurus-forever commented 4 months ago

Hi @ale8k, is it still reproducible on JAAS (with 14/stable, or with 14/candidate, which we are now preparing for the stable release)? If so, can you help us with an environment to reproduce and troubleshoot it? Tnx!

taurus-forever commented 1 month ago

Dear @ale8k, is there any place where we can reproduce this issue (see my comment above)? Tnx!

ale8k commented 1 month ago

Hi @taurus-forever, I'm unsure how to reproduce this... Perhaps @kian99 knows?

ale8k commented 1 month ago

Ahh! This can be ignored, operator day has passed.

taurus-forever commented 1 month ago

Resolving the ticket as no longer topical.

The Data Team is still interested in JAAS deployment/testing and is looking for an environment to test and document it (separate story).

Tnx!