canonical / postgresql-k8s-operator

A Charmed Operator for running PostgreSQL on Kubernetes
https://charmhub.io/postgresql-k8s
Apache License 2.0

Units stuck in `reinitialising replica` and `awaiting for cluster to start` #684

Open kelkawi-a opened 2 months ago

kelkawi-a commented 2 months ago

Steps to reproduce

  1. Deploy 3 units of postgresql-k8s charm, channel 14/stable revision 281

Expected behavior

The units remain in an active state

Actual behavior

After running fine for a while (i.e. all three units were active and functional), two of the three units became stuck in a waiting/maintenance state with the following status:

postgresql-k8s/0                     active       idle              Primary
postgresql-k8s/1                     waiting      idle             awaiting for cluster to start
postgresql-k8s/2*                    maintenance  idle             reinitialising replica

Versions

Operating system: Ubuntu 22.04.4 LTS

Juju CLI: 3.5.3-ubuntu-amd64

Juju agent: 3.5.3

Charm revision: 281, channel 14/stable

kubectl: Client Version: v1.30.4 Server Version: v1.26.15

Log output

Juju debug log:

Output of juju debug-log --include postgresql-k8s/<unit_number>:

postgresql-1.log postgresql-2.log

Output of juju show-status-log of unit 1:

Time                   Type       Status       Message
03 Sep 2024 14:52:21Z  workload   active       Primary
03 Sep 2024 15:18:10Z  juju-unit  error        hook failed: "update-status"
03 Sep 2024 15:20:05Z  workload   maintenance  stopping charm software
03 Sep 2024 15:20:05Z  juju-unit  executing    running stop hook
03 Sep 2024 15:20:12Z  workload   maintenance  
03 Sep 2024 15:20:12Z  juju-unit  executing    running start hook
03 Sep 2024 15:20:16Z  juju-unit  executing    running leader-settings-changed hook
03 Sep 2024 15:20:17Z  juju-unit  executing    running postgresql-pebble-ready hook
03 Sep 2024 15:20:19Z  workload   maintenance  stopping charm software
03 Sep 2024 15:20:19Z  juju-unit  executing    running stop hook
03 Sep 2024 15:20:21Z  workload   maintenance  
03 Sep 2024 15:27:45Z  juju-unit  executing    running upgrade-charm hook
03 Sep 2024 15:27:59Z  juju-unit  executing    running config-changed hook
03 Sep 2024 15:28:01Z  juju-unit  executing    running start hook
03 Sep 2024 15:28:04Z  juju-unit  executing    running leader-settings-changed hook
03 Sep 2024 15:28:06Z  juju-unit  executing    running postgresql-pebble-ready hook
03 Sep 2024 22:22:38Z  juju-unit  idle         
03 Sep 2024 22:23:42Z  juju-unit  error        hook failed: "update-status"
04 Sep 2024 09:23:17Z  juju-unit  idle         
04 Sep 2024 09:23:17Z  workload   waiting      awaiting for cluster to start

Output of juju show-status-log of unit 2:

Time                   Type       Status       Message
03 Sep 2024 11:41:52Z  workload   maintenance  stopping charm software
03 Sep 2024 11:41:52Z  juju-unit  executing    running stop hook
03 Sep 2024 11:42:02Z  workload   maintenance  
03 Sep 2024 11:42:03Z  juju-unit  executing    running start hook
03 Sep 2024 11:49:32Z  juju-unit  error        hook failed: "start"
03 Sep 2024 11:49:38Z  juju-unit  executing    running start hook
03 Sep 2024 11:49:50Z  juju-unit  executing    running leader-settings-changed hook
03 Sep 2024 11:49:51Z  juju-unit  executing    running postgresql-pebble-ready hook
03 Sep 2024 11:50:45Z  workload   waiting      awaiting for cluster to start
03 Sep 2024 11:50:53Z  workload   waiting      Updating extensions
03 Sep 2024 11:50:53Z  workload   waiting      awaiting for cluster to start
03 Sep 2024 11:50:53Z  workload   active       
03 Sep 2024 14:52:10Z  juju-unit  idle         
03 Sep 2024 14:59:11Z  juju-unit  executing    running leader-elected hook
03 Sep 2024 15:31:18Z  workload   maintenance  reinitialising replica
03 Sep 2024 15:32:05Z  workload   active       
03 Sep 2024 22:23:01Z  juju-unit  idle         
03 Sep 2024 22:23:21Z  juju-unit  error        hook failed: "update-status"
04 Sep 2024 09:24:11Z  juju-unit  idle         
04 Sep 2024 09:24:11Z  workload   maintenance  reinitialising replica
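
The hook failures in the two status logs above can be picked out with a short script. This is only a sketch: it assumes the columns of `juju show-status-log` are separated by two or more spaces, as in the pasted output, and the sample is a subset of the Unit 1 log. The function name `hook_failures` is illustrative.

```python
import re

# Subset of the Unit 1 status log shown above.
STATUS_LOG = """\
Time                   Type       Status       Message
03 Sep 2024 15:18:10Z  juju-unit  error        hook failed: "update-status"
03 Sep 2024 15:20:05Z  workload   maintenance  stopping charm software
03 Sep 2024 22:23:42Z  juju-unit  error        hook failed: "update-status"
04 Sep 2024 09:23:17Z  workload   waiting      awaiting for cluster to start
"""


def hook_failures(log_text):
    """Return (timestamp, hook_name) for every 'hook failed' entry."""
    failures = []
    for line in log_text.splitlines()[1:]:  # skip the header row
        # Assumption: columns are separated by runs of two or more spaces.
        columns = re.split(r"\s{2,}", line.strip(), maxsplit=3)
        if len(columns) < 4:
            continue  # rows with an empty Message column
        timestamp, _type, status, message = columns
        match = re.match(r'hook failed: "(.+)"', message)
        if status == "error" and match:
            failures.append((timestamp, match.group(1)))
    return failures


for timestamp, hook in hook_failures(STATUS_LOG):
    print(timestamp, hook)
```

Running the same function over the Unit 2 log would also surface the failed "start" hook at 11:49:32Z.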

Patroni logs:

Unit 1:

2024-09-04 09:43:37 UTC [16]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-09-04 09:43:37 UTC [16]: INFO: no action. I am (postgresql-k8s-1), a secondary, and following a leader (postgresql-k8s-0) 
2024-09-04 09:43:36 UTC [16]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-1 
2024-09-04 09:43:36 UTC [16]: INFO: Still starting up as a standby. 
2024-09-04 09:43:36 UTC [16]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-1 
2024-09-04 09:43:36 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:37 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:37 UTC [16]: WARNING: Retry got exception: connection problems 
2024-09-04 09:43:26 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:26 UTC [16]: INFO: establishing a new patroni connection to the postgres cluster 
2024-09-04 09:43:26 UTC [16]: WARNING: Retry got exception: connection problems 
2024-09-04 09:43:26 UTC [16]: WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role 
2024-09-04 09:43:26 UTC [16]: INFO: no action. I am (postgresql-k8s-1), a secondary, and following a leader (postgresql-k8s-0) 

Unit 2:

2024-09-04 09:43:56 UTC [14833]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-2 
2024-09-04 09:43:56 UTC [14833]: INFO: reinitialize in progress 
2024-09-04 09:44:06 UTC [14833]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-2 
2024-09-04 09:44:06 UTC [14833]: INFO: reinitialize in progress 
2024-09-04 09:44:06 UTC [14833]: INFO: Lock owner: postgresql-k8s-0; I am postgresql-k8s-2 
2024-09-04 09:44:06 UTC [14833]: INFO: reinitialize in progress 
2024-09-04 09:43:55 UTC [14833]: ERROR: Could not rename data directory /var/lib/postgresql/data/pgdata 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1314, in remove_data_directory
    shutil.rmtree(self._data_dir)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1287, in move_data_directory
    os.rename(self._data_dir, new_name)
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata' -> '/var/lib/postgresql/data/pgdata.failed'
2024-09-04 09:43:55 UTC [14833]: INFO: renaming data directory to /var/lib/postgresql/data/pgdata.failed 

syncronize-issues-to-jira[bot] commented 2 months ago

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5335.

This message was autogenerated

marceloneppel commented 2 months ago

Hi, @kelkawi-a!

Do you know if the cluster was restarted or upgraded in some way? I see the following hook being fired on Unit 1 in the logs that you shared:

03 Sep 2024 15:27:45Z  juju-unit  executing    running upgrade-charm hook

Could you share some logs from Unit 1 so we can understand what's happening?

juju show-unit postgresql-k8s/1

juju ssh --container postgresql postgresql-k8s/1 pebble services
juju ssh --container charm postgresql-k8s/1 curl localhost:8008/cluster
juju ssh --container charm postgresql-k8s/0 curl localhost:8008/cluster
juju ssh --container charm postgresql-k8s/0 curl localhost:8008/history

juju ssh --container postgresql postgresql-k8s/1 cat /var/log/postgresql/patroni.log /var/log/postgresql/patroni.log.1 /var/log/postgresql/patroni.log.2

juju ssh --container postgresql postgresql-k8s/1 "find /var/log/postgresql/ -name postgresql*.log -not -empty -exec ls {} \; -exec cat {} \;"

If you're using TLS, you should use curl -k https://localhost:8008/xxx in the above commands.

The following error on Unit 2 has been fixed in revisions 332 and 333 from the 14/edge channel (https://github.com/canonical/postgresql-k8s-operator/pull/580) and will be part of the next revision on the 14/stable channel.

PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata'

Right now, to fix Unit 2, you can run the following command:

juju ssh --container postgresql postgresql-k8s/2 chown postgres:postgres /var/lib/postgresql/data
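
Both failing calls in the Unit 2 traceback (os.rmdir and os.rename on pgdata) modify entries of the parent directory, which is consistent with the chown above targeting /var/lib/postgresql/data rather than pgdata itself. Before and after running it, the ownership can be inspected with a small check like the one below. This is a sketch, not part of the charm: the helper name `describe_owner` is hypothetical, and it assumes it is run inside the postgresql container as the user that runs Patroni.

```python
import os
import pwd


def describe_owner(path):
    """Return (owner_name, can_modify_entries) for a directory."""
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        owner = str(info.st_uid)  # uid without a passwd entry
    # Removing or renaming an entry requires write and search
    # permission on the directory that contains it.
    can_modify = os.access(path, os.W_OK | os.X_OK)
    return owner, can_modify


if __name__ == "__main__":
    data_dir = "/var/lib/postgresql/data"  # path taken from the traceback
    if os.path.isdir(data_dir):
        print(describe_owner(data_dir))
```

If the reported owner is not postgres, or the second value is False for the Patroni user, the reinitialisation would be expected to keep failing in the same way.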

kelkawi-a commented 1 month ago

@marceloneppel thanks for investigating. The cluster is not managed by our team so I don't have visibility on whether or not the cluster was restarted.

Below are the requested logs:

juju ssh --container postgresql postgresql-k8s/1 pebble services:

Service            Startup   Current   Since
metrics_server     enabled   active    2 days ago, at 15:27 UTC
pgbackrest server  disabled  inactive  -
postgresql         enabled   active    2 days ago, at 15:27 UTC

juju ssh --container charm postgresql-k8s/1 curl localhost:8008/cluster:

{"members": [{"name": "postgresql-k8s-0", "role": "leader", "state": "running", "api_url": "http://postgresql-k8s-0.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-0.postgresql-k8s-endpoints", "port": 5432, "timeline": 213}, {"name": "postgresql-k8s-1", "role": "replica", "state": "starting", "api_url": "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-1.postgresql-k8s-endpoints", "port": 5432, "lag": "unknown"}]}

juju ssh --container charm postgresql-k8s/0 curl localhost:8008/cluster:

{"members": [{"name": "postgresql-k8s-0", "role": "leader", "state": "running", "api_url": "http://postgresql-k8s-0.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-0.postgresql-k8s-endpoints", "port": 5432, "timeline": 213}, {"name": "postgresql-k8s-1", "role": "replica", "state": "starting", "api_url": "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/patroni", "host": "postgresql-k8s-1.postgresql-k8s-endpoints", "port": 5432, "lag": "unknown"}]}
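
Note that both /cluster responses list only two members: postgresql-k8s-2 does not appear at all. A check like the following makes that kind of gap explicit. It is a sketch under assumptions: the expected member names are derived from the three deployed units, the JSON is a trimmed copy of the response above, and treating "running" and "streaming" as the healthy states is my assumption (newer Patroni releases may report "streaming" for replicas).

```python
import json

# Trimmed copy of the /cluster response above (only the fields used here).
CLUSTER_JSON = """
{"members": [
  {"name": "postgresql-k8s-0", "role": "leader", "state": "running", "timeline": 213},
  {"name": "postgresql-k8s-1", "role": "replica", "state": "starting", "lag": "unknown"}
]}
"""

# Assumption: states considered healthy for this check.
HEALTHY_STATES = {"running", "streaming"}


def unhealthy_members(cluster, expected_names):
    """Map each expected member with a problem to a short description."""
    by_name = {member["name"]: member for member in cluster["members"]}
    problems = {}
    for name in expected_names:
        member = by_name.get(name)
        if member is None:
            problems[name] = "not registered in the cluster"
        elif member["state"] not in HEALTHY_STATES:
            problems[name] = f"state is {member['state']!r}"
    return problems


expected = [f"postgresql-k8s-{i}" for i in range(3)]  # three deployed units
print(unhealthy_members(json.loads(CLUSTER_JSON), expected))
```

For the response above this reports postgresql-k8s-1 as starting and postgresql-k8s-2 as missing.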

juju ssh --container charm postgresql-k8s/0 curl localhost:8008/history:

[[1, 181796616, "no recovery target specified", "2024-08-20T15:39:40.956551+00:00", "postgresql-k8s-1"], [2, 251658400, "no recovery target specified", "2024-08-20T16:34:22.832061+00:00", "postgresql-k8s-1"], [3, 268435616, "no recovery target specified", "2024-08-20T16:34:32.856157+00:00", "postgresql-k8s-1"], [4, 402653344, "no recovery target specified", "2024-08-20T18:56:24.021318+00:00", "postgresql-k8s-1"], [5, 514753096, "no recovery target specified", "2024-08-20T22:37:19.281941+00:00", "postgresql-k8s-2"], [6, 514877944, "no recovery target specified", "2024-08-20T22:38:08.654025+00:00", "postgresql-k8s-2"], [7, 721420448, "no recovery target specified", "2024-08-21T06:51:58.300517+00:00", "postgresql-k8s-2"], [8, 730339744, "no recovery target specified", "2024-08-21T07:10:08.821816+00:00", "postgresql-k8s-1"], [9, 762367408, "no recovery target specified", "2024-08-21T08:41:35.722519+00:00", "postgresql-k8s-1"], [10, 771752096, "no recovery target specified", "2024-08-21T08:59:49.737028+00:00", "postgresql-k8s-1"], [11, 788529312, "no recovery target specified", "2024-08-21T09:15:15.401925+00:00", "postgresql-k8s-1"], [12, 833224328, "no recovery target specified", "2024-08-21T09:36:03.989966+00:00", "postgresql-k8s-1"], [13, 905969824, "no recovery target specified", "2024-08-21T09:39:09.965176+00:00", "postgresql-k8s-0"], [14, 939524256, "no recovery target specified", "2024-08-21T10:40:31.448781+00:00", "postgresql-k8s-0"], [15, 956301472, "no recovery target specified", "2024-08-21T10:41:52.053647+00:00", "postgresql-k8s-0"], [16, 958223832, "no recovery target specified", "2024-08-21T10:45:13.033993+00:00", "postgresql-k8s-1"], [17, 1140850848, "no recovery target specified", "2024-08-21T15:34:37.070832+00:00", "postgresql-k8s-1"], [18, 1241514144, "no recovery target specified", "2024-08-21T16:18:22.976333+00:00", "postgresql-k8s-1"], [19, 1258291360, "no recovery target specified", "2024-08-21T16:25:41.418802+00:00", "postgresql-k8s-2"], [20, 
1292018432, "no recovery target specified", "2024-08-21T16:26:19.684027+00:00", "postgresql-k8s-0"], [21, 1516027008, "no recovery target specified", "2024-08-21T22:39:15.192982+00:00", "postgresql-k8s-0"], [22, 1543504032, "no recovery target specified", "2024-08-21T23:26:38.412046+00:00", "postgresql-k8s-1"], [23, 1593835680, "no recovery target specified", "2024-08-21T23:29:26.665688+00:00", "postgresql-k8s-1"], [24, 1610612896, "no recovery target specified", "2024-08-21T23:46:20.285545+00:00", "postgresql-k8s-0"], [25, 1660944544, "no recovery target specified", "2024-08-21T23:48:30.511019+00:00", "postgresql-k8s-0"], [26, 1711276192, "no recovery target specified", "2024-08-22T00:12:43.699291+00:00", "postgresql-k8s-0"], [27, 1728053408, "no recovery target specified"], [28, 1744988592, "no recovery target specified", "2024-08-22T00:14:57.120767+00:00", "postgresql-k8s-0"], [29, 1795162272, "no recovery target specified", "2024-08-22T00:16:12.049596+00:00", "postgresql-k8s-0"], [30, 1812354264, "no recovery target specified", "2024-08-22T00:20:34.588501+00:00", "postgresql-k8s-0"], [31, 1813536080, "no recovery target specified", "2024-08-22T00:22:15.579347+00:00", "postgresql-k8s-2"], [32, 1828716704, "no recovery target specified", "2024-08-22T00:37:41.109522+00:00", "postgresql-k8s-1"], [33, 1879048352, "no recovery target specified", "2024-08-22T00:48:01.151554+00:00", "postgresql-k8s-1"], [34, 1895825568, "no recovery target specified", "2024-08-22T00:50:00.100711+00:00", "postgresql-k8s-1"], [35, 1912602784, "no recovery target specified", "2024-08-22T01:08:31.767597+00:00", "postgresql-k8s-2"], [36, 1996488864, "no recovery target specified", "2024-08-22T01:16:46.836412+00:00", "postgresql-k8s-2"], [37, 2046820512, "no recovery target specified", "2024-08-22T01:22:14.300951+00:00", "postgresql-k8s-2"], [38, 2063597728, "no recovery target specified", "2024-08-22T01:37:09.822397+00:00", "postgresql-k8s-2"], [39, 2080374944, "no recovery target 
specified"], [40, 2097152160, "no recovery target specified"], [41, 2097913152, "no recovery target specified", "2024-08-22T01:39:17.450603+00:00", "postgresql-k8s-2"], [42, 2099390784, "no recovery target specified", "2024-08-22T01:40:38.110730+00:00", "postgresql-k8s-2"], [43, 2103169392, "no recovery target specified", "2024-08-22T01:48:05.111421+00:00", "postgresql-k8s-2"], [44, 2113929376, "no recovery target specified", "2024-08-22T01:49:39.543567+00:00", "postgresql-k8s-2"], [45, 2244206896, "no recovery target specified", "2024-08-22T06:53:08.623227+00:00", "postgresql-k8s-2"], [46, 2264924320, "no recovery target specified", "2024-08-22T07:21:39.970059+00:00", "postgresql-k8s-2"], [47, 2265634360, "no recovery target specified", "2024-08-22T07:22:18.270123+00:00", "postgresql-k8s-2"], [48, 2365587616, "no recovery target specified", "2024-08-22T11:12:21.166659+00:00", "postgresql-k8s-2"], [49, 2449473696, "no recovery target specified", "2024-08-22T13:05:11.466674+00:00", "postgresql-k8s-2"], [50, 2536272576, "no recovery target specified", "2024-08-22T14:49:17.826013+00:00", "postgresql-k8s-1"], [51, 2566914208, "no recovery target specified", "2024-08-22T15:25:50.047575+00:00", "postgresql-k8s-1"], [52, 2667577504, "no recovery target specified", "2024-08-22T18:47:57.516309+00:00", "postgresql-k8s-1"], [53, 2684354720, "no recovery target specified", "2024-08-22T18:48:34.816683+00:00", "postgresql-k8s-1"], [54, 2762614136, "no recovery target specified", "2024-08-22T20:36:30.098472+00:00", "postgresql-k8s-1"], [55, 2885681312, "no recovery target specified", "2024-08-22T23:18:51.124283+00:00", "postgresql-k8s-1"], [56, 3004380064, "no recovery target specified", "2024-08-23T02:00:48.446620+00:00", "postgresql-k8s-1"], [57, 3170893984, "no recovery target specified", "2024-08-23T08:12:11.806884+00:00", "postgresql-k8s-1"], [58, 3221225632, "no recovery target specified", "2024-08-23T08:20:27.883665+00:00", "postgresql-k8s-2"], [59, 3405775008, "no 
recovery target specified", "2024-08-23T14:03:52.177395+00:00", "postgresql-k8s-1"], [60, 3416972144, "no recovery target specified", "2024-08-23T14:32:54.018751+00:00", "postgresql-k8s-1"], [61, 3489661088, "no recovery target specified", "2024-08-23T16:57:24.741255+00:00", "postgresql-k8s-2"], [62, 3556769952, "no recovery target specified", "2024-08-23T17:57:45.602897+00:00", "postgresql-k8s-2"], [63, 3558569864, "no recovery target specified"], [64, 3574118728, "no recovery target specified", "2024-08-23T18:02:31.729314+00:00", "postgresql-k8s-0"], [65, 3623878816, "no recovery target specified", "2024-08-23T18:05:14.519149+00:00", "postgresql-k8s-0"], [66, 3640656032, "no recovery target specified", "2024-08-23T18:06:06.143730+00:00", "postgresql-k8s-0"], [67, 3657433248, "no recovery target specified", "2024-08-23T18:06:51.312237+00:00", "postgresql-k8s-0"], [68, 3674210464, "no recovery target specified", "2024-08-23T18:09:37.034095+00:00", "postgresql-k8s-0"], [69, 3675247600, "no recovery target specified", "2024-08-23T18:10:27.686663+00:00", "postgresql-k8s-0"], [70, 3676823952, "no recovery target specified"], [71, 3690987680, "no recovery target specified", "2024-08-23T18:11:35.688415+00:00", "postgresql-k8s-1"], [72, 3707764896, "no recovery target specified", "2024-08-23T18:12:43.540377+00:00", "postgresql-k8s-2"], [73, 3707765384, "no recovery target specified", "2024-08-23T18:14:16.000530+00:00", "postgresql-k8s-2"], [74, 3724542112, "no recovery target specified", "2024-08-23T18:15:19.460634+00:00", "postgresql-k8s-2"], [75, 3774873760, "no recovery target specified"], [76, 3808428192, "no recovery target specified", "2024-08-23T18:16:58.280797+00:00", "postgresql-k8s-2"], [77, 3858759840, "no recovery target specified", "2024-08-23T18:18:34.760956+00:00", "postgresql-k8s-2"], [78, 3875537056, "no recovery target specified", "2024-08-23T18:19:50.217081+00:00", "postgresql-k8s-2"], [79, 3892314272, "no recovery target specified", 
"2024-08-23T18:21:29.141244+00:00", "postgresql-k8s-2"], [80, 3909091488, "no recovery target specified", "2024-08-23T18:23:30.985445+00:00", "postgresql-k8s-2"], [81, 3910140768, "no recovery target specified", "2024-08-23T18:24:15.971929+00:00", "postgresql-k8s-0"], [82, 3910141376, "no recovery target specified", "2024-08-23T18:25:24.100697+00:00", "postgresql-k8s-0"], [83, 3925868704, "no recovery target specified", "2024-08-23T18:28:32.594821+00:00", "postgresql-k8s-0"], [84, 4042524456, "no recovery target specified", "2024-08-23T21:46:42.630083+00:00", "postgresql-k8s-0"], [85, 4143972512, "no recovery target specified"], [86, 4160749728, "no recovery target specified"], [87, 4177526944, "no recovery target specified"], [88, 4194304160, "no recovery target specified"], [89, 4195391232, "no recovery target specified"], [90, 4211081376, "no recovery target specified", "2024-08-24T00:23:03.148739+00:00", "postgresql-k8s-0"], [91, 4227858592, "no recovery target specified"], [92, 4244635808, "no recovery target specified", "2024-08-24T00:26:36.435160+00:00", "postgresql-k8s-0"], [93, 4261413024, "no recovery target specified", "2024-08-24T00:27:41.419898+00:00", "postgresql-k8s-0"], [94, 4580314344, "no recovery target specified", "2024-08-24T13:09:14.935904+00:00", "postgresql-k8s-1"], [95, 4825246152, "no recovery target specified", "2024-08-24T19:35:33.410112+00:00", "postgresql-k8s-1"], [96, 4826007536, "no recovery target specified", "2024-08-24T19:35:50.966294+00:00", "postgresql-k8s-1"], [97, 4949278880, "no recovery target specified", "2024-08-25T00:04:38.864026+00:00", "postgresql-k8s-1"], [98, 4952207120, "no recovery target specified", "2024-08-25T00:06:19.022716+00:00", "postgresql-k8s-0"], [99, 4966056096, "no recovery target specified"], [100, 4966277072, "no recovery target specified", "2024-08-25T00:07:21.725701+00:00", "postgresql-k8s-1"], [101, 5251268768, "no recovery target specified", "2024-08-25T07:46:38.725755+00:00", "postgresql-k8s-2"], 
[102, 5452595360, "no recovery target specified", "2024-08-25T14:37:56.383376+00:00", "postgresql-k8s-0"], [103, 5620367520, "no recovery target specified", "2024-08-25T18:30:26.815016+00:00", "postgresql-k8s-2"], [104, 5638915488, "no recovery target specified", "2024-08-25T19:22:39.372200+00:00", "postgresql-k8s-2"], [105, 5670884344, "no recovery target specified", "2024-08-25T19:23:33.546097+00:00", "postgresql-k8s-1"], [106, 5706456888, "no recovery target specified", "2024-08-25T20:49:17.647071+00:00", "postgresql-k8s-1"], [107, 5855248544, "no recovery target specified", "2024-08-26T02:10:18.555503+00:00", "postgresql-k8s-2"], [108, 5905580192, "no recovery target specified", "2024-08-26T02:17:25.228485+00:00", "postgresql-k8s-1"], [109, 5922357408, "no recovery target specified", "2024-08-26T02:19:28.717547+00:00", "postgresql-k8s-1"], [110, 5939134624, "no recovery target specified", "2024-08-26T02:29:27.486859+00:00", "postgresql-k8s-1"], [111, 5972689056, "no recovery target specified", "2024-08-26T02:42:15.938163+00:00", "postgresql-k8s-1"], [112, 6023020704, "no recovery target specified", "2024-08-26T02:53:43.801624+00:00", "postgresql-k8s-0"], [113, 6325010592, "no recovery target specified", "2024-08-26T14:07:39.477169+00:00", "postgresql-k8s-1"], [114, 6354223144, "no recovery target specified", "2024-08-26T15:32:59.513967+00:00", "postgresql-k8s-1"], [115, 6578829536, "no recovery target specified", "2024-08-27T00:43:10.203360+00:00", "postgresql-k8s-1"], [116, 6777995424, "no recovery target specified", "2024-08-27T08:07:52.321300+00:00", "postgresql-k8s-2"], [117, 6861881504, "no recovery target specified", "2024-08-27T08:21:58.494618+00:00", "postgresql-k8s-2"], [118, 7079985312, "no recovery target specified", "2024-08-27T16:53:25.903452+00:00", "postgresql-k8s-2"], [119, 7080546232, "no recovery target specified", "2024-08-27T16:53:58.291504+00:00", "postgresql-k8s-2"], [120, 7331643552, "no recovery target specified"], [121, 7348420768, "no 
recovery target specified", "2024-08-28T01:06:18.202914+00:00", "postgresql-k8s-2"], [122, 7734296736, "no recovery target specified", "2024-08-28T16:05:21.315844+00:00", "postgresql-k8s-2"], [123, 7902068896, "no recovery target specified", "2024-08-28T22:19:46.197479+00:00", "postgresql-k8s-2"], [124, 7913359576, "no recovery target specified", "2024-08-28T22:46:48.677040+00:00", "postgresql-k8s-2"], [125, 8120172704, "no recovery target specified"], [126, 8136949920, "no recovery target specified", "2024-08-29T06:49:26.705207+00:00", "postgresql-k8s-2"], [127, 8271167648, "no recovery target specified", "2024-08-29T12:03:47.894811+00:00", "postgresql-k8s-0"], [128, 8316160584, "no recovery target specified", "2024-08-29T12:33:56.290399+00:00", "postgresql-k8s-0"], [129, 8317565192, "no recovery target specified", "2024-08-29T12:35:38.187504+00:00", "postgresql-k8s-0"], [130, 8397712664, "no recovery target specified", "2024-08-29T13:48:47.623300+00:00", "postgresql-k8s-0"], [131, 8438939808, "no recovery target specified", "2024-08-29T13:53:47.226569+00:00", "postgresql-k8s-0"], [132, 8472494240, "no recovery target specified", "2024-08-29T14:55:09.374852+00:00", "postgresql-k8s-0"], [133, 8489271456, "no recovery target specified", "2024-08-29T14:56:59.536059+00:00", "postgresql-k8s-0"], [134, 8522825888, "no recovery target specified"], [135, 8539603104, "no recovery target specified"], [136, 8556380320, "no recovery target specified", "2024-08-29T15:03:11.053952+00:00", "postgresql-k8s-2"], [137, 8573157536, "no recovery target specified"], [138, 8589934752, "no recovery target specified", "2024-08-29T15:07:16.670443+00:00", "postgresql-k8s-2"], [139, 8606711968, "no recovery target specified"], [140, 8623489184, "no recovery target specified", "2024-08-29T15:09:12.662681+00:00", "postgresql-k8s-2"], [141, 8640266400, "no recovery target specified"], [142, 8657043616, "no recovery target specified"], [143, 8673820832, "no recovery target specified"], [144, 
8690598048, "no recovery target specified", "2024-08-29T15:14:27.635523+00:00", "postgresql-k8s-0"], [145, 8692610208, "no recovery target specified", "2024-08-29T15:15:44.792364+00:00", "postgresql-k8s-0"], [146, 8693488168, "no recovery target specified", "2024-08-29T15:17:55.189785+00:00", "postgresql-k8s-0"], [147, 8707375264, "no recovery target specified", "2024-08-29T15:20:22.048054+00:00", "postgresql-k8s-2"], [148, 8724152320, "no recovery target specified", "2024-08-29T15:21:21.728036+00:00", "postgresql-k8s-0"], [149, 8774484128, "no recovery target specified"], [150, 8791261344, "no recovery target specified", "2024-08-29T15:24:09.805868+00:00", "postgresql-k8s-0"], [151, 8791765688, "no recovery target specified", "2024-08-29T15:25:36.534813+00:00", "postgresql-k8s-0"], [152, 8793930720, "no recovery target specified", "2024-08-29T15:27:39.748156+00:00", "postgresql-k8s-0"], [153, 8808038560, "no recovery target specified", "2024-08-29T15:29:39.029456+00:00", "postgresql-k8s-0"], [154, 8858370208, "no recovery target specified", "2024-08-29T15:33:07.519728+00:00", "postgresql-k8s-0"], [155, 8875147424, "no recovery target specified", "2024-08-29T15:34:40.624743+00:00", "postgresql-k8s-0"], [156, 8891924640, "no recovery target specified", "2024-08-29T15:35:54.169544+00:00", "postgresql-k8s-0"], [157, 8908701856, "no recovery target specified"], [158, 8925479072, "no recovery target specified", "2024-08-29T15:37:36.578660+00:00", "postgresql-k8s-0"], [159, 8942256288, "no recovery target specified", "2024-08-29T15:38:58.839557+00:00", "postgresql-k8s-0"], [160, 8942652016, "no recovery target specified"], [161, 8959033504, "no recovery target specified"], [162, 8975810720, "no recovery target specified"], [163, 8992587936, "no recovery target specified", "2024-08-29T15:46:43.588630+00:00", "postgresql-k8s-0"], [164, 9009365152, "no recovery target specified", "2024-08-29T15:47:54.457406+00:00", "postgresql-k8s-0"], [165, 9010215304, "no recovery target 
specified"], [166, 9026142368, "no recovery target specified"], [167, 9042919584, "no recovery target specified"], [168, 9059696800, "no recovery target specified", "2024-08-29T15:51:58.764079+00:00", "postgresql-k8s-0"], [169, 9076474016, "no recovery target specified", "2024-08-29T15:53:25.853032+00:00", "postgresql-k8s-0"], [170, 9093251232, "no recovery target specified"], [171, 9093458248, "no recovery target specified", "2024-08-29T15:58:07.552114+00:00", "postgresql-k8s-0"], [172, 9110028448, "no recovery target specified", "2024-08-29T15:58:51.464837+00:00", "postgresql-k8s-2"], [173, 9160360096, "no recovery target specified", "2024-08-29T16:02:38.447078+00:00", "postgresql-k8s-0"], [174, 9244246176, "no recovery target specified", "2024-08-29T16:22:30.692116+00:00", "postgresql-k8s-0"], [175, 9261023392, "no recovery target specified", "2024-08-29T16:23:22.120677+00:00", "postgresql-k8s-0"], [176, 9288409336, "no recovery target specified", "2024-08-29T17:32:28.241821+00:00", "postgresql-k8s-1"], [177, 9344909472, "no recovery target specified", "2024-08-29T18:29:28.763162+00:00", "postgresql-k8s-1"], [178, 9663676576, "no recovery target specified", "2024-08-30T04:58:20.083174+00:00", "postgresql-k8s-1"], [179, 9680453792, "no recovery target specified", "2024-08-30T05:12:05.623284+00:00", "postgresql-k8s-1"], [180, 9730785440, "no recovery target specified", "2024-08-30T07:08:15.004925+00:00", "postgresql-k8s-1"], [181, 9810237136, "no recovery target specified", "2024-08-30T10:24:07.234241+00:00", "postgresql-k8s-0"], [182, 9982443680, "no recovery target specified", "2024-08-30T15:10:50.299920+00:00", "postgresql-k8s-0"], [183, 10051283608, "no recovery target specified", "2024-08-30T18:01:18.251849+00:00", "postgresql-k8s-0"], [184, 10133438624, "no recovery target specified", "2024-08-30T21:16:43.488580+00:00", "postgresql-k8s-0"], [185, 10905190560, "no recovery target specified"], [186, 10921967776, "no recovery target specified"], [187, 
10938744992, "no recovery target specified", "2024-09-01T00:05:18.863917+00:00", "postgresql-k8s-0"], [188, 10940765624, "no recovery target specified", "2024-09-01T00:08:19.321877+00:00", "postgresql-k8s-0"], [189, 10945085512, "no recovery target specified", "2024-09-01T00:15:10.611045+00:00", "postgresql-k8s-1"], [190, 11173626016, "no recovery target specified", "2024-09-01T08:51:05.830425+00:00", "postgresql-k8s-1"], [191, 11426276984, "no recovery target specified", "2024-09-01T15:38:50.611334+00:00", "postgresql-k8s-1"], [192, 11492393120, "no recovery target specified", "2024-09-01T16:37:05.956891+00:00", "postgresql-k8s-1"], [193, 11595880664, "no recovery target specified", "2024-09-01T20:50:27.115296+00:00", "postgresql-k8s-2"], [194, 11609833632, "no recovery target specified", "2024-09-01T21:00:43.492497+00:00", "postgresql-k8s-2"], [195, 11631935744, "no recovery target specified", "2024-09-01T21:55:23.494862+00:00", "postgresql-k8s-2"], [196, 11663238936, "no recovery target specified", "2024-09-01T23:11:59.513540+00:00", "postgresql-k8s-0"], [197, 11710496928, "no recovery target specified", "2024-09-01T23:53:18.776410+00:00", "postgresql-k8s-0"], [198, 11794383008, "no recovery target specified", "2024-09-02T03:13:46.615274+00:00", "postgresql-k8s-0"], [199, 12733907104, "no recovery target specified", "2024-09-03T10:49:54.016630+00:00", "postgresql-k8s-0"], [200, 12750684320, "no recovery target specified", "2024-09-03T11:07:28.008951+00:00", "postgresql-k8s-0"], [201, 12752886712, "no recovery target specified", "2024-09-03T11:12:43.150985+00:00", "postgresql-k8s-1"], [202, 19042931640, "no recovery target specified", "2024-09-03T14:52:37.649814+00:00", "postgresql-k8s-0"], [203, 19058917536, "no recovery target specified", "2024-09-03T14:59:22.994050+00:00", "postgresql-k8s-0"], [204, 22414412504, "no recovery target specified", "2024-09-03T16:44:15.523685+00:00", "postgresql-k8s-0"], [205, 34191966368, "no recovery target specified", 
"2024-09-03T22:23:48.272506+00:00", "postgresql-k8s-0"], [206, 49845108896, "no recovery target specified", "2024-09-04T05:41:55.227049+00:00", "postgresql-k8s-0"], [207, 67478041992, "no recovery target specified", "2024-09-04T14:01:36.483424+00:00", "postgresql-k8s-0"], [208, 85816067000, "no recovery target specified"], [209, 85816499352, "no recovery target specified", "2024-09-04T22:47:02.128342+00:00", "postgresql-k8s-0"], [210, 85849360888, "no recovery target specified", "2024-09-04T22:47:57.168929+00:00", "postgresql-k8s-0"], [211, 85899619016, "no recovery target specified", "2024-09-04T22:49:58.218220+00:00", "postgresql-k8s-0"], [212, 85967585104, "no recovery target specified", "2024-09-04T22:52:49.972051+00:00", "postgresql-k8s-0"]]

juju ssh --container postgresql postgresql-k8s/1 "find /var/log/postgresql/ -name postgresql*.log -not -empty -exec ls {} \; -exec cat {} \;":

2024-09-05 01:39:42 UTC [94097]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:49 UTC [94099]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:51 UTC [94101]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:51 UTC [94102]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:52 UTC [94103]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
2024-09-05 01:39:59 UTC [94105]: user=operator,db=postgres,app=[unknown],client=127.0.0.1,line=1 FATAL:  the database system is starting up
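
The /history response earlier in this comment holds 212 timeline entries between 2024-08-20 and 2024-09-04, which suggests frequent leader changes. To see how they are distributed, the entries can be grouped per day and per member. The sketch below uses four entries copied from that response. It assumes the fourth and fifth fields are the timestamp and the member that took over, and it counts entries lacking those fields separately rather than guessing their values.

```python
import json
from collections import Counter

# Entries 1, 2, 5 and 27 copied from the /history response above.
HISTORY_JSON = """
[[1, 181796616, "no recovery target specified", "2024-08-20T15:39:40.956551+00:00", "postgresql-k8s-1"],
 [2, 251658400, "no recovery target specified", "2024-08-20T16:34:22.832061+00:00", "postgresql-k8s-1"],
 [5, 514753096, "no recovery target specified", "2024-08-20T22:37:19.281941+00:00", "postgresql-k8s-2"],
 [27, 1728053408, "no recovery target specified"]]
"""


def summarise(history):
    """Count timeline entries per day and per member."""
    per_day = Counter()
    per_member = Counter()
    incomplete = 0  # entries without timestamp and member fields
    for entry in history:
        if len(entry) < 5:
            incomplete += 1
            continue
        _timeline, _lsn, _reason, timestamp, member = entry
        per_day[timestamp[:10]] += 1  # YYYY-MM-DD prefix of the ISO timestamp
        per_member[member] += 1
    return per_day, per_member, incomplete


per_day, per_member, incomplete = summarise(json.loads(HISTORY_JSON))
print(dict(per_day), dict(per_member), incomplete)
```

Fed the full response, the per-day counts would show which dates had the most timeline switches, for example the dense run of entries on 2024-08-29.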

marceloneppel commented 1 month ago

Thanks for the details, @kelkawi-a!

Do you still have the Juju debug logs that show something (like the stack trace) from the errors shown in the Unit 1 status log? I mean, the errors in the start and update-status hooks. Those will be useful to understand what happened before the unit reached its current state.

Do you know if there are a lot of clients connecting to the database, especially through the read-only endpoints (replicas)?

If so, we can try to stop the PostgreSQL service in the replica by issuing the following command.

juju ssh --container postgresql postgresql-k8s/1 pebble stop postgresql

Then, after some seconds, we can start it again to see if it starts correctly.

juju ssh --container postgresql postgresql-k8s/1 pebble start postgresql

Also, did the chown command fix Unit 2?

kelkawi-a commented 1 month ago

Unfortunately, I don't have visibility into the logs that far back. Since I reported this bug, the units have reconfigured themselves as follows:

postgresql-k8s/0                     maintenance  idle             reinitialising replica
postgresql-k8s/1                     waiting      idle             awaiting for member to start
postgresql-k8s/2*                    active       idle             Primary

Note: the Primary unit intermittently goes into a maintenance status with the message reconfiguring cluster.

I can confirm that there are 5 applications (3 units each) connecting to the postgresql-k8s application, each of them occupying a number of connection slots.

I've sent you an invite to try and debug this live on the environment if possible.