canonical / postgresql-k8s-operator

A Charmed Operator for running PostgreSQL on Kubernetes
https://charmhub.io/postgresql-k8s
Apache License 2.0
10 stars 20 forks source link

A non-leader unit stuck in "awaiting for member to start" #560

Open jeffreychang911 opened 3 months ago

jeffreychang911 commented 3 months ago

Steps to reproduce

  1. SolQA deployed Charmed Kubernetes 1.30/beta on top of Charmed Openstack Yoga, and then deploy mysql-k8s and postgresql-k8s in different model on top of Charmed K8s.
  2. self-signed-certificates and data-integrator are both deployed and related.

Expected behavior

Postgresql-k8s charm should settle shortly after deploy.

Actual behavior

Juju status found one unit stuck in waiting until timeout in 1 hr. Unit Workload Agent Address Ports Message data-integrator/0 active idle 192.168.254.204 postgresql-k8s/0 waiting executing 192.168.252.201 awaiting for member to start postgresql-k8s/1 active idle 192.168.253.201 Primary postgresql-k8s/2 active idle 192.168.254.203 self-signed-certificates/0* active idle 192.168.252.200

Only found one ERROR from juju debug-log unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)

Versions

Operating system: Jammy

Juju CLI: 3.5.2

Juju agent: 3.5.2

Charm revision: postgresql-k8s charm rev 281

Charmed Kubernetes 1.30/beta, and would be 1.30/stable soon without change.

Log output

Juju debug log: unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)

Additional context

This is found in a SolQA run, https://solutions.qa.canonical.com/testruns/5dc43cf9-2211-4b4c-9a69-a39d4d61176e Crashdump - https://oil-jenkins.canonical.com/artifacts/5dc43cf9-2211-4b4c-9a69-a39d4d61176e/generated/generated/postgresql-k8s/crashdump-2024-07-11-08.49.08.tar.gz

github-actions[bot] commented 3 months ago

https://warthogs.atlassian.net/browse/DPE-4882

asbalderson commented 3 months ago

From another testrun that hit the same error on the unit that fails to connect we have this log message where the connection is getting reset by peer while trying to connect.

This reset connection happens on the other units once while they are coming up, but it is repeated ad nauseam in the failed unit where it never connects.

I'm not seeing anything else in the logs that signals to a service not starting or otherwise, but it seems like postgres doesnt start locally, so therefore it cant get health?

2024-07-18T08:43:30.530867708Z stdout F 2024-07-18T08:43:30.530Z [postgresql] Exception in thread Thread-1802 (process_request_thread):
2024-07-18T08:43:30.530886993Z stdout F 2024-07-18T08:43:30.530Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:30.5308897Z stdout F 2024-07-18T08:43:30.530Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:30.531229302Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     self.run()
2024-07-18T08:43:30.531240374Z stdout F 2024-07-18T08:43:30.531Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:30.53139845Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:30.531405551Z stdout F 2024-07-18T08:43:30.531Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:30.531709129Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     request.do_handshake()
2024-07-18T08:43:30.53171517Z stdout F 2024-07-18T08:43:30.531Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:30.531993036Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:30.531999241Z stdout F 2024-07-18T08:43:30.531Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:33.54831443Z stdout F 2024-07-18T08:43:33.548Z [postgresql] Exception in thread Thread-1803 (process_request_thread):
2024-07-18T08:43:33.548336755Z stdout F 2024-07-18T08:43:33.548Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:33.54833941Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:33.548341374Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     self.run()
2024-07-18T08:43:33.548343111Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:33.54834474Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:33.548346971Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:33.548348695Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     request.do_handshake()
2024-07-18T08:43:33.548350325Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:33.548351945Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:33.54835357Z stdout F 2024-07-18T08:43:33.548Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:36.555747743Z stdout F 2024-07-18T08:43:36.555Z [postgresql] Exception in thread Thread-1804 (process_request_thread):
2024-07-18T08:43:36.555893062Z stdout F 2024-07-18T08:43:36.555Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:36.555900221Z stdout F 2024-07-18T08:43:36.555Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:36.556189845Z stdout F 2024-07-18T08:43:36.556Z [postgresql]     self.run()
2024-07-18T08:43:36.556289238Z stdout F 2024-07-18T08:43:36.556Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:36.556511996Z stdout F 2024-07-18T08:43:36.556Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:36.556518628Z stdout F 2024-07-18T08:43:36.556Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:36.556826132Z stdout F 2024-07-18T08:43:36.556Z [postgresql]     request.do_handshake()
2024-07-18T08:43:36.556887803Z stdout F 2024-07-18T08:43:36.556Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:36.557129105Z stdout F 2024-07-18T08:43:36.557Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:36.557194273Z stdout F 2024-07-18T08:43:36.557Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:38.899227117Z stderr F 2024-07-18T08:43:38.899Z [pebble] GET /v1/notices?after=2024-07-18T08%3A32%3A08.874364511Z&timeout=30s 30.000151934s 200
2024-07-18T08:43:39.571698564Z stdout F 2024-07-18T08:43:39.571Z [postgresql] Exception in thread Thread-1805 (process_request_thread):
2024-07-18T08:43:39.571720471Z stdout F 2024-07-18T08:43:39.571Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:39.571752036Z stdout F 2024-07-18T08:43:39.571Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:39.572014237Z stdout F 2024-07-18T08:43:39.571Z [postgresql]     self.run()
2024-07-18T08:43:39.572085537Z stdout F 2024-07-18T08:43:39.572Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:39.572296492Z stdout F 2024-07-18T08:43:39.572Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:39.572327798Z stdout F 2024-07-18T08:43:39.572Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:39.572598412Z stdout F 2024-07-18T08:43:39.572Z [postgresql]     request.do_handshake()
2024-07-18T08:43:39.572644153Z stdout F 2024-07-18T08:43:39.572Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:39.572947519Z stdout F 2024-07-18T08:43:39.572Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:39.572955647Z stdout F 2024-07-18T08:43:39.572Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.179252032Z stdout F 2024-07-18T08:43:40.179Z [postgresql] Exception in thread Thread-1806 (process_request_thread):
2024-07-18T08:43:40.17927862Z stdout F 2024-07-18T08:43:40.179Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:40.179310988Z stdout F 2024-07-18T08:43:40.179Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:40.179593344Z stdout F 2024-07-18T08:43:40.179Z [postgresql]     self.run()
2024-07-18T08:43:40.179601942Z stdout F 2024-07-18T08:43:40.179Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:40.179805582Z stdout F 2024-07-18T08:43:40.179Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:40.179841797Z stdout F 2024-07-18T08:43:40.179Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:40.18015769Z stdout F 2024-07-18T08:43:40.180Z [postgresql]     request.do_handshake()
2024-07-18T08:43:40.180202199Z stdout F 2024-07-18T08:43:40.180Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:40.180470562Z stdout F 2024-07-18T08:43:40.180Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:40.180539751Z stdout F 2024-07-18T08:43:40.180Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.180758071Z stderr F 2024-07-18T08:43:40.180Z [pebble] Check "postgresql" failure 403 (threshold 3): Get "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/health": read tcp 192.168.252.69:35004->192.168.252.69:8008: read: connection reset by peer
marceloneppel commented 3 months ago

Hi, @jeffreychang911 and @asbalderson! Thanks for the report. This issue was scheduled for the next pulse.

marceloneppel commented 2 months ago

Hi, @jeffreychang911 and @asbalderson! Do you have any environment we could access to reproduce this issue? I tried both on a VM and a PS6 model but couldn't reproduce it.

jeffreychang911 commented 2 months ago

I checked our test log, this issue only happened twice in July with rev 281. We didn't see that in last 90+ runs since.

jeffreychang911 commented 1 month ago

We have another run with similar symptom, non-leader node stuck in "awaiting for member to start" crashdump here - https://oil-jenkins.canonical.com/artifacts/5630363a-a6f5-491e-90b2-d4377a5ffb32/generated/generated/postgresql-k8s/crashdump-2024-09-06-01.25.24.tar.gz but debug log is bit different

unit-postgresql-k8s-1: 2024-09-06 00:29:09 ERROR unit.postgresql-k8s/1.juju-log Uncaught exception while in charm code: Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 467, in _make_request self._validate_conn(conn) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 1099, in _validate_conn conn.connect() File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 653, in connect sock_and_verified = _ssl_wrap_socket_and_match_hostname( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 806, in _ssl_wrap_socket_and_match_hostname ssl_sock = ssl_wrapsocket( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/ssl.py", line 465, in ssl_wrap_socket ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, serverhostname) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/ssl.py", line 509, in _ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/usr/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/usr/lib/python3.10/ssl.py", line 1100, in _create self.do_handshake() File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake self._sslobj.do_handshake() ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 793, in urlopen response = self._make_request( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 491, in _make_request raise new_e urllib3.exceptions.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/adapters.py", line 667, in send resp = conn.urlopen( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 847, in urlopen retries = retries.increment( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/retry.py", line 515, in increment raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type] urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='postgresql-k8s-1.postgresql-k8s-endpoints', port=8008): Max retries exceeded with url: /cluster (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/src/patroni.py", line 136, in get_primary r = requests.get(f"{url}/cluster", verify=self._verify, timeout=5) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/api.py", line 73, in get return request("get", url, params=params, kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/api.py", line 59, in request return session.request(method=method, url=url, kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/sessions.py", line 589, in request resp = self.send(prep, send_kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/sessions.py", line 703, in send r = adapter.send(request, kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/adapters.py", line 698, in send raise SSLError(e, request=request) requests.exceptions.SSLError: HTTPSConnectionPool(host='postgresql-k8s-1.postgresql-k8s-endpoints', port=8008): Max retries exceeded with url: /cluster (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)')))

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/./src/charm.py", line 1806, in main(PostgresqlOperatorCharm, use_juju_for_storage=True) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/main.py", line 544, in main manager.run() File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/main.py", line 520, in run self._emit() File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/main.py", line 506, in _emit self.framework.reemit() File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/framework.py", line 861, in reemit self._reemit() File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/framework.py", line 941, in _reemit custom_handler(event) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/./src/charm.py", line 838, in _on_postgresql_pebble_ready self.enable_disable_extensions() File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/./src/charm.py", line 581, in enable_disable_extensions if self._patroni.get_primary() is None: File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/src/patroni.py", line 133, in get_primary for attempt in Retrying(stop=stop_after_attempt(len(self._endpoints) + 1)): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/tenacity/init.py", line 435, in iter do = self.iter(retry_state=retry_state) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/tenacity/init.py", line 368, in iter result = action(retry_state) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/tenacity/init.py", line 411, in exc_check raise retry_exc from fut.exception() tenacity.RetryError: RetryError[<Future at 0x7fdd57f1d120 state=finished raised SSLError>]

taurus-forever commented 1 month ago

Just for the history, the same issue reported in https://warthogs.atlassian.net/browse/DPE-4589 It looks like it is no longer reproducible with the latest 14/stable revison 382. @jeffreychang911 can you confirm the same from your side? Tnx!

jeffreychang911 commented 22 hours ago

We are not seeing SSLErrors recently. However, "awaiting for member to start" still pops up sometimes. Here's a most recent occurrence with rev 381, https://solutions.qa.canonical.com/testruns/191337f5-d2a6-4cf5-9a9b-f9e6f654f540 pod logs available from https://oil-jenkins.canonical.com/artifacts/191337f5-d2a6-4cf5-9a9b-f9e6f654f540/generated/generated/kubernetes-openstack/juju-crashdump-kubernetes-openstack-2024-11-06-18.06.52.tar.gz