Open jeffreychang911 opened 3 months ago
From another testrun that hit the same error on the unit that fails to connect we have this log message where the connection is getting reset by peer while trying to connect.
This reset connection happens on the other units once while they are coming up, but it is repeated ad nauseam in the failed unit where it never connects.
I'm not seeing anything else in the logs that signals to a service not starting or otherwise, but it seems like postgres doesnt start locally, so therefore it cant get health?
2024-07-18T08:43:30.530867708Z stdout F 2024-07-18T08:43:30.530Z [postgresql] Exception in thread Thread-1802 (process_request_thread):
2024-07-18T08:43:30.530886993Z stdout F 2024-07-18T08:43:30.530Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:30.5308897Z stdout F 2024-07-18T08:43:30.530Z [postgresql] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:30.531229302Z stdout F 2024-07-18T08:43:30.531Z [postgresql] self.run()
2024-07-18T08:43:30.531240374Z stdout F 2024-07-18T08:43:30.531Z [postgresql] File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:30.53139845Z stdout F 2024-07-18T08:43:30.531Z [postgresql] self._target(*self._args, **self._kwargs)
2024-07-18T08:43:30.531405551Z stdout F 2024-07-18T08:43:30.531Z [postgresql] File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:30.531709129Z stdout F 2024-07-18T08:43:30.531Z [postgresql] request.do_handshake()
2024-07-18T08:43:30.53171517Z stdout F 2024-07-18T08:43:30.531Z [postgresql] File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:30.531993036Z stdout F 2024-07-18T08:43:30.531Z [postgresql] self._sslobj.do_handshake()
2024-07-18T08:43:30.531999241Z stdout F 2024-07-18T08:43:30.531Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:33.54831443Z stdout F 2024-07-18T08:43:33.548Z [postgresql] Exception in thread Thread-1803 (process_request_thread):
2024-07-18T08:43:33.548336755Z stdout F 2024-07-18T08:43:33.548Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:33.54833941Z stdout F 2024-07-18T08:43:33.548Z [postgresql] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:33.548341374Z stdout F 2024-07-18T08:43:33.548Z [postgresql] self.run()
2024-07-18T08:43:33.548343111Z stdout F 2024-07-18T08:43:33.548Z [postgresql] File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:33.54834474Z stdout F 2024-07-18T08:43:33.548Z [postgresql] self._target(*self._args, **self._kwargs)
2024-07-18T08:43:33.548346971Z stdout F 2024-07-18T08:43:33.548Z [postgresql] File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:33.548348695Z stdout F 2024-07-18T08:43:33.548Z [postgresql] request.do_handshake()
2024-07-18T08:43:33.548350325Z stdout F 2024-07-18T08:43:33.548Z [postgresql] File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:33.548351945Z stdout F 2024-07-18T08:43:33.548Z [postgresql] self._sslobj.do_handshake()
2024-07-18T08:43:33.54835357Z stdout F 2024-07-18T08:43:33.548Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:36.555747743Z stdout F 2024-07-18T08:43:36.555Z [postgresql] Exception in thread Thread-1804 (process_request_thread):
2024-07-18T08:43:36.555893062Z stdout F 2024-07-18T08:43:36.555Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:36.555900221Z stdout F 2024-07-18T08:43:36.555Z [postgresql] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:36.556189845Z stdout F 2024-07-18T08:43:36.556Z [postgresql] self.run()
2024-07-18T08:43:36.556289238Z stdout F 2024-07-18T08:43:36.556Z [postgresql] File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:36.556511996Z stdout F 2024-07-18T08:43:36.556Z [postgresql] self._target(*self._args, **self._kwargs)
2024-07-18T08:43:36.556518628Z stdout F 2024-07-18T08:43:36.556Z [postgresql] File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:36.556826132Z stdout F 2024-07-18T08:43:36.556Z [postgresql] request.do_handshake()
2024-07-18T08:43:36.556887803Z stdout F 2024-07-18T08:43:36.556Z [postgresql] File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:36.557129105Z stdout F 2024-07-18T08:43:36.557Z [postgresql] self._sslobj.do_handshake()
2024-07-18T08:43:36.557194273Z stdout F 2024-07-18T08:43:36.557Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:38.899227117Z stderr F 2024-07-18T08:43:38.899Z [pebble] GET /v1/notices?after=2024-07-18T08%3A32%3A08.874364511Z&timeout=30s 30.000151934s 200
2024-07-18T08:43:39.571698564Z stdout F 2024-07-18T08:43:39.571Z [postgresql] Exception in thread Thread-1805 (process_request_thread):
2024-07-18T08:43:39.571720471Z stdout F 2024-07-18T08:43:39.571Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:39.571752036Z stdout F 2024-07-18T08:43:39.571Z [postgresql] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:39.572014237Z stdout F 2024-07-18T08:43:39.571Z [postgresql] self.run()
2024-07-18T08:43:39.572085537Z stdout F 2024-07-18T08:43:39.572Z [postgresql] File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:39.572296492Z stdout F 2024-07-18T08:43:39.572Z [postgresql] self._target(*self._args, **self._kwargs)
2024-07-18T08:43:39.572327798Z stdout F 2024-07-18T08:43:39.572Z [postgresql] File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:39.572598412Z stdout F 2024-07-18T08:43:39.572Z [postgresql] request.do_handshake()
2024-07-18T08:43:39.572644153Z stdout F 2024-07-18T08:43:39.572Z [postgresql] File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:39.572947519Z stdout F 2024-07-18T08:43:39.572Z [postgresql] self._sslobj.do_handshake()
2024-07-18T08:43:39.572955647Z stdout F 2024-07-18T08:43:39.572Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.179252032Z stdout F 2024-07-18T08:43:40.179Z [postgresql] Exception in thread Thread-1806 (process_request_thread):
2024-07-18T08:43:40.17927862Z stdout F 2024-07-18T08:43:40.179Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:40.179310988Z stdout F 2024-07-18T08:43:40.179Z [postgresql] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:40.179593344Z stdout F 2024-07-18T08:43:40.179Z [postgresql] self.run()
2024-07-18T08:43:40.179601942Z stdout F 2024-07-18T08:43:40.179Z [postgresql] File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:40.179805582Z stdout F 2024-07-18T08:43:40.179Z [postgresql] self._target(*self._args, **self._kwargs)
2024-07-18T08:43:40.179841797Z stdout F 2024-07-18T08:43:40.179Z [postgresql] File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:40.18015769Z stdout F 2024-07-18T08:43:40.180Z [postgresql] request.do_handshake()
2024-07-18T08:43:40.180202199Z stdout F 2024-07-18T08:43:40.180Z [postgresql] File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:40.180470562Z stdout F 2024-07-18T08:43:40.180Z [postgresql] self._sslobj.do_handshake()
2024-07-18T08:43:40.180539751Z stdout F 2024-07-18T08:43:40.180Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.180758071Z stderr F 2024-07-18T08:43:40.180Z [pebble] Check "postgresql" failure 403 (threshold 3): Get "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/health": read tcp 192.168.252.69:35004->192.168.252.69:8008: read: connection reset by peer
Hi, @jeffreychang911 and @asbalderson! Thanks for the report. This issue was scheduled for the next pulse.
Hi, @jeffreychang911 and @asbalderson! Do you have any environment we could access to reproduce this issue? I tried both on a VM and a PS6 model but couldn't reproduce it.
I checked our test log, this issue only happened twice in July with rev 281. We didn't see that in last 90+ runs since.
We have another run with similar symptom, non-leader node stuck in "awaiting for member to start" crashdump here - https://oil-jenkins.canonical.com/artifacts/5630363a-a6f5-491e-90b2-d4377a5ffb32/generated/generated/postgresql-k8s/crashdump-2024-09-06-01.25.24.tar.gz but debug log is bit different
unit-postgresql-k8s-1: 2024-09-06 00:29:09 ERROR unit.postgresql-k8s/1.juju-log Uncaught exception while in charm code: Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 467, in _make_request self._validate_conn(conn) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 1099, in _validate_conn conn.connect() File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 653, in connect sock_and_verified = _ssl_wrap_socket_and_match_hostname( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 806, in _ssl_wrap_socket_and_match_hostname ssl_sock = ssl_wrapsocket( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/ssl.py", line 465, in ssl_wrap_socket ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, serverhostname) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/ssl.py", line 509, in _ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/usr/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/usr/lib/python3.10/ssl.py", line 1100, in _create self.do_handshake() File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake self._sslobj.do_handshake() ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 793, in urlopen response = self._make_request( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 491, in _make_request raise new_e urllib3.exceptions.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)
The above exception was the direct cause of the following exception:
Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/adapters.py", line 667, in send resp = conn.urlopen( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 847, in urlopen retries = retries.increment( File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/retry.py", line 515, in increment raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type] urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='postgresql-k8s-1.postgresql-k8s-endpoints', port=8008): Max retries exceeded with url: /cluster (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/src/patroni.py", line 136, in get_primary r = requests.get(f"{url}/cluster", verify=self._verify, timeout=5) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/api.py", line 73, in get return request("get", url, params=params, kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/api.py", line 59, in request return session.request(method=method, url=url, kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/sessions.py", line 589, in request resp = self.send(prep, send_kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/sessions.py", line 703, in send r = adapter.send(request, kwargs) File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/adapters.py", line 698, in send raise SSLError(e, request=request) requests.exceptions.SSLError: HTTPSConnectionPool(host='postgresql-k8s-1.postgresql-k8s-endpoints', port=8008): Max retries exceeded with url: /cluster (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1007)')))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/./src/charm.py", line 1806, in
Just for the history, the same issue reported in https://warthogs.atlassian.net/browse/DPE-4589 It looks like it is no longer reproducible with the latest 14/stable revison 382. @jeffreychang911 can you confirm the same from your side? Tnx!
We are not seeing SSLErrors recently. However, "awaiting for member to start" still pops up sometimes. Here's a most recent occurrence with rev 381, https://solutions.qa.canonical.com/testruns/191337f5-d2a6-4cf5-9a9b-f9e6f654f540 pod logs available from https://oil-jenkins.canonical.com/artifacts/191337f5-d2a6-4cf5-9a9b-f9e6f654f540/generated/generated/kubernetes-openstack/juju-crashdump-kubernetes-openstack-2024-11-06-18.06.52.tar.gz
Steps to reproduce
Expected behavior
Postgresql-k8s charm should settle shortly after deploy.
Actual behavior
Juju status found one unit stuck in waiting until timeout in 1 hr. Unit Workload Agent Address Ports Message data-integrator/0 active idle 192.168.254.204 postgresql-k8s/0 waiting executing 192.168.252.201 awaiting for member to start postgresql-k8s/1 active idle 192.168.253.201 Primary postgresql-k8s/2 active idle 192.168.254.203 self-signed-certificates/0* active idle 192.168.252.200
Only found one ERROR from juju debug-log unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)
Versions
Operating system: Jammy
Juju CLI: 3.5.2
Juju agent: 3.5.2
Charm revision: postgresql-k8s charm rev 281
Charmed Kubernetes 1.30/beta, and would be 1.30/stable soon without change.
Log output
Juju debug log: unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)
Additional context
This is found in a SolQA run, https://solutions.qa.canonical.com/testruns/5dc43cf9-2211-4b4c-9a69-a39d4d61176e Crashdump - https://oil-jenkins.canonical.com/artifacts/5dc43cf9-2211-4b4c-9a69-a39d4d61176e/generated/generated/postgresql-k8s/crashdump-2024-07-11-08.49.08.tar.gz