canonical / postgresql-k8s-operator

A Charmed Operator for running PostgreSQL on Kubernetes
https://charmhub.io/postgresql-k8s
Apache License 2.0

postgresql-k8s-endpoints service lost #392

Open · AmberCharitos opened this issue 5 months ago

AmberCharitos commented 5 months ago

Steps to reproduce

  1. juju deploy postgresql-k8s --trust
  2. juju scale-application postgresql-k8s 3
  3. kubectl delete svc postgresql-k8s-endpoints

Expected behavior

The service is recreated and the units are active

Actual behavior

We see the following error in the postgresql container:

psql: error: connection to server at "<ip>", port 5432 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?

The units are unable to reach active status and remain in waiting.
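
A quick way to confirm that the problem is DNS for the deleted headless Service rather than PostgreSQL itself is to probe the per-unit name from inside one of the pods. Below is a minimal sketch; the hostname, port 8008 and plain-HTTP scheme are taken from the log output further down, and running it from inside the pod network is assumed:

    import socket

    import requests  # already shipped in the charm venv

    # Per-unit DNS name served by the deleted postgresql-k8s-endpoints Service
    host = "postgresql-k8s-1.postgresql-k8s-endpoints"

    try:
        # Fails with gaierror -2 ("Name or service not known") while the Service is missing
        socket.getaddrinfo(host, 8008, proto=socket.IPPROTO_TCP)
        # Once the name resolves again, Patroni's /cluster endpoint lists the member states
        r = requests.get(f"http://{host}:8008/cluster", timeout=5)
        print(r.json())
    except (socket.gaierror, requests.exceptions.ConnectionError) as exc:
        print(f"endpoints Service still missing or not repopulated: {exc}")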

Versions

Operating system: Ubuntu 22.04

Juju CLI: 2.9.46-ubuntu-amd64

Juju agent: 3.1.6

Charm revision: 14/edge 198

Log output

unit-postgresql-k8s-0: 23:11:48 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
unit-postgresql-k8s-1: 23:11:48 DEBUG unit.postgresql-k8s/1.juju-log Starting new HTTP connection (1): postgresql-k8s-1.postgresql-k8s-endpoints:8008
unit-postgresql-k8s-1: 23:11:50 DEBUG unit.postgresql-k8s/1.juju-log Starting new HTTP connection (1): postgresql-k8s-1.postgresql-k8s-endpoints:8008
unit-postgresql-k8s-1: 23:11:50 ERROR unit.postgresql-k8s/1.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 497, in _make_request
    conn.request(
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 395, in request
    self.endheaders()
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 243, in connect
    self.sock = self._new_conn()
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connection.py", line 210, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPConnection object at 0x7f971738e2c0>: Failed to resolve 'postgresql-k8s-1.postgresql-k8s-endpoints' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='postgresql-k8s-1.postgresql-k8s-endpoints', port=8008): Max retries exceeded with url: /cluster (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f971738e2c0>: Failed to resolve 'postgresql-k8s-1.postgresql-k8s-endpoints' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/src/patroni.py", line 148, in cluster_members
    r = requests.get(f"{self._patroni_url}/cluster", verify=self._verify)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='postgresql-k8s-1.postgresql-k8s-endpoints', port=8008): Max retries exceeded with url: /cluster (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f971738e2c0>: Failed to resolve 'postgresql-k8s-1.postgresql-k8s-endpoints' ([Errno -2] Name or service not known)"))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/./src/charm.py", line 1574, in <module>
    main(PostgresqlOperatorCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/main.py", line 434, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/framework.py", line 863, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/./src/charm.py", line 370, in _on_peer_relation_changed
    self._add_members(event)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/./src/charm.py", line 530, in _add_members
    if self._patroni.cluster_members == self._hosts:
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/var/lib/juju/agents/unit-postgresql-k8s-1/charm/venv/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7f971738edd0 state=finished raised ConnectionError>]
unit-postgresql-k8s-1: 23:11:50 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
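
For context on the traceback: the failing call is cluster_members in src/patroni.py, which queries Patroni's /cluster REST endpoint through the per-unit DNS name and is wrapped in tenacity retries, so the hook ultimately fails with RetryError rather than the underlying ConnectionError. A rough sketch of that shape (the retry policy and the member parsing here are assumptions, not the charm's actual values):

    import requests
    from tenacity import retry, stop_after_attempt, wait_fixed

    # Illustrative only: once every attempt fails to resolve the endpoints name,
    # tenacity re-raises as RetryError and the hook exits non-zero
    # (exit status 1 in the log above).
    @retry(stop=stop_after_attempt(5), wait=wait_fixed(3))  # hypothetical policy
    def cluster_members(patroni_url: str) -> set:
        r = requests.get(f"{patroni_url}/cluster", timeout=5)
        return {member["name"] for member in r.json()["members"]}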

Additional context

Matrix conversation

github-actions[bot] commented 5 months ago

https://warthogs.atlassian.net/browse/DPE-3565

taurus-forever commented 4 months ago

I have converted this from a bug to an enhancement. At the moment Juju is responsible for the K8s resources, and the charm does not re-create them after bootstrap. I can see valid scenarios in which K8s resources (services, in this case) are lost and automated recovery would help, but it needs to be properly planned and implemented. No quick fix is expected here.
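
For reference, the automated recovery being discussed would amount to noticing that the Service is gone and re-creating it. A rough sketch of that pattern using lightkube (commonly used in Kubernetes charms); this is not what the charm does today, and the Service spec below, a headless Service selecting the application's pods on the PostgreSQL and Patroni ports, is an assumption that may not match what Juju generates at bootstrap:

    from lightkube import Client
    from lightkube.core.exceptions import ApiError
    from lightkube.models.core_v1 import ServicePort, ServiceSpec
    from lightkube.models.meta_v1 import ObjectMeta
    from lightkube.resources.core_v1 import Service


    def ensure_endpoints_service(namespace: str, app: str = "postgresql-k8s") -> None:
        """Re-create the headless <app>-endpoints Service if it has been deleted."""
        client = Client()
        name = f"{app}-endpoints"
        try:
            client.get(Service, name=name, namespace=namespace)
            return  # Service still exists, nothing to do
        except ApiError as exc:
            if exc.status.code != 404:
                raise
        # Assumed spec: headless Service selecting the application's pods on the
        # PostgreSQL and Patroni ports. The real spec Juju creates at bootstrap
        # may differ (labels, publishNotReadyAddresses, ownership, ...).
        svc = Service(
            metadata=ObjectMeta(name=name, namespace=namespace),
            spec=ServiceSpec(
                clusterIP="None",
                selector={"app.kubernetes.io/name": app},
                ports=[
                    ServicePort(name="postgresql", port=5432),
                    ServicePort(name="patroni", port=8008),
                ],
            ),
        )
        client.create(svc)

Part of the planning mentioned above would be deciding whether such re-creation belongs in the charm or in Juju itself, since Juju owns the original Service and would also need to own any re-created one.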