canonical / vault-k8s-operator

Vault secures, stores, and tightly controls access to tokens, passwords, certificates, and encryption keys for protecting secrets and other sensitive data.
https://charmhub.io/vault-k8s
Apache License 2.0

tls-certificates-access-relation-changed hook fails #303

Closed gruyaume closed 6 months ago

gruyaume commented 6 months ago

Describe the bug

An error during the tls-certificates-access-relation-changed hook causes the charm to go into an error state. The error is observed intermittently in the integration test named test_given_vault_deployed_when_tls_access_relation_created_then_existing_certificate_replaced.

To Reproduce

  1. Run integration tests multiple times

Expected behavior

No error

Logs

Juju status

INFO     pytest_operator.plugin:plugin.py:834 Model status:

Model                  Controller                Cloud/Region        Version  SLA          Timestamp
test-integration-yb3i  github-pr-f7751-microk8s  microk8s/localhost  3.4.2    unsupported  00:29:38Z

App                        Version                Status       Scale  Charm                      Channel        Rev  Address         Exposed  Message
loki-k8s                   2.9.4                  waiting        0/1  loki-k8s                   latest/stable  124  10.152.183.242  no       installing agent
minio                      res:oci-image@1755999  active           1  minio                      latest/stable  277  10.152.183.188  no       
prometheus-k8s             2.49.1                 waiting        1/0  prometheus-k8s             latest/stable  171  10.152.183.80   no       installing agent
s3-integrator                                     maintenance      1  s3-integrator              latest/stable   13  10.152.183.224  no       stopping charm software
self-signed-certificates                          unknown          0  self-signed-certificates   latest/stable   72  10.152.183.60   no       
tls-certificates-requirer                         unknown          0  tls-certificates-requirer  latest/stable   59  10.152.183.42   no       
vault-k8s                                         waiting          3  vault-k8s                                   0  10.152.183.134  no       installing agent
vault-kv-requirer                                 unknown          0  vault-kv-requirer                           0  10.152.183.26   no       

Unit               Workload     Agent      Address       Ports          Message
loki-k8s/0*        unknown      lost       10.1.236.105                 agent lost, see 'juju show-status-log loki-k8s/0'
minio/0*           active       idle       10.1.236.107  9000-9001/TCP  
prometheus-k8s/0*  waiting      executing  10.1.236.108                 (config-changed) Waiting for resource limit patch to apply
s3-integrator/0*   maintenance  executing  10.1.236.101                 (stop) stopping charm software
vault-k8s/0*       blocked      idle       10.1.236.80                  Please unseal Vault
vault-k8s/1        blocked      idle       10.1.236.89                  Please unseal Vault
vault-k8s/2        error        idle       10.1.236.88                  hook failed: "tls-certificates-access-relation-changed"

Juju debug-logs

unit-vault-k8s-2: 2024-04-13 00:28:13 ERROR unit.vault-k8s/2.juju-log tls-certificates-access:4: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connectionpool.py", line 492, in _make_request
    raise new_e
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connectionpool.py", line 468, in _make_request
    self._validate_conn(conn)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connectionpool.py", line 1097, in _validate_conn
    conn.connect()
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connection.py", line 611, in connect
    self.sock = sock = self._new_conn()
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connection.py", line 218, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f7f7b29ca00>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='vault-k8s-0.vault-k8s-endpoints.test-integration-yb3i.svc.cluster.local', port=8200): Max retries exceeded with url: /v1/auth/approle/login (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f7f7b29ca00>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/./src/charm.py", line 1345, in <module>
    main(VaultCharm)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/ops/main.py", line 511, in _emit
    ops.charm._evaluate_status(self.charm)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/ops/charm.py", line 1228, in _evaluate_status
    charm.on.collect_unit_status.emit()
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/./src/charm.py", line 227, in _on_collect_status
    if not self._get_active_vault_client():
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/./src/charm.py", line 1174, in _get_active_vault_client
    if not vault.authenticate(AppRole(role_id, secret_id)):
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/lib/charms/vault_k8s/v0/vault_client.py", line 111, in authenticate
    auth_details.login(self._client)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/lib/charms/vault_k8s/v0/vault_client.py", line 62, in login
    client.auth.approle.login(role_id=self.role_id, secret_id=self.secret_id, use_token=True)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/hvac/api/auth_methods/approle.py", line 506, in login
    return self._adapter.login(
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/hvac/adapters.py", line 230, in login
    response = self.post(url, **kwargs)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/hvac/adapters.py", line 159, in post
    return self.request("post", url, **kwargs)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/hvac/adapters.py", line 408, in request
    response = super().request(*args, **kwargs)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/hvac/adapters.py", line 367, in request
    response = self.session.request(
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/requests/sessions.py", line 725, in send
    history = [resp for resp in gen]
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/requests/sessions.py", line 725, in <listcomp>
    history = [resp for resp in gen]
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/requests/sessions.py", line 266, in resolve_redirects
    resp = self.send(
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-vault-k8s-2/charm/venv/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='vault-k8s-0.vault-k8s-endpoints.test-integration-yb3i.svc.cluster.local', port=8200): Max retries exceeded with url: /v1/auth/approle/login (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f7f7b29ca00>: Failed to establish a new connection: [Errno 111] Connection refused'))
unit-vault-k8s-2: 2024-04-13 00:28:13 ERROR juju.worker.uniter.operation hook "tls-certificates-access-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

Environment

kayra1 commented 6 months ago

Looks like this crash is happening in collect-status right after the TLS integration is gone. My initial impression is that when tls-certificates-access-relation-changed fires, the workload is restarted, and collect-status immediately tries to connect to a workload that is still restarting. This doesn't explain why it sometimes works, though. Maybe the request happens so close to the Pebble call that an HTTP request gets through before k8s/Pebble can schedule the restart.
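
One way to picture the race described above is a reachability probe that collect-status could run before trying to authenticate. This is only an illustrative sketch, not the charm's code: `vault_api_is_reachable` is a hypothetical helper, and the default port 8200 is taken from the traceback. It uses a plain TCP connect from the Python standard library.

```python
import socket


def vault_api_is_reachable(host: str, port: int = 8200, timeout: float = 2.0) -> bool:
    """Hypothetical helper: return True if a TCP connection to host:port can be opened."""
    try:
        # Open and immediately close a TCP connection; no TLS or HTTP is attempted.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers ConnectionRefusedError, timeouts, and DNS resolution failures.
        return False
```

A probe like this only narrows the window. The workload can still restart between the probe and the login request, so the login call would still need its own error handling.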

Connection Refused is currently being caught at the client level, but it seems Python is actually raising a requests.exceptions.ConnectionError to the client code. That is the exception that should probably be caught in Vault.authenticate.
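
A minimal sketch of what catching that exception in `authenticate` might look like follows. The `Vault` and `AppRole` classes here are trimmed-down, hypothetical stand-ins that only mirror the names and the `login` call visible in the traceback; the real library's signatures and logging may differ. Note that `requests.exceptions.ConnectionError` is not a subclass of Python's built-in `ConnectionError`, so an `except ConnectionRefusedError` clause would not catch it.

```python
import logging
from dataclasses import dataclass

import requests

logger = logging.getLogger(__name__)


@dataclass
class AppRole:
    """Hypothetical stand-in for the charm library's AppRole credentials."""

    role_id: str
    secret_id: str

    def login(self, client) -> None:
        # Same hvac call that appears in the traceback.
        client.auth.approle.login(
            role_id=self.role_id, secret_id=self.secret_id, use_token=True
        )


class Vault:
    """Hypothetical, trimmed-down stand-in for the charm's Vault client wrapper."""

    def __init__(self, client):
        self._client = client  # an hvac.Client in the real charm (assumption)

    def authenticate(self, auth_details: AppRole) -> bool:
        """Return True on successful login, False if Vault cannot be reached."""
        try:
            auth_details.login(self._client)
        except requests.exceptions.ConnectionError as e:
            # Raised by requests when the socket connection is refused,
            # for example while the workload is restarting.
            logger.warning("Failed to reach Vault while authenticating: %s", e)
            return False
        return True
```

With this shape, the existing `if not vault.authenticate(AppRole(role_id, secret_id)):` check in `_get_active_vault_client` would see a falsy result instead of an uncaught exception, so collect-status could report a waiting or blocked status rather than failing the hook.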