canonical / mongodb-k8s-operator

Operator charm for MongoDB
Apache License 2.0

non-leader unit stuck with hook failed: "certificates-relation-changed" for self-signed-certificates:certificates #268

Open jeffreychang911 opened 1 month ago

jeffreychang911 commented 1 month ago

Steps to reproduce

  1. SolQA deploys Charmed Kubernetes 1.28 on AWS, and then mongodb-k8s.
  2. Two of the three mongodb-k8s units are blocked with "hook failed: "certificates-relation-changed" for self-signed-certificates:certificates", and they do not settle before the 1-hour timeout.

Expected behavior

Actual behavior

Versions

Operating system:

Juju CLI: 3.5.2

Juju agent: 3.5.2

Charm revision: rev 43 on 6/edge

charmed kubernetes 1.28

Log output

Juju debug log:

unit-self-signed-certificates-0: 2024-07-20 02:47:01 INFO juju.worker.uniter.operation ran "certificates-relation-changed" hook (via hook dispatching script: dispatch)
unit-mongodb-k8s-0: 2024-07-20 02:47:02 INFO unit.mongodb-k8s/0.juju-log certificates:1: Restarting mongod with TLS enabled.
unit-mongodb-k8s-0: 2024-07-20 02:47:02 INFO unit.mongodb-k8s/0.juju-log certificates:1: Deleting TLS certificate from workload container
unit-mongodb-k8s-0: 2024-07-20 02:47:02 ERROR unit.mongodb-k8s/0.juju-log certificates:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/./src/charm.py", line 1245, in <module>
    main(MongoDBCharm)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/main.py", line 548, in main
    manager.run()
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/main.py", line 527, in run
    self._emit()
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/main.py", line 516, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/lib/charms/tls_certificates_interface/v3/tls_certificates.py", line 1900, in _on_relation_changed
    self.on.certificate_available.emit(
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/lib/charms/mongodb/v0/mongodb_tls.py", line 225, in _on_certificate_available
    self.charm.delete_tls_certificate_from_workload()
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/./src/charm.py", line 1058, in delete_tls_certificate_from_workload
    container.remove_path(f"{Config.CONF_DIR}/{file}")
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/model.py", line 2785, in remove_path
    self._pebble.remove_path(str(path), recursive=recursive)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/pebble.py", line 2529, in remove_path
    resp = self._request('POST', '/v1/files', None, body)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/pebble.py", line 1859, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-mongodb-k8s-0/charm/venv/ops/pebble.py", line 1912, in _request_raw
    raise ConnectionError(
ops.pebble.ConnectionError: Could not connect to Pebble: socket not found at '/charm/containers/mongod/pebble.socket' (container restarted?)
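
The traceback above ends with `container.remove_path` raising `ops.pebble.ConnectionError` because the Pebble socket of the `mongod` container was not present. One common way to keep a hook from failing in that situation is to check `Container.can_connect()` first and defer the event when Pebble is unreachable. The following is only a minimal sketch of that pattern, not the change made in the charm: `PebbleConnectionError`, `FakeContainer`, `FakeEvent`, `delete_tls_files`, the `/conf` directory and the file name are all hypothetical stand-ins so that the snippet runs without the `ops` package.

```python
class PebbleConnectionError(Exception):
    """Hypothetical stand-in for ops.pebble.ConnectionError."""


class FakeContainer:
    """Hypothetical stand-in for ops.Container with a switchable Pebble socket."""

    def __init__(self, reachable, files):
        self.reachable = reachable
        self.files = set(files)

    def can_connect(self):
        # The real Container.can_connect() reports whether Pebble answers.
        return self.reachable

    def remove_path(self, path):
        if not self.reachable:
            raise PebbleConnectionError(f"socket not found while removing {path}")
        self.files.discard(path)


class FakeEvent:
    """Hypothetical stand-in for an ops event; records whether defer() ran."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def delete_tls_files(container, event, conf_dir, names):
    """Remove TLS files, deferring the event if Pebble is unreachable.

    Returns True when every file was removed, False when the event was deferred.
    """
    if not container.can_connect():
        event.defer()
        return False
    try:
        for name in names:
            container.remove_path(f"{conf_dir}/{name}")
    except PebbleConnectionError:
        # The container can still restart between can_connect() and the call.
        event.defer()
        return False
    return True


if __name__ == "__main__":
    event = FakeEvent()
    restarting = FakeContainer(reachable=False, files={"/conf/example.pem"})
    removed = delete_tls_files(restarting, event, "/conf", ["example.pem"])
    print(removed, event.deferred)
```

With an unreachable container the helper returns `False` and marks the event as deferred, so the hook would be retried later rather than ending in an uncaught exception. Whether deferring is appropriate for this particular handler is an assumption that would need checking against the charm's TLS logic.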

Additional context

SolQA testrun - https://solutions.qa.canonical.com/testruns/5cac57f9-8c93-43e5-bc0e-bbbbf1098c82
Juju crashdump - https://oil-jenkins.canonical.com/artifacts/5cac57f9-8c93-43e5-bc0e-bbbbf1098c82/generated/generated/mongodb-k8s/crashdump-2024-07-20-03.49.02.tar.gz

github-actions[bot] commented 1 month ago

https://warthogs.atlassian.net/browse/DPE-4904

Gu1nness commented 1 month ago

I took the time to investigate and gathered some information. We just merged a PR that should fix it: https://github.com/canonical/mongodb-k8s-operator/pull/288. Once it is released and a new revision is available for deployment, we can retry the bench, @jeffreychang911.

jeffreychang911 commented 3 weeks ago

Tested with revision 50 in this run, and crashdump.

I still see the same error as in the original description, plus a new error, shown below:

unit-mongodb-k8s-2: 2024-08-20 19:40:51 ERROR unit.mongodb-k8s/2.juju-log certificates:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/./src/charm.py", line 1555, in <module>
    main(MongoDBCharm)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/main.py", line 551, in main
    manager.run()
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/main.py", line 530, in run
    self._emit()
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/main.py", line 519, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/lib/charms/tls_certificates_interface/v3/tls_certificates.py", line 1911, in _on_relation_changed
    self.on.certificate_available.emit(
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/lib/charms/mongodb/v1/mongodb_tls.py", line 228, in _on_certificate_available
    self.charm.restart_charm_services()
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/./src/charm.py", line 1212, in restart_charm_services
    container.replan()
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/model.py", line 2259, in replan
    self._pebble.replan_services()
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/pebble.py", line 2129, in replan_services
    return self._services_action('replan', [], timeout, delay)
  File "/var/lib/juju/agents/unit-mongodb-k8s-2/charm/venv/ops/pebble.py", line 2226, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks: