Error while deleting pod (the `can_connect` guard is not enough)

Abuelodelanada commented 2 years ago

Bug Description

Let's say we need to emulate the death of a pod that uses LokiPushApiConsumer (grafana-agent-k8s). To do that, we delete the pod by running kubectl delete pod...

Juju will re-create the pod, but we get a stack trace in the log, which causes an error in some integration tests.

To Reproduce

juju add-model paka
charmcraft pack (grafana-agent-k8s-operator with the latest version of the LokiPushApi lib)
juju deploy ./grafana-agent-k8s_ubuntu-20.04-amd64.charm --resource agent-image=grafana/agent:v0.20.1
microk8s.kubectl delete pod -n paka grafana-agent-k8s-0
Check juju debug-log

Environment

juju: 2.9.29
microk8s: microk8s.kubectl delete pod -n paka grafana-agent-k8s-0
grafana-agent: https://github.com/canonical/grafana-agent-k8s-operator/pull/44

Relevant log output

unit-grafana-agent-k8s-0: 09:56:31 ERROR unit.grafana-agent-k8s/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1315, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/model.py", line 1152, in restart
    self._pebble.restart_services(service_names)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1462, in restart_services
    return self._services_action('restart', services, timeout, delay)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1475, in _services_action
    resp = self._request('POST', '/v1/services', body=body)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1281, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1326, in _request_raw
    raise APIError(body, code, status, message)
ops.pebble.APIError: cannot restart services: service "agent" does not exist

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1315, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 361, in <module>
    main(GrafanaAgentOperatorCharm)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/main.py", line 431, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/framework.py", line 283, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/framework.py", line 743, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/framework.py", line 790, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/lib/charms/loki_k8s/v0/loki_push_api.py", line 1705, in _on_lifecycle_event
    self.on.loki_push_api_endpoint_joined.emit()
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/framework.py", line 283, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/framework.py", line 743, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/framework.py", line 790, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 117, in _on_loki_push_api_endpoint_joined
    self._update_config(event)
  File "./src/charm.py", line 192, in _update_config
    self._container.restart(self._name)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/model.py", line 1162, in restart
    self._pebble.start_services(service_names)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1422, in start_services
    return self._services_action('start', services, timeout, delay)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1475, in _services_action
    resp = self._request('POST', '/v1/services', body=body)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1281, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-grafana-agent-k8s-0/charm/venv/ops/pebble.py", line 1326, in _request_raw
    raise APIError(body, code, status, message)
ops.pebble.APIError: cannot start services: service "agent" does not exist
unit-grafana-agent-k8s-0: 09:56:31 ERROR juju.worker.uniter.operation hook "upgrade-charm" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

As far as I understand, the problem starts in the method _on_lifecycle_event, when it executes self.on.loki_push_api_endpoint_joined.emit(). That method runs on upgrade_charm.

The charm observes this event with the method _on_loki_push_api_endpoint_joined, which calls self._update_config(event):

    def _update_config(self, event=None):
        if not self._container.can_connect():
            # Pebble is not ready yet so no need to update config
            self.unit.status = WaitingStatus("waiting for agent container to start")
            return

        config = self._config_file()
        old_config = None

        try:
            old_config = yaml.safe_load(self._container.pull(CONFIG_PATH))
        except (FileNotFoundError, PathError):
            # If the file does not yet exist, pebble_ready has not run yet,
            # and we may be processing a deferred event
            pass

        try:
            if config != old_config:
                self._container.push(CONFIG_PATH, yaml.dump(config), make_dirs=True)
                # FIXME: change this to self._reload_config when #19 is fixed
                # Restart the service to pick up the new config
                self._container.restart(self._name)
                self.unit.status = ActiveStatus()
        except GrafanaAgentReloadError as e:
            self.unit.status = BlockedStatus(str(e))

We can avoid the stack trace by catching the APIError exception:

...
        try:
            if config != old_config:
                self._container.push(CONFIG_PATH, yaml.dump(config), make_dirs=True)
                # FIXME: change this to self._reload_config when #19 is fixed
                # Restart the service to pick up the new config
                self._container.restart(self._name)
                self.unit.status = ActiveStatus()
        except GrafanaAgentReloadError as e:
            self.unit.status = BlockedStatus(str(e))
        except APIError as e:
            self.unit.status = WaitingStatus(str(e))
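
An alternative to catching the exception would be to strengthen the guard itself. The log suggests that Pebble in the re-created pod is already reachable (so can_connect() passes) while the "agent" service has not been added to its plan yet. The sketch below checks for the service before restarting it. It is only an illustration: FakeContainer is a hypothetical stand-in for the real container object so that the snippet runs on its own, and service_is_known and restart_if_known are made-up helper names. It assumes the real container offers get_services(*names) returning a mapping that omits unknown services.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeContainer:
    """Hypothetical stand-in for the charm's container, for illustration only."""

    connectable: bool = True
    services: Dict[str, object] = field(default_factory=dict)
    restarted: List[str] = field(default_factory=list)

    def can_connect(self) -> bool:
        return self.connectable

    def get_services(self, *names: str) -> Dict[str, object]:
        # Assumption: only services already known to Pebble are returned.
        return {n: s for n, s in self.services.items() if not names or n in names}

    def restart(self, *names: str) -> None:
        # Mimics the failure seen in the log when the service is unknown.
        missing = [n for n in names if n not in self.services]
        if missing:
            raise RuntimeError(f'service "{missing[0]}" does not exist')
        self.restarted.extend(names)


def service_is_known(container, name: str) -> bool:
    """True only if Pebble is reachable and already knows the service `name`."""
    if not container.can_connect():
        return False
    return name in container.get_services(name)


def restart_if_known(container, name: str) -> bool:
    """Restart `name` if it exists; return False (and do nothing) otherwise."""
    if not service_is_known(container, name):
        return False
    container.restart(name)
    return True
```

In _update_config, a False result could map to the same WaitingStatus that the can_connect() branch already sets, leaving the restart to a later event once the service has been defined.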

But the question is: should we emit loki_push_api_endpoint_joined on the upgrade_charm event at all?

Abuelodelanada commented 2 years ago

We can see this error in https://github.com/canonical/grafana-agent-k8s-operator/pull/44

And the log: https://github.com/canonical/grafana-agent-k8s-operator/runs/6328555492?check_suite_focus=true