canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

KFP API: Pebble service fails to start #220

Closed phoevos closed 1 year ago

phoevos commented 1 year ago

We're bumping into this issue intermittently on a clean install with the 2.0/edge (rev 413) version of the charm, when deploying the Kubeflow bundle (1.7/edge) either on MicroK8s or Charmed Kubernetes.

On startup, the KFP API Server tries to connect to MinIO. If MinIO is not yet available, the connection fails with the following error, causing the service to crash less than 1 second after it started.

Failed to check if Minio bucket exists. Error: Get "http://minio.kubeflow:9000/mlpipeline/?location=": dial tcp 10.152.183.92:9000: connect: connection refused

Because the service fails fast (in under 1 second), Pebble considers it inactive:

$ pebble services       
Service                 Startup  Current   Since
ml-pipeline-api-server  enabled  inactive  -

This is a design decision on the Pebble side as explained here:

When starting a service, Pebble executes the service's command, and waits 1 second to ensure the command doesn't exit too quickly. Assuming the command doesn't exit within that time window, the start is considered successful, otherwise pebble start will exit with an error.

Because the service was never considered active, Pebble never attempts to restart it, despite the failing health checks.
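The semantics described above can be reproduced with a small standalone simulation. This is an illustration only, not Pebble's actual implementation: a command that exits within the ~1 second "okay wait" window is classified as inactive and never retried.

```python
import subprocess
import sys
import time

def start_service(command):
    """Illustrative sketch of Pebble's start semantics (not its real code)."""
    proc = subprocess.Popen(command)
    time.sleep(1)  # Pebble's ~1 second okay-wait window
    if proc.poll() is not None:
        return "inactive"  # exited too quickly; never restarted
    proc.terminate()
    proc.wait()
    return "active"

# A command that fails immediately, like the API server when MinIO is
# unreachable, ends up "inactive":
print(start_service([sys.executable, "-c", "raise SystemExit(1)"]))  # inactive
# A long-running command survives the window and is "active":
print(start_service([sys.executable, "-c", "import time; time.sleep(5)"]))  # active
```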

Workaround

Since this issue is exposed by a race (i.e. MinIO not yet available), it won't come up every time. If it does occur during deployment (after the rest of the bundle has been installed successfully), however, we need to start the API Server service manually to unblock it:

juju ssh kfp-api/0 "PEBBLE_SOCKET=/charm/containers/ml-pipeline-api-server/pebble.socket /charm/bin/pebble replan"

Mitigation

We need to come up with a plan to avoid bumping into this issue in the future. There are a couple of things that could be done on our side.
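One option on the charm side would be to wait until MinIO's endpoint actually accepts connections before starting the service. A minimal sketch of that idea (the host/port and retry parameters here are illustrative, not the charm's actual code):

```python
import socket
import time

def wait_for_endpoint(host, port, retries=5, delay=1.0):
    """Poll a TCP endpoint (e.g. minio.kubeflow:9000) until it accepts a
    connection, so the API server is not started before MinIO is reachable."""
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(delay)
    return False

# Example: an unused local port never becomes reachable, so this returns
# False after exhausting its retries.
print(wait_for_endpoint("127.0.0.1", 9, retries=2, delay=0.1))  # False
```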

DnPlas commented 1 year ago

I also hit this issue today.

My environment

Steps to reproduce

  1. Deploy kfp-api and wait for it to go active and idle.
  2. kfp-api goes into BlockedStatus, requesting the object-storage relation.
  3. Deploy minio.
  4. Relate minio and kfp-api.

NOTE: the workaround did work for me, but we have to wait for the next update-status event to be triggered to see the change.

Status of the units:

App          Version                Status   Scale  Charm                    Channel         Rev  Address         Exposed  Message
kfp-api                             waiting      1  kfp-api                  2.0/stable      413  10.152.183.244  no       installing agent
kfp-db       mariadb/server:10.3    active       1  charmed-osm-mariadb-k8s  stable           35  10.152.183.242  no       ready
kfp-schedwf  res:oci-image@90ddd63  active       1  kfp-schedwf              2.0/stable      424                  no       
kfp-viz      res:oci-image@3de6f3c  active       1  kfp-viz                  2.0/stable      394  10.152.183.102  no       
minio        res:oci-image@1755999  active       1  minio                    ckf-1.7/stable  186  10.152.183.239  no       

Unit            Workload     Agent  Address     Ports              Message
kfp-api/0*      maintenance  idle   10.1.15.13                     Workload failed health check
kfp-db/0*       active       idle   10.1.15.18  3306/TCP           ready
kfp-schedwf/0*  active       idle   10.1.15.12                     
kfp-viz/0*      active       idle   10.1.15.23  8888/TCP           
minio/0*        active       idle   10.1.15.21  9000/TCP,9001/TCP 

Observations

  1. The message Workload failed health check comes from L294 inside the _check_status() method, which is called by on_update_status() on every UpdateStatus event.
  2. The service ml-pipeline-api-server is never started if minio or the relation with minio is missing, because it fails with 2023-06-20T12:56:36.924Z [ml-pipeline-api-server] F0620 12:56:36.924163 19 client_manager.go:412] Failed to check if Minio bucket exists. Error: Get "http://minio.kubeflow:9000/mlpipeline/?location=": dial tcp 10.152.183.239:9000: connect: connection refused
  3. We get a failed health check because 2023-06-20T13:01:36.474Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused, which makes sense because the service is not up.
  4. The service `ml-pipeline-api-server` is not replanned or restarted by the charm code at any point after the initial sequence. We only replan the service if there is a change in the Pebble layer.
  5. Pebble will never restart a service that exited too quickly, because of a design choice (see the description of this bug).
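The failing "kfp-api-up" check in observation 3 boils down to an HTTP probe against a port nothing is listening on. A minimal standalone illustration of that probe behaviour (not Pebble's actual check implementation):

```python
import urllib.error
import urllib.request

def probe_health(url, timeout=1.0):
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Nothing listening: the connection is refused, so the check fails.
        return False

# While the service is down, the probe fails:
print(probe_health("http://localhost:8888/apis/v1beta1/healthz"))  # False while the service is down
```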

Possible solution

We can mitigate this error by not attempting to start (or replan) the service unless we are sure that minio is active and the object-storage relation exists. This means we always have to check for the relation before calling update_layer, deferring (or returning before) the update_layer call until the relation is present.

I can think of something like:

    def _on_event(self, event, force_conflicts: bool = False) -> None:
        # Set up all relations/fetch required data
        try:
            self._check_leader()
            interfaces = self._get_interfaces()
            config_json = self._generate_config(interfaces)
            self._get_object_storage(interfaces) # <--- This also raises ErrorWithStatus
            self._upload_files_to_container(config_json)
            self._apply_k8s_resources(force_conflicts=force_conflicts)
            update_layer(self._container_name, self._container, self._kfp_api_layer, self.logger)
            self._send_info(interfaces)
        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            return

        self.model.unit.status = ActiveStatus()
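To make the fail-fast path concrete, here is a self-contained sketch of how the object-storage lookup could raise ErrorWithStatus before update_layer is ever reached. The ErrorWithStatus and BlockedStatus classes below are simplified stand-ins for the charm's real helpers and the ops status types, and get_object_storage is a hypothetical simplification of the handler's _get_object_storage:

```python
class BlockedStatus:
    """Stand-in for ops.model.BlockedStatus (illustrative only)."""
    def __init__(self, message):
        self.message = message

class ErrorWithStatus(Exception):
    """Stand-in for the charm's ErrorWithStatus helper (illustrative only)."""
    def __init__(self, msg, status_type):
        super().__init__(msg)
        self.status = status_type(msg)

def get_object_storage(interfaces):
    """Return object-storage relation data, raising if it is absent."""
    storage = interfaces.get("object-storage")
    if not storage:
        raise ErrorWithStatus(
            "Waiting for object-storage relation data", BlockedStatus
        )
    return storage

# With no relation yet, the handler short-circuits into BlockedStatus
# instead of reaching update_layer:
try:
    get_object_storage({})
except ErrorWithStatus as err:
    print(err.status.message)  # Waiting for object-storage relation data
```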
phoevos commented 1 year ago

Closing this issue, since https://github.com/canonical/kfp-operators/commit/06303e4f0d2284b532758fd4f30499898888b797 got merged and is now part of track/2.0. We will revisit this if there's any progress with https://github.com/canonical/pebble/issues/240, or we decide to restructure the code to integrate Daniela's proposed solution:

We can mitigate this error if we don't attempt to start the service (or replan it) unless we are sure that minio is active and the object-storage relation exists, which means that we have to always check for the relation before calling update_layer, deferring (or returning before) the update_layer call until the relation is present.