phoevos closed this issue 1 year ago
I also hit this issue today.
Deploy kfp-api -> wait for it to go active and idle -> kfp-api goes into BlockedStatus requesting to add object-storage relation -> deploy minio -> relate minio and kfp-api
NOTE: the workaround did work for me, but we have to wait for the next update-status event to be triggered to see the change.
Status of the units:
App Version Status Scale Charm Channel Rev Address Exposed Message
kfp-api waiting 1 kfp-api 2.0/stable 413 10.152.183.244 no installing agent
kfp-db mariadb/server:10.3 active 1 charmed-osm-mariadb-k8s stable 35 10.152.183.242 no ready
kfp-schedwf res:oci-image@90ddd63 active 1 kfp-schedwf 2.0/stable 424 no
kfp-viz res:oci-image@3de6f3c active 1 kfp-viz 2.0/stable 394 10.152.183.102 no
minio res:oci-image@1755999 active 1 minio ckf-1.7/stable 186 10.152.183.239 no
Unit Workload Agent Address Ports Message
kfp-api/0* maintenance idle 10.1.15.13 Workload failed health check
kfp-db/0* active idle 10.1.15.18 3306/TCP ready
kfp-schedwf/0* active idle 10.1.15.12
kfp-viz/0* active idle 10.1.15.23 8888/TCP
minio/0* active idle 10.1.15.21 9000/TCP,9001/TCP
The Workload failed health check message comes from L294 inside the _check_status() method, which is called in on_update_status() on every UpdateStatus event. ml-pipeline-api-server is never started if minio (and the relation with minio) is missing, because it fails with:

2023-06-20T12:56:36.924Z [ml-pipeline-api-server] F0620 12:56:36.924163 19 client_manager.go:412] Failed to check if Minio bucket exists. Error: Get "http://minio.kubeflow:9000/mlpipeline/?location=": dial tcp 10.152.183.239:9000: connect: connection refused
2023-06-20T13:01:36.474Z [pebble] Check "kfp-api-up" failure 1 (threshold 3): Get "http://localhost:8888/apis/v1beta1/healthz": dial tcp [::1]:8888: connect: connection refused

The failing health check makes sense, because the service is not up. ml-pipeline-api-server is not replanned or restarted by the charm code at any point after the initial sequence; we only replan the service if and only if there is a change in the Pebble layer. We can mitigate this error by not attempting to start (or replan) the service unless we are sure that minio is active and the object-storage relation exists, which means we always have to check for the relation before calling update_layer, deferring (or returning before) the update_layer call until the relation is present.
I can think of something like:
def _on_event(self, event, force_conflicts: bool = False) -> None:
    # Set up all relations/fetch required data
    try:
        self._check_leader()
        interfaces = self._get_interfaces()
        config_json = self._generate_config(interfaces)
        self._get_object_storage(interfaces)  # <--- This also raises ErrorWithStatus
        self._upload_files_to_container(config_json)
        self._apply_k8s_resources(force_conflicts=force_conflicts)
        update_layer(self._container_name, self._container, self._kfp_api_layer, self.logger)
        self._send_info(interfaces)
    except ErrorWithStatus as err:
        self.model.unit.status = err.status
        self.logger.error(f"Failed to handle {event} with error: {err}")
        return
    self.model.unit.status = ActiveStatus()
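For context, the "replan if and only if the layer changed" behaviour amounts to comparing the services section of the current Pebble plan with the freshly built layer. A minimal self-contained sketch (the function name and dict shapes are assumptions for illustration, not the actual update_layer implementation):

```python
def layer_changed(current_plan: dict, new_layer: dict) -> bool:
    """Return True only when the 'services' section differs between plans."""
    return current_plan.get("services", {}) != new_layer.get("services", {})

# The charm only replans (and thereby restarts services) when this returns
# True, so a crashed service whose layer is unchanged is never restarted.
current = {"services": {"ml-pipeline-api-server": {"command": "apiserver", "startup": "enabled"}}}
rebuilt = {"services": {"ml-pipeline-api-server": {"command": "apiserver", "startup": "enabled"}}}
print(layer_changed(current, rebuilt))  # False: no replan, a crashed service stays down
```

This is why ml-pipeline-api-server stays down after its initial crash: every subsequent event rebuilds an identical layer, so the charm never replans.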
Closing this issue, since https://github.com/canonical/kfp-operators/commit/06303e4f0d2284b532758fd4f30499898888b797 got merged and is now part of track/2.0. We will revisit this if there's any progress with https://github.com/canonical/pebble/issues/240, or if we decide to restructure the code to integrate Daniela's proposed solution:
We can mitigate this error if we don't attempt to start the service (or replan it) unless we are sure that minio is active and the object-storage relation exists, which means that we have to always check for the relation before calling update_layer, deferring (or returning before) the update_layer call until the relation is present.
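A minimal sketch of the proposed guard (the relation name comes from the thread; the helper name and the keys checked are hypothetical): bail out of the event handler before calling update_layer whenever the object-storage relation is missing or incomplete:

```python
def should_update_layer(relation_data: dict) -> bool:
    """Only proceed to update_layer when the object-storage relation
    exists and has published connection data (e.g. MinIO's coordinates)."""
    storage = relation_data.get("object-storage")
    return bool(storage) and "service" in storage and "port" in storage

# No relation yet -> the handler should set a waiting status and return early.
print(should_update_layer({}))  # False
# Relation present with MinIO's coordinates -> safe to call update_layer.
print(should_update_layer({"object-storage": {"service": "minio", "port": 9000}}))  # True
```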
We're bumping into this issue intermittently on a clean install with the 2.0/edge (rev 413) version of the charm, when deploying the Kubeflow bundle (1.7/edge) either on MicroK8s or Charmed Kubernetes.

On startup, the KFP API Server tries to connect to MinIO. If MinIO is not yet available, the connection fails with the following error, causing the service to crash less than 1 second after it started.

Because the service fails fast (<1 sec), Pebble considers it to be inactive. This is a design decision on the Pebble side, as explained here:

Since the service was never active, Pebble never attempts to restart it, despite the failing health checks.
Workaround
Since this issue is exposed due to a race (i.e. MinIO not yet available), it won't come up every time. If it does occur during deployment (after the rest of the bundle has been installed successfully), however, we need to start the API Server service manually to unblock:
Mitigation
We need to come up with a plan to avoid bumping into this issue in the future. There are a couple of things that could be done on our side: