canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

KFP run stuck in None state #323

Closed phoevos closed 1 year ago

phoevos commented 1 year ago

Bug Description

The status of submitted KFP runs is never updated and therefore remains stuck at None in the latest/edge version of the KFP charms.

The Argo Workflow is submitted properly and executes successfully, but this is not reflected in the run itself, as seen using either the KFP client (which returns a finished run with a None status) or the KFP UI (which hangs while loading).

This is likely a bug introduced in our recent sidecar rewrites. It's only a guess at the moment, but I suspect that the KFP Persistence Agent is not working properly, given that the KFP MySQL DB is never updated with the completed workflow.

To Reproduce

  1. Deploy KFP (it's easier to do so as part of the CKF bundle to facilitate debugging)
  2. Run one of the example pipelines through the KFP UI
  3. Note that its status never changes (either using the client or navigating to the runs page and attempting to fetch the created run, which will never load)
  4. Inspect the contents of kfp-db and verify that the workflow saved as part of the created run entry does not have an updated status

Environment

Relevant log output

N/A

Additional context

I noticed that before the rewrite we were applying this ServiceAccount to the PersistenceAgent container, which allowed it to access the Argo Workflow K8s resources: https://github.com/canonical/kfp-operators/blob/67235bfa402fb4f67c30521fad431c467a1b0d44/charms/kfp-persistence/src/charm.py#L62-L85

It doesn't look like we're currently applying these elsewhere. This shouldn't be an issue here, since we're deploying the charm with trust, but we should also make a note of deploying the required upstream ClusterRole and Binding.
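For reference, here is a minimal sketch of the kind of ClusterRole and ClusterRoleBinding this refers to, written as plain Python dicts. The resource names, the `argoproj.io`/`workflows` rule, and the verbs are assumptions based on the agent needing read access to Argo Workflows; the upstream manifests remain the authoritative source for the exact rules.

```python
def persistence_agent_cluster_role(name: str = "example-persistence-agent") -> dict:
    """Build an illustrative ClusterRole granting read access to Argo Workflows.

    The name and the rule set are assumptions for this sketch, not copied
    from the upstream manifests.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": ["argoproj.io"],
                "resources": ["workflows"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def persistence_agent_cluster_role_binding(
    role_name: str, service_account: str, namespace: str
) -> dict:
    """Bind the ClusterRole above to a ServiceAccount (hypothetical names)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": f"{role_name}-binding"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }
```

A ClusterRole (rather than a namespaced Role) is used in this sketch because the agent has to read workflows outside its own namespace.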

phoevos commented 1 year ago

Explanation

As part of the charm rewrite, we changed the value of the --namespace option passed to the persistence agent service at startup from "" to the model namespace.
The config is set up here: https://github.com/canonical/kfp-operators/blob/5806a6be8b0ca4111e33b9077ee1c245acbbdc01/charms/kfp-persistence/src/charm.py#L68
And used here: https://github.com/canonical/kfp-operators/blob/5806a6be8b0ca4111e33b9077ee1c245acbbdc01/charms/kfp-persistence/src/components/pebble_components.py#L54

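However, this won't work for multi-user Kubeflow installations, where runs execute in the users' profile namespaces rather than in the namespace where KFP is deployed. Looking into the upstream manifests, it's clear that this value is intended to be empty.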
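To illustrate the difference, here is a rough sketch of how the service command could be assembled. The function name and the `persistence_agent` binary name are hypothetical, and all other flags are omitted; only the --namespace option comes from the discussion above. It also assumes that an empty value means "watch all namespaces", as is conventional for Kubernetes clients.

```python
def persistence_agent_command(namespace: str = "") -> str:
    """Return an illustrative start command for the persistence agent.

    `persistence_agent` is an assumed binary name, and the real command
    takes more flags than shown here. An empty namespace (the default)
    is assumed to make the agent watch workflows in every namespace.
    """
    return f"persistence_agent --namespace={namespace}"


# Behaviour after the rewrite: only the model namespace is watched, so
# workflows created in user profile namespaces are never picked up.
scoped = persistence_agent_command("kubeflow")

# Intended behaviour: an empty value, matching the upstream manifests.
cluster_wide = persistence_agent_command()
```

With the scoped command, a run started from a user profile namespace would keep its None status, which matches the symptom described in this issue.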