canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

kfp-persistence stuck in waiting status #343

Closed DnPlas closed 11 months ago

DnPlas commented 1 year ago

Bug Description

The changes introduced by #331 are affecting the charm, preventing the service from starting correctly on replan.

To Reproduce

  1. git checkout kf-3886-release-1.8-update-charms
  2. tox -e kfp-persistence-integration -- --keep-models # keep-models for debugging purposes
  3. wait for everything to settle
  4. kfp-persistence will be stuck in waiting status with message:
[container:persistenceagent] Waiting for Pebble services (persistenceagent).  If this persists, it could be a blockin...

Environment

  * Ubuntu 20.04
  * Juju 3.1
  * MicroK8s 1.25-strict/stable

Relevant log output

unit-kfp-persistence-0: 10:33:42 ERROR unit.kfp-persistence/0.juju-log kfp-api:5: execute_components caught unhandled exception when executing configure_charm for container:persistenceagent
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/charm_reconciler.py", line 92, in reconcile
    component_item.component.configure_charm(event)
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/component.py", line 50, in configure_charm
    self._configure_unit(event)
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 131, in _configure_unit
    self._update_layer()
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 142, in _update_layer
    container.replan()
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/ops/model.py", line 1915, in replan
    self._pebble.replan_services()
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/ops/pebble.py", line 1680, in replan_services
    return self._services_action('replan', [], timeout, delay)
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/ops/pebble.py", line 1761, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "persistenceagent" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2023-10-02T10:33:42Z INFO Most recent service output:
    W1002 10:33:42.193357      16 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
    time="2023-10-02T10:33:42Z" level=error msg="Error reading persistence agent service account token '/var/run/secrets/kubeflow/tokens/persistenceagent-sa-token': open /var/run/secrets/kubeflow/tokens/persistenceagent-sa-token: no such file or directory"
    time="2023-10-02T10:33:42Z" level=fatal msg="Error starting Service Account Token Refresh Ticker due to: open /var/run/secrets/kubeflow/tokens/persistenceagent-sa-token: no such file or directory"
2023-10-02T10:33:42Z ERROR cannot start service: exited quickly with code 1
-----

Additional context

For more logs and relevant outputs please check the failed CI on https://github.com/canonical/kfp-operators/pull/331

DnPlas commented 1 year ago

It seems we are missing this file (as shown in the error message). The token requirement was introduced upstream only recently, which explains why we did not run into similar issues in the past.

Potential fix

Add the required file

kimwnasptd commented 1 year ago

This will be problematic. We'll need to bring the projected service account token functionality of K8s into our sidecar charms: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#serviceaccount-token-volume-projection

Since Juju handles the creation of the unit's pod, I'd expect this is something we'll need to set on the Juju side.
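For reference, a projected ServiceAccount token is mounted via a pod-spec fragment roughly like the following. This is a sketch based on the Kubernetes docs linked above, not this repo's manifests; the image, `expirationSeconds`, and `audience` values are illustrative assumptions, while the mount path matches the one in the error message:

```yaml
# Sketch of ServiceAccount token volume projection (see the K8s docs linked
# above). Names and values below are illustrative, not taken from this repo.
apiVersion: v1
kind: Pod
metadata:
  name: persistenceagent
spec:
  containers:
    - name: persistenceagent
      image: gcr.io/ml-pipeline/persistenceagent  # illustrative image
      volumeMounts:
        - name: persistenceagent-sa-token
          # Directory the agent reads the token from, per the error message
          mountPath: /var/run/secrets/kubeflow/tokens
  volumes:
    - name: persistenceagent-sa-token
      projected:
        sources:
          - serviceAccountToken:
              # File name under mountPath, per the error message
              path: persistenceagent-sa-token
              expirationSeconds: 3600              # assumption
              audience: pipelines.kubeflow.org     # assumption
```

The kubelet refreshes a projected token before it expires, which is why the agent runs a "Service Account Token Refresh Ticker" against that file; a statically pushed file would not be refreshed the same way.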

DnPlas commented 12 months ago

This issue has been fixed by #349, but I will wait until this commit lands in main to close it.