canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

kfp-persistence charm stuck waiting for Pebble services (persistenceagent) with `ops.pebble.PathError: permission-denied` while creating `/var/run/secrets/kubeflow` #482

Closed motjuste closed 2 months ago

motjuste commented 3 months ago

Bug Description

At Solutions QA, every test run for Charmed-Kubeflow using Juju 3.5/candidate has failed since at least 23-May-2024: the deployment does not finish within 2 hours. The Juju debug-log shows multiple units raising exceptions similar to the traceback below, each reporting `ops.pebble.PathError: permission-denied - cannot create directory: ...` for a different directory.

In the Juju status, kfp-persistence stays stuck in waiting with the message `[container:persistenceagent] Waiting for Pebble services (persistenceagent). If this persists, it could be a blockin...`

To Reproduce

We are deploying using FCE in a very standard manner.

Environment

In baremetal and AWS clouds, each with:

Relevant Log Output

<snip>
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log Executing component: 'sa-token:persistenceagent'
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log Token file already exists, nothing else to do.
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log Execution for component 'sa-token:persistenceagent' complete.  Component now has status 'ActiveStatus('')'
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/serviceaccounts?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log Rendering manifests
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/serviceaccounts?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:32 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log Rendering manifests
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/serviceaccounts?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log Rendering manifests
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log Executing component: 'container:persistenceagent'
unit-kfp-persistence-0: 2024-05-28 01:14:33 ERROR unit.kfp-persistence/0.juju-log execute_components caught unhandled exception when executing configure_charm for container:persistenceagent
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/charm_reconciler.py", line 92, in reconcile
    component_item.component.configure_charm(event)
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/component.py", line 50, in configure_charm
    self._configure_unit(event)
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 130, in _configure_unit
    self._push_files_to_container()
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 94, in _push_files_to_container
    container.push(
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/ops/model.py", line 2073, in push
    self._pebble.push(str(path), source, encoding=encoding,
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/ops/pebble.py", line 2012, in push
    self._raise_on_path_error(typing.cast('_FilesResponse', resp), path)
  File "/var/lib/juju/agents/unit-kfp-persistence-0/charm/venv/ops/pebble.py", line 1964, in _raise_on_path_error
    raise PathError(error['kind'], error['message'])
ops.pebble.PathError: permission-denied - cannot create directory: mkdir /var/run/secrets/kubeflow: permission denied
unit-kfp-persistence-0: 2024-05-28 01:14:33 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:34 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/serviceaccounts?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:34 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:34 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:34 INFO unit.kfp-persistence/0.juju-log Rendering manifests
unit-kfp-persistence-0: 2024-05-28 01:14:34 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 2024-05-28 01:14:34 WARNING unit.seldon-controller-manager/0.seldon-core-pebble-ready Generating RSA private key, 2048 bit long modulus (2 primes)
unit-kfp-persistence-0: 2024-05-28 01:14:34 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/serviceaccounts?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
unit-kfp-persistence-0: 2024-05-28 01:14:34 INFO unit.kfp-persistence/0.juju-log HTTP Request: GET https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?labelSelector=app.kubernetes.io/instance%3Dkfp-persistence-kubeflow%2Ckubernetes-resource-handler-scope%3Dauth "HTTP/1.1 200 OK"
<snip>
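
Since several units report the same kind of error for different directories, a small script can help list which units are affected. The following is only a minimal sketch, not part of the charm or of Juju: the function name `find_path_errors` and both regular expressions are assumptions based on the log format shown above. Traceback lines carry no unit prefix, so the sketch attributes each `PathError` to the most recent unit-prefixed line, which can misattribute errors if output from several units is interleaved inside a traceback.

```python
import re
from collections import defaultdict

# Unit prefix on regular debug-log lines, e.g.
# "unit-kfp-persistence-0: 2024-05-28 01:14:33 ERROR ..."
UNIT_LINE = re.compile(r"^(unit-[\w-]+): \d{4}-\d{2}-\d{2} ")

# Final line of the traceback, e.g.
# "ops.pebble.PathError: permission-denied - cannot create directory: ..."
PATH_ERROR = re.compile(r"^ops\.pebble\.PathError: (?P<kind>[\w-]+) - (?P<message>.*)$")


def find_path_errors(lines):
    """Map each unit name to the (kind, message) PathErrors logged after its lines."""
    errors = defaultdict(list)
    current_unit = None
    for raw in lines:
        line = raw.rstrip("\n")
        unit_match = UNIT_LINE.match(line)
        if unit_match:
            # Remember which unit logged most recently.
            current_unit = unit_match.group(1)
            continue
        error_match = PATH_ERROR.match(line)
        if error_match and current_unit is not None:
            errors[current_unit].append(
                (error_match["kind"], error_match["message"])
            )
    return dict(errors)
```

It accepts any iterable of lines, so a saved copy of the debug-log can be passed as an open file object.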

Additional Context

This may be related to this bug reported against Juju 3.5.0.

More identified recurrences of this bug at Solutions QA can be found here.

syncronize-issues-to-jira[bot] commented 3 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5749.

This message was autogenerated

DnPlas commented 2 months ago

This issue is caused by Juju v3.5.1, and we have already identified it (see https://github.com/canonical/bundle-kubeflow/issues/921). The fix is in Juju v3.5.2, which hasn't been released yet. As a workaround, we are pinning the agent to 3.5.0 in some of our CIs (kfp and notebooks); I suggest you do the same if you'd like to deploy kfp-operators using Juju 3.5. Alternatively, you could use 3.1, which is the supported version for CKF 1.8. 3.4 also works (v3.4.3; earlier versions had bugs), but we haven't fully tested it.

In the next two weeks we will define the supported Juju version and announce it, so for now please use Juju 3.5 with caution.
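
For CI jobs, one way to avoid deploying with the affected agent is a preflight check on the version string. This is only a sketch under assumptions: the names `KNOWN_BAD_AGENT_VERSIONS`, `parse_version`, and `check_agent_version` are hypothetical, and the single listed version comes from this comment rather than from any official compatibility matrix. How the version string is obtained is left to the caller.

```python
# Assumption: only v3.5.1 is listed, because it is the release this comment
# identifies as the cause. Adjust as the supported versions are announced.
KNOWN_BAD_AGENT_VERSIONS = {(3, 5, 1)}


def parse_version(text):
    """Turn a string like '3.5.1' or 'v3.5.1' into a tuple such as (3, 5, 1)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def check_agent_version(text):
    """Return the parsed version, or raise if it is a known-bad agent version."""
    version = parse_version(text)
    if version in KNOWN_BAD_AGENT_VERSIONS:
        raise RuntimeError(
            f"Juju agent {text} is known to hit the Pebble permission-denied issue"
        )
    return version
```

For example, `check_agent_version("3.5.0")` passes, while `check_agent_version("v3.5.1")` raises `RuntimeError`.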

DnPlas commented 2 months ago

Closing this as a duplicate of https://github.com/canonical/bundle-kubeflow/issues/921; please refer to that issue for updates. Thanks @motjuste!

motjuste commented 2 months ago

Thanks for the clarification, and glad we could help. Everything also works well with Juju 3.3. We'll be careful using Juju 3.5.