_get_interfaces() pulls data from our SDI-driven relations, catching NoVersionsListed and NoCompatibleVersions errors and re-raising them as CheckFailed errors with an opinionated charm status. But _get_interfaces() is called from main (./src/charm.py), where the caller catches ErrorWithStatus but not CheckFailed errors. This means that CheckFailed errors raised by _get_interfaces() are not caught and instead surface up to Juju, putting kubeflow-profiles into error state.
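A minimal sketch of the mismatch, using stand-in classes rather than the real ones. ErrorWithStatus, CheckFailed, and WaitingStatus below are simplified local definitions (assumptions for illustration, not the chisme, charm, or ops implementations), and get_interfaces_stub, handler, and hook_outcome are hypothetical names. It only shows that two exception classes with the same shape are still unrelated types as far as `except` is concerned:

```python
class WaitingStatus:
    """Placeholder for ops.model.WaitingStatus (assumption: message-only)."""

    def __init__(self, message=""):
        self.message = message


class ErrorWithStatus(Exception):
    """Stand-in for the chisme exception: a message plus a status type."""

    def __init__(self, msg, status_type):
        super().__init__(str(msg))
        self.msg = str(msg)
        self.status_type = status_type


class CheckFailed(Exception):
    """Stand-in for the charm-local exception: same shape, unrelated class."""

    def __init__(self, msg, status_type):
        super().__init__(str(msg))
        self.msg = str(msg)
        self.status_type = status_type


def get_interfaces_stub():
    # Simulates _get_interfaces() re-raising an SDI error as CheckFailed.
    raise CheckFailed("versions not listed", WaitingStatus)


def handler():
    # Simulates the caller: only ErrorWithStatus is handled here.
    try:
        get_interfaces_stub()
    except ErrorWithStatus as err:
        return err.status_type(err.msg)


def hook_outcome():
    # Simulates what the framework sees: an exception escaping the handler
    # corresponds to a failed hook.
    try:
        handler()
    except CheckFailed:
        return "uncaught: hook fails"
    return "handled"
```

Calling hook_outcome() in this sketch takes the "uncaught" branch, because `except ErrorWithStatus` never matches a CheckFailed instance.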
The failure is visible when running the kubeflow-dashboard CI, which fails because kubeflow-profiles has entered an error state. For example:
Copied from the pytest logs:
_______________________________________________________________________________ test_add_profile_relation _______________________________________________________________________________
Traceback (most recent call last):
File "/home/scribs/code/canonical/kubeflow-dashboard-operator/test-new-profiles/tests/integration/test_charm.py", line 106, in test_add_profile_relation
await ops_test.model.wait_for_idle(
File "/home/scribs/code/canonical/kubeflow-dashboard-operator/test-new-profiles/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2707, in wait_for_idle
_raise_for_status(errors, "error")
File "/home/scribs/code/canonical/kubeflow-dashboard-operator/test-new-profiles/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2646, in _raise_for_status
raise error_type("{}{} in {}: {}".format(
juju.errors.JujuUnitError: Unit in error: kubeflow-profiles/0
----------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------
INFO juju.model:model.py:2088 Deploying ch:amd64/focal/kubeflow-profiles-166
INFO juju.model:model.py:2715 Waiting for model:
kubeflow-profiles/0 [allocating] waiting: installing agent
kubeflow-dashboard/0 [idle] waiting: Waiting for kubeflow-profiles relation data
================================================================================ short test summary info ================================================================================
FAILED tests/integration/test_charm.py::test_add_profile_relation - juju.errors.JujuUnitError: Unit in error: kubeflow-profiles/0
juju show-status-log -i kubeflow-profiles/0
Time Type Status Message
02 Feb 2023 12:48:12-05:00 juju-unit allocating
02 Feb 2023 12:48:12-05:00 workload waiting installing agent
02 Feb 2023 12:48:22-05:00 workload waiting agent initializing
02 Feb 2023 12:48:28-05:00 workload maintenance installing charm software
02 Feb 2023 12:48:28-05:00 juju-unit executing running install hook
02 Feb 2023 12:48:29-05:00 workload waiting Waiting to connect to Profiles container
02 Feb 2023 12:48:29-05:00 workload waiting Waiting to connect to kfam container
02 Feb 2023 12:48:29-05:00 workload maintenance Creating K8S resources
02 Feb 2023 12:48:29-05:00 workload maintenance K8S resources created
02 Feb 2023 12:48:29-05:00 workload active
02 Feb 2023 12:48:30-05:00 juju-unit executing running kubeflow-profiles-relation-created hook
02 Feb 2023 12:48:30-05:00 workload waiting Waiting to connect to Profiles container
02 Feb 2023 12:48:30-05:00 workload waiting Waiting to connect to kfam container
02 Feb 2023 12:48:30-05:00 workload maintenance Creating K8S resources
02 Feb 2023 12:48:30-05:00 workload maintenance K8S resources created
02 Feb 2023 12:48:31-05:00 juju-unit error hook failed: "kubeflow-profiles-relation-created"
02 Feb 2023 12:48:36-05:00 juju-unit executing running kubeflow-profiles-relation-created hook
02 Feb 2023 12:48:36-05:00 workload waiting Waiting to connect to Profiles container
02 Feb 2023 12:48:36-05:00 workload waiting Waiting to connect to kfam container
02 Feb 2023 12:48:36-05:00 workload maintenance Creating K8S resources
02 Feb 2023 12:48:36-05:00 workload maintenance K8S resources created
02 Feb 2023 12:48:36-05:00 workload active
02 Feb 2023 12:48:37-05:00 juju-unit executing running leader-elected hook
02 Feb 2023 12:48:37-05:00 workload waiting Waiting to connect to Profiles container
02 Feb 2023 12:48:37-05:00 workload waiting Waiting to connect to kfam container
02 Feb 2023 12:48:37-05:00 workload maintenance Creating K8S resources
02 Feb 2023 12:48:38-05:00 workload maintenance K8S resources created
02 Feb 2023 12:48:38-05:00 workload active
02 Feb 2023 12:48:38-05:00 workload waiting Waiting to connect to Profiles container
02 Feb 2023 12:48:38-05:00 workload waiting Waiting to connect to kfam container
02 Feb 2023 12:48:38-05:00 workload maintenance Creating K8S resources
02 Feb 2023 12:48:38-05:00 workload maintenance K8S resources created
02 Feb 2023 12:48:38-05:00 workload active
02 Feb 2023 12:48:39-05:00 juju-unit executing running kubeflow-kfam-pebble-ready hook
02 Feb 2023 12:48:39-05:00 workload waiting Waiting to connect to Profiles container
02 Feb 2023 12:48:39-05:00 workload waiting Waiting to connect to kfam container
02 Feb 2023 12:48:39-05:00 workload maintenance Creating K8S resources
02 Feb 2023 12:48:40-05:00 workload maintenance K8S resources created
juju debug-log -i kubeflow-profiles/0 at that moment:
unit-kubeflow-profiles-0: 12:48:30 INFO unit.kubeflow-profiles/0.juju-log kubeflow-profiles:0: Rendering manifests
unit-kubeflow-profiles-0: 12:48:30 INFO unit.kubeflow-profiles/0.juju-log kubeflow-profiles:0: Reconcile completed successfully
unit-kubeflow-profiles-0: 12:48:30 ERROR unit.kubeflow-profiles/0.juju-log kubeflow-profiles:0: Uncaught exception while in charm code:
Traceback (most recent call last):
File "./src/charm.py", line 311, in _get_interfaces
interfaces = get_interfaces(self)
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 344, in get_interfaces
return {
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 345, in <dictcomp>
endpoint: get_interface(charm, endpoint)
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 381, in get_interface
instance.get_data()
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 171, in get_data
rel_data = self.unwrap(relation)
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 262, in unwrap
version = self.get_version(relation)
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 110, in get_version
raise errors.UnversionedRelation(relation)
serialized_data_interface.errors.UnversionedRelation: List of <ops.model.Relation kubeflow-profiles:0> versions not found for apps: kubeflow-dashboard
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./src/charm.py", line 471, in <module>
main(KubeflowProfilesOperator)
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/main.py", line 436, in main
framework.reemit()
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/framework.py", line 866, in reemit
self._reemit()
File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/framework.py", line 931, in _reemit
custom_handler(event)
File "./src/charm.py", line 447, in main
interfaces = self._get_interfaces()
File "./src/charm.py", line 313, in _get_interfaces
raise CheckFailed(err, WaitingStatus)
CheckFailed
unit-kubeflow-profiles-0: 12:48:31 ERROR juju.worker.uniter.operation hook "kubeflow-profiles-relation-created" (via hook dispatching script: dispatch) failed: exit status 1
I'm not sure why, but although the charm does go into Error state, it later recovers (the status log above shows the kubeflow-profiles-relation-created hook running again a few seconds after the failure). I wouldn't have thought this was possible, but it happens consistently. This means that the charm probably works in real deployments, but will often fail in any CI that asserts charms should not go into error state.
Proposed solution
While CheckFailed and ErrorWithStatus are equivalent in function, we can't mix and match them. Refactor to always use the ErrorWithStatus from chisme and this won't be an issue.
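A rough sketch of what the refactor could look like, under stated assumptions: ErrorWithStatus, WaitingStatus, and NoVersionsListed are simplified local stand-ins (not the real chisme, ops, or SDI classes), and sdi_get_interfaces_stub and handler are hypothetical names. The point is only that when the helper raises the same exception type the caller already handles, the error becomes a charm status instead of a failed hook:

```python
class WaitingStatus:
    """Placeholder for ops.model.WaitingStatus (assumption: message-only)."""

    def __init__(self, message=""):
        self.message = message


class ErrorWithStatus(Exception):
    """Stand-in for the chisme exception: a message plus a status type."""

    def __init__(self, msg, status_type):
        super().__init__(str(msg))
        self.msg = str(msg)
        self.status_type = status_type


class NoVersionsListed(Exception):
    """Stand-in for the SDI error raised when a relation lists no versions."""


def sdi_get_interfaces_stub():
    # Simulates the SDI call failing on an unversioned relation.
    raise NoVersionsListed("versions not listed")


def _get_interfaces():
    # Re-raise SDI errors as ErrorWithStatus (not CheckFailed), so that
    # the caller's existing `except ErrorWithStatus` clause applies.
    try:
        return sdi_get_interfaces_stub()
    except NoVersionsListed as err:
        raise ErrorWithStatus(err, WaitingStatus) from err


def handler():
    # Simulates the caller: the error is converted into a unit status.
    try:
        _get_interfaces()
    except ErrorWithStatus as err:
        return err.status_type(err.msg)
    return None
```

In this sketch handler() returns a WaitingStatus carrying the original message, which is the behaviour the charm intended when it chose an opinionated status for these errors.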