canonical / kubeflow-profiles-operator

Kubeflow Profiles Operator
Apache License 2.0

Charm goes into `error` state when its SDI-driven relations do not have complete data #82

Closed ca-scribner closed 1 year ago

ca-scribner commented 1 year ago

_get_interfaces() pulls data from our SDI-driven relations by calling SDI's get_interfaces(), catching NoVersionsListed and NoCompatibleVersions errors and re-raising them as CheckFailed errors with an opinionated charm status. But _get_interfaces() is called from the charm's main event handler (src/charm.py, main, per the traceback below), which catches ErrorWithStatus but not CheckFailed. As a result, CheckFailed errors raised by _get_interfaces() go uncaught and surface to Juju, putting kubeflow-profiles into error state.

This can be seen when running the kubeflow-dashboard CI, which fails because profiles has entered an error state. For example:

copied from pytest logs:

_______________________________________________________________________________ test_add_profile_relation _______________________________________________________________________________
Traceback (most recent call last):
  File "/home/scribs/code/canonical/kubeflow-dashboard-operator/test-new-profiles/tests/integration/test_charm.py", line 106, in test_add_profile_relation
    await ops_test.model.wait_for_idle(
  File "/home/scribs/code/canonical/kubeflow-dashboard-operator/test-new-profiles/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2707, in wait_for_idle
    _raise_for_status(errors, "error")
  File "/home/scribs/code/canonical/kubeflow-dashboard-operator/test-new-profiles/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2646, in _raise_for_status
    raise error_type("{}{} in {}: {}".format(
juju.errors.JujuUnitError: Unit in error: kubeflow-profiles/0
----------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------
INFO     juju.model:model.py:2088 Deploying ch:amd64/focal/kubeflow-profiles-166
INFO     juju.model:model.py:2715 Waiting for model:
  kubeflow-profiles/0 [allocating] waiting: installing agent
  kubeflow-dashboard/0 [idle] waiting: Waiting for kubeflow-profiles relation data
================================================================================ short test summary info ================================================================================
FAILED tests/integration/test_charm.py::test_add_profile_relation - juju.errors.JujuUnitError: Unit in error: kubeflow-profiles/0

juju show-status-log -i kubeflow-profiles/0

Time                        Type       Status       Message
02 Feb 2023 12:48:12-05:00  juju-unit  allocating   
02 Feb 2023 12:48:12-05:00  workload   waiting      installing agent
02 Feb 2023 12:48:22-05:00  workload   waiting      agent initializing
02 Feb 2023 12:48:28-05:00  workload   maintenance  installing charm software
02 Feb 2023 12:48:28-05:00  juju-unit  executing    running install hook
02 Feb 2023 12:48:29-05:00  workload   waiting      Waiting to connect to Profiles container
02 Feb 2023 12:48:29-05:00  workload   waiting      Waiting to connect to kfam container
02 Feb 2023 12:48:29-05:00  workload   maintenance  Creating K8S resources
02 Feb 2023 12:48:29-05:00  workload   maintenance  K8S resources created
02 Feb 2023 12:48:29-05:00  workload   active       
02 Feb 2023 12:48:30-05:00  juju-unit  executing    running kubeflow-profiles-relation-created hook
02 Feb 2023 12:48:30-05:00  workload   waiting      Waiting to connect to Profiles container
02 Feb 2023 12:48:30-05:00  workload   waiting      Waiting to connect to kfam container
02 Feb 2023 12:48:30-05:00  workload   maintenance  Creating K8S resources
02 Feb 2023 12:48:30-05:00  workload   maintenance  K8S resources created
02 Feb 2023 12:48:31-05:00  juju-unit  error        hook failed: "kubeflow-profiles-relation-created"
02 Feb 2023 12:48:36-05:00  juju-unit  executing    running kubeflow-profiles-relation-created hook
02 Feb 2023 12:48:36-05:00  workload   waiting      Waiting to connect to Profiles container
02 Feb 2023 12:48:36-05:00  workload   waiting      Waiting to connect to kfam container
02 Feb 2023 12:48:36-05:00  workload   maintenance  Creating K8S resources
02 Feb 2023 12:48:36-05:00  workload   maintenance  K8S resources created
02 Feb 2023 12:48:36-05:00  workload   active       
02 Feb 2023 12:48:37-05:00  juju-unit  executing    running leader-elected hook
02 Feb 2023 12:48:37-05:00  workload   waiting      Waiting to connect to Profiles container
02 Feb 2023 12:48:37-05:00  workload   waiting      Waiting to connect to kfam container
02 Feb 2023 12:48:37-05:00  workload   maintenance  Creating K8S resources
02 Feb 2023 12:48:38-05:00  workload   maintenance  K8S resources created
02 Feb 2023 12:48:38-05:00  workload   active       
02 Feb 2023 12:48:38-05:00  workload   waiting      Waiting to connect to Profiles container
02 Feb 2023 12:48:38-05:00  workload   waiting      Waiting to connect to kfam container
02 Feb 2023 12:48:38-05:00  workload   maintenance  Creating K8S resources
02 Feb 2023 12:48:38-05:00  workload   maintenance  K8S resources created
02 Feb 2023 12:48:38-05:00  workload   active       
02 Feb 2023 12:48:39-05:00  juju-unit  executing    running kubeflow-kfam-pebble-ready hook
02 Feb 2023 12:48:39-05:00  workload   waiting      Waiting to connect to Profiles container
02 Feb 2023 12:48:39-05:00  workload   waiting      Waiting to connect to kfam container
02 Feb 2023 12:48:39-05:00  workload   maintenance  Creating K8S resources
02 Feb 2023 12:48:40-05:00  workload   maintenance  K8S resources created

juju debug-log -i kubeflow-profiles/0 at that moment:

unit-kubeflow-profiles-0: 12:48:30 INFO unit.kubeflow-profiles/0.juju-log kubeflow-profiles:0: Rendering manifests
unit-kubeflow-profiles-0: 12:48:30 INFO unit.kubeflow-profiles/0.juju-log kubeflow-profiles:0: Reconcile completed successfully
unit-kubeflow-profiles-0: 12:48:30 ERROR unit.kubeflow-profiles/0.juju-log kubeflow-profiles:0: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 311, in _get_interfaces
    interfaces = get_interfaces(self)
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 344, in get_interfaces
    return {
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 345, in <dictcomp>
    endpoint: get_interface(charm, endpoint)
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 381, in get_interface
    instance.get_data()
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 171, in get_data
    rel_data = self.unwrap(relation)
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 262, in unwrap
    version = self.get_version(relation)
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/serialized_data_interface/sdi.py", line 110, in get_version
    raise errors.UnversionedRelation(relation)
serialized_data_interface.errors.UnversionedRelation: List of <ops.model.Relation kubeflow-profiles:0> versions not found for apps: kubeflow-dashboard

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 471, in <module>
    main(KubeflowProfilesOperator)
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/main.py", line 436, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/framework.py", line 866, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-kubeflow-profiles-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 447, in main
    interfaces = self._get_interfaces()
  File "./src/charm.py", line 313, in _get_interfaces
    raise CheckFailed(err, WaitingStatus)
CheckFailed
unit-kubeflow-profiles-0: 12:48:31 ERROR juju.worker.uniter.operation hook "kubeflow-profiles-relation-created" (via hook dispatching script: dispatch) failed: exit status 1
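
The chained traceback above comes down to a plain Python behaviour: an except clause only matches the named class and its subclasses, so two exception types with the same shape are still not interchangeable. The following is a minimal sketch of that pattern, not the charm's code. The class bodies, WaitingStatus, get_interfaces_wrapper() and handler() are all hypothetical stand-ins; only the names CheckFailed and ErrorWithStatus come from this issue.

```python
class WaitingStatus:
    """Hypothetical placeholder for the status class passed to the errors."""

    def __init__(self, message=""):
        self.message = message


class ErrorWithStatus(Exception):
    """Stand-in for the exception type the caller knows how to handle."""

    def __init__(self, msg, status_type):
        super().__init__(msg)
        self.msg = msg
        self.status_type = status_type


class CheckFailed(Exception):
    """Stand-in with the same shape, but NOT a subclass of ErrorWithStatus."""

    def __init__(self, msg, status_type):
        super().__init__(msg)
        self.msg = msg
        self.status_type = status_type


def get_interfaces_wrapper():
    # Mimics _get_interfaces(): incomplete relation data becomes CheckFailed.
    raise CheckFailed("versions not found", WaitingStatus)


def handler():
    # Mimics the caller: only ErrorWithStatus is turned into a unit status.
    try:
        get_interfaces_wrapper()
    except ErrorWithStatus as err:
        return err.status_type(err.msg)
    return None


try:
    handler()
    escaped = False
except CheckFailed:
    # CheckFailed was never matched inside handler(), so it propagated out.
    escaped = True

print("CheckFailed escaped the handler:", escaped)
```

In the charm, the equivalent of the outer try is the Juju hook runner, which is why the uncaught exception shows up as a failed hook.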

I'm not sure why, but although the charm does go into error state, it later recovers. I wouldn't have thought this was possible, but it happens consistently. This means the charm probably works in real deployments, but it will often fail in any CI that asserts charms never enter error state.

Proposed solution

While CheckFailed and ErrorWithStatus are functionally equivalent, we can't mix and match them. Refactor to always use the ErrorWithStatus from chisme and this won't be an issue.
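
A rough sketch of what that refactor could look like, assuming a single ErrorWithStatus type is raised by the wrapper and caught by the handler. To stay self-contained it defines local stand-ins instead of importing chisme, SDI or ops, so ErrorWithStatus, WaitingStatus, UnversionedRelation, sdi_get_interfaces() and handle_event() here are illustrative placeholders rather than the real APIs.

```python
class WaitingStatus:
    """Hypothetical placeholder for a charm status class."""

    def __init__(self, message=""):
        self.message = message


class ErrorWithStatus(Exception):
    """Stand-in for the single exception type used everywhere after the refactor."""

    def __init__(self, msg, status_type):
        super().__init__(msg)
        self.msg = msg
        self.status_type = status_type


class UnversionedRelation(Exception):
    """Stand-in for the SDI error seen in the traceback."""


def sdi_get_interfaces():
    # Stand-in for SDI's get_interfaces() when relation data is incomplete.
    raise UnversionedRelation("versions not found")


def _get_interfaces():
    try:
        return sdi_get_interfaces()
    except UnversionedRelation as err:
        # Re-raise as the SAME type the caller catches, keeping the cause chained.
        raise ErrorWithStatus(str(err), WaitingStatus) from err


def handle_event():
    # The handler's existing except clause now covers the relation-data failure.
    try:
        _get_interfaces()
    except ErrorWithStatus as err:
        return err.status_type(err.msg)
    return None


status = handle_event()
print(type(status).__name__, "-", status.message)
```

With one exception type on both sides, incomplete relation data ends in a waiting status instead of a failed hook.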

ca-scribner commented 1 year ago

This feels like it should have been caught by a unit test. We should also check our test coverage to decide whether there's a unit test we need to add.
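
One possible shape for such a test, sketched with the standard library only. FakeCharm is a hypothetical, heavily reduced model of the event handler, and ErrorWithStatus and WaitingStatus are local stand-ins; a real test in this repo would more likely drive the actual charm class through the ops testing harness. The point is only the assertion pattern: force _get_interfaces() to fail, then check that the handler sets a status rather than raising.

```python
import io
import unittest
from unittest import mock


class WaitingStatus:
    """Hypothetical placeholder for a charm status class."""

    def __init__(self, message=""):
        self.message = message


class ErrorWithStatus(Exception):
    """Stand-in for the exception type the handler catches."""

    def __init__(self, msg, status_type):
        super().__init__(msg)
        self.msg = msg
        self.status_type = status_type


class FakeCharm:
    """Hypothetical reduced model of the charm's main event handler."""

    def __init__(self):
        self.status = None

    def _get_interfaces(self):
        return {}

    def main(self, _event=None):
        try:
            self._get_interfaces()
        except ErrorWithStatus as err:
            self.status = err.status_type(err.msg)


class TestIncompleteRelationData(unittest.TestCase):
    def test_handler_sets_waiting_instead_of_raising(self):
        charm = FakeCharm()
        failure = ErrorWithStatus("versions not found", WaitingStatus)
        # Simulate incomplete relation data by making the lookup fail.
        with mock.patch.object(charm, "_get_interfaces", side_effect=failure):
            charm.main()  # must not raise
        self.assertIsInstance(charm.status, WaitingStatus)
        self.assertEqual(charm.status.message, "versions not found")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIncompleteRelationData)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print("tests run:", result.testsRun, "ok:", result.wasSuccessful())
```

Had the handler and the wrapper disagreed on the exception type, charm.main() would raise inside the test and the failure would be reported there rather than in integration CI.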

ca-scribner commented 1 year ago

Fixed by #83