canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Bump juju 3.1 -> 3.5 #859

Closed DnPlas closed 3 months ago

DnPlas commented 5 months ago

Context

According to the Juju roadmap & releases page, juju 3.1 support stops at the end of April 2024. The next supported version is 3.4, whose bug-fix support ends in April 2024 and whose security-fix support ends in July 2024. Because of this, and to better support features in CKF, charms have to be tested with this version.

NOTE: while the juju 3.5 release is close, some features and user stories need this bump now, for instance canonical/istio-operators#398. After 3.5 is released, the team will have to go through this process again.

What needs to get done

  1. Test that the CKF 1.8/stable bundle works well with juju 3.4 using the UATs - CKF 1.7 only supports juju 2.9, so it does not have to be tested.
  2. Bump the juju version in the CI (both controller and client)
  3. Bump charm and testing framework dependencies (ops, python-libjuju, etc.)
  4. Provide an upgrade path from 2.9 (supported in CKF 1.7) to 3.4
  5. (potential) Update any test that needs to be updated because of this change
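Step 2 above (bumping the juju version in CI) amounts to a bulk edit of the workflow pins. A minimal sketch, shown on a throwaway copy so the edit runs end to end; the file path and the juju-channel key are assumptions based on the .github/workflows/integrate.yaml convention used across these repositories:

```shell
# Minimal sketch (file path and key name are assumptions): bump the
# juju-channel pin in a CI workflow from 3.1/stable to 3.4/stable.
# A throwaway copy is used here so the edit can be shown end to end.
mkdir -p demo/.github/workflows
printf 'juju-channel: 3.1/stable\n' > demo/.github/workflows/integrate.yaml
sed -i 's|juju-channel: 3.1/stable|juju-channel: 3.4/stable|g' \
  demo/.github/workflows/integrate.yaml
cat demo/.github/workflows/integrate.yaml  # prints: juju-channel: 3.4/stable
```

In a real run the same sed would be applied across every repository's workflow file rather than a demo copy.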

Definition of Done

Merge all of:

syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5503.

This message was autogenerated

DnPlas commented 5 months ago

Initial tests show that:

$ juju controllers
Use --refresh option with this command to see the latest information.

Controller  Model     User   Access     Cloud/Region        Models  Nodes  HA  Version
uk8s*       kubeflow  admin  superuser  microk8s/localhost       2      1   -  3.4.1  

$ juju --version
3.4.1-genericlinux-amd64
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kubeflow-katib 0.15.0 requires grpcio==1.41.1, but you have grpcio 1.51.3 which is incompatible.
kubeflow-katib 0.15.0 requires protobuf==3.19.5, but you have protobuf 3.20.3 which is incompatible.
kfp 2.4.0 requires kubernetes<27,>=8.0.0, but you have kubernetes 28.1.0 which is incompatible.
------------------------------ Captured log call -------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
=========================== short test summary info ============================
FAILED test_notebooks.py::test_notebook[e2e-wine-kfp-mlflow-seldon] - Failed:...
FAILED test_notebooks.py::test_notebook[katib-integration] - Failed: AssertionError: Katib Experiment was not successful.
FAILED test_notebooks.py::test_notebook[mlflow-integration] - Failed: Noteboo...
FAILED test_notebooks.py::test_notebook[mlflow-kserve] - Failed: Notebook exe...
FAILED test_notebooks.py::test_notebook[mlflow-minio-integration] - Failed: N...
FAILED test_notebooks.py::test_notebook[training-integration] - Failed: Noteb...
...
File "/home/ubuntu/shared/charmed-kubeflow-uats/driver/test_kubeflow_workloads.py", line 130, in test_kubeflow_workloads
    pytest.fail(
  File "/home/ubuntu/shared/charmed-kubeflow-uats/.tox/uats-remote/lib/python3.10/site-packages/_pytest/outcomes.py", line 198, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: Something went wrong while running Job test-kubeflow/test-kubeflow. Please inspect the attached logs for more info...
...
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0

The message does not seem to be related to the controller; although we should look into it, we can rule this error out as a blocker for bumping the juju version. I will create an issue on canonical/charmed-kubeflow-uats to follow up.

DnPlas commented 5 months ago

One limitation I have found is that juju 3.4 does not seem to handle pod spec charms' unit statuses correctly. While most CKF charms follow the sidecar pattern, some are still pod spec (like oidc-gatekeeper and kubeflow-volumes). The behaviour I am observing is shown here (with kubeflow-volumes):

$ juju status
Model       Controller  Cloud/Region        Version  SLA          Timestamp
test-istio  uk8s        microk8s/localhost  3.4.1    unsupported  12:26:36Z

App                     Version                Status   Scale  Charm                   Channel      Rev  Address         Exposed  Message
istio-ingressgateway                           active       1  istio-gateway                          0  10.152.183.62   no
istio-pilot                                    active       1  istio-pilot                            0  10.152.183.246  no
kubeflow-volumes        res:oci-image@2261827  waiting      1  kubeflow-volumes        1.8/stable   260                  no       waiting for container
tensorboard-controller                         active       1  tensorboard-controller  latest/edge  266  10.152.183.108  no

Unit                       Workload  Agent  Address      Ports  Message
istio-ingressgateway/0*    active    idle   10.1.60.139
istio-pilot/0*             active    idle   10.1.60.138
kubeflow-volumes/0*        waiting   idle                       waiting for container # <--- waiting for container, but container is running
tensorboard-controller/0*  active    idle   10.1.60.143

$ kubectl get pods -A | grep volumes
test-istio        kubeflow-volumes-operator-0                      1/1     Running   0          23m

Interestingly enough, this is not the case for oidc-gatekeeper when deploying with juju deploy, but it is when deploying with model.deploy (e.g. from a test case).

This affects integration tests, as they time out waiting for all units to reach Active status.
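To spot this mismatch quickly when triaging CI logs, units stuck in waiting can be filtered out of a captured juju status unit table; a minimal awk sketch, using unit lines copied from the output above:

```shell
# Sketch: print units whose workload status is "waiting" from a captured
# `juju status` unit table (sample lines copied from the output above).
cat > status.txt <<'EOF'
istio-pilot/0*             active    idle   10.1.60.138
kubeflow-volumes/0*        waiting   idle                       waiting for container
EOF
awk '$2 == "waiting" { print $1 }' status.txt  # prints: kubeflow-volumes/0*
```

Any unit printed here can then be cross-checked against kubectl get pods, as done above.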

DnPlas commented 5 months ago

I did a couple more tests and here are my findings:

2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "ingress-auth-relation-changed" hook (via hook dispatching script: dispatch)
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:11 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:20 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
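The repeated resolver messages above are easier to quantify than to read when triaging a long uniter log; a small grep sketch over an excerpt of the lines shown above:

```shell
# Sketch: count the "actions are blocked" resolver messages in a captured
# uniter log excerpt (lines copied from the output above).
cat > uniter.log <<'EOF'
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
EOF
grep -c 'actions are blocked=true' uniter.log  # prints: 1
```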

I am currently talking to the juju team about it.

DnPlas commented 5 months ago

Update about oidc-gatekeeper and kubeflow-volumes:

When bumping each charm's CI in both main and track/<version>, all the tests pass, which now makes it look like the istio-operators integration tests are the actual source of the issue. To unblock that CI, I have tried swapping kubeflow-volumes with tensorboards-web-app as the ingress requirer application, just to see if there is a difference. In the long run, and to avoid having to deploy charms that change so often, we should have a generic ingress requirer charm that assists in checking the ingress relation but does not actually do anything else.

DnPlas commented 5 months ago

After closer inspection of each of the failing CIs, it looks like the charms deployed in the istio-operators CI were quite outdated and had an ops version < 2.x, causing some collisions with juju 3.4. https://github.com/canonical/istio-operators/pull/405 should fix the issues in that repo's CI. For the rest of the repositories, I have not found issues so far, but I'll keep an eye out for places where the charm version is outdated.

Related to: https://github.com/canonical/bundle-kubeflow/issues/857

DnPlas commented 5 months ago

I think this effort is big enough to be split into smaller tasks, and we should definitely involve more people from the team, since at the moment some of the changes have to happen manually. I think this task can be completed by doing the following:

  1. Bump all versions - Do a cannon run to bump all instances of juju-channel in .github/workflows/integrate.yaml across repositories. At the same time, bump the versions of ops, pytest-operator, and python-libjuju.

  2. Pin charm dependencies (kind of optional) - All charms that get deployed as dependencies in integration tests must be pinned to their corresponding 1.8 stable channels in the gh:track/ branches. For instance, istio-operators deploys kubeflow-volumes in its integration tests; we must ensure that the last supported stable version of kubeflow-volumes gets deployed alongside the last supported stable istio-operators.

    • The resulting PR should be something like https://github.com/canonical/istio-operators/pull/405

    • This is kind of optional because, while some of the integration tests may pass anyway, the correct way to keep integration test runs repeatable is to always test with the same versions.

    • This change affects gh:track/, but we may want to test with latest/edge in main, so some changes must be made there as well.

    • Related issue: https://github.com/canonical/bundle-kubeflow/issues/857

    • NOTE: this effort is huge if done manually and could be automated. There is a proposal (https://warthogs.atlassian.net/browse/PF-4580) for automating it, but if we opt for the manual process because of customer timelines, we may need to assign ~5 charms to each engineer on the team.

  3. Promote to stable - Once all of the necessary changes are merged, all of our 30+ charms have to be promoted from /edge to /stable.

    • This is a manual process: go to the repo → go to Actions → run the promote action for each charm in the repo. If we opt for the manual process because of customer timelines, we may need to assign ~5 charms to each engineer on the team.

    • We could add a workflow dispatch that promotes all charms, but this will add more work to the task. I suggest we do so, as we will need it in the future.

  4. Manual testing - ensure that every charm can be deployed, both individually and as part of the bundle, with juju 3.4.
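The per-repo promote runs in step 3 could be scripted with the GitHub CLI instead of clicking through the Actions tab. A hedged sketch; the repo list, workflow file name, and -f input names are all assumptions, and the commands are emitted as a dry-run script rather than executed:

```shell
# Hypothetical sketch: generate `gh workflow run` commands to trigger each
# repo's promote workflow. Repo list, workflow name, and inputs are assumptions.
for repo in istio-operators kubeflow-volumes oidc-gatekeeper; do
  echo gh workflow run promote.yaml -R "canonical/$repo" \
    -f origin-channel=latest/edge -f destination-channel=1.8/stable
done > promote-commands.sh
cat promote-commands.sh  # review before running: sh promote-commands.sh
```

A dry run like this makes it easy to review the full promotion batch before any charm actually moves channels.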

DnPlas commented 5 months ago

While canonical/istio-operators#405 fixed the problem for some charms in the istio-operators CI, it did not solve the problem entirely. At first glance it looks like podspec charms deployed from Charmhub are having some trouble; this is the case for kubeflow-volumes. There is an ongoing conversation with the juju team here.

DnPlas commented 4 months ago

Because juju 3.5 became available before the bump to 3.4 was completed, the team has decided to go with that version instead. The work for this effort is not affected.

DnPlas commented 3 months ago

Since all of the GitHub CIs are now running juju 3.5, we can close this issue.