canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Bump juju 3.1 -> 3.5 #859

Closed DnPlas closed 3 months ago

DnPlas commented 5 months ago

Context

According to the Juju roadmap & releases page, juju 3.1 support stops at the end of April 2024. The next supported version is 3.4, whose bug-fix support ends in April 2024 and whose security-fix support ends in July 2024. Because of this, and to better support features in CKF, charms have to be tested with this version.

NOTE: while the juju 3.5 release is close, some features and user stories need this bump now, for instance canonical/istio-operators#398. After 3.5 is released, the team will have to go through this process again.

What needs to get done

  1. Test that the CKF 1.8/stable bundle works well with juju 3.4 using the UATs - CKF 1.7 only supports juju 2.9, so it does not have to be tested.
  2. Bump the juju version in the CI (both controller and client)
  3. Bump charm and testing framework dependencies (ops, python-libjuju, etc.)
  4. Provide an upgrade path from 2.9 (supported in CKF 1.7) to 3.4
  5. (potential) Update any test that needs to be updated because of this change
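Step 2 above (bumping the juju version in CI) amounts to a bulk edit of the workflow pins. A minimal sketch, shown on a throwaway copy so the edit runs end to end; the file path and the juju-channel key are assumptions based on the .github/workflows/integrate.yaml convention used across these repositories:

```shell
# Minimal sketch (file path and key name are assumptions): bump the
# juju-channel pin in a CI workflow from 3.1/stable to 3.4/stable.
# A throwaway copy is used here so the edit can be shown end to end.
mkdir -p demo/.github/workflows
printf 'juju-channel: 3.1/stable\n' > demo/.github/workflows/integrate.yaml
sed -i 's|juju-channel: 3.1/stable|juju-channel: 3.4/stable|g' \
  demo/.github/workflows/integrate.yaml
cat demo/.github/workflows/integrate.yaml  # prints: juju-channel: 3.4/stable
```

In a real run the same sed would be applied across every repository's workflow file rather than a demo copy.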

Definition of Done

Merge all of:

syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5503.

This message was autogenerated

DnPlas commented 5 months ago

Initial tests show that:

$ juju controllers
Use --refresh option with this command to see the latest information.

Controller  Model     User   Access     Cloud/Region        Models  Nodes  HA  Version
uk8s*       kubeflow  admin  superuser  microk8s/localhost       2      1   -  3.4.1  

$ juju --version
3.4.1-genericlinux-amd64
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kubeflow-katib 0.15.0 requires grpcio==1.41.1, but you have grpcio 1.51.3 which is incompatible.
kubeflow-katib 0.15.0 requires protobuf==3.19.5, but you have protobuf 3.20.3 which is incompatible.
kfp 2.4.0 requires kubernetes<27,>=8.0.0, but you have kubernetes 28.1.0 which is incompatible.
------------------------------ Captured log call -------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
=========================== short test summary info ============================
FAILED test_notebooks.py::test_notebook[e2e-wine-kfp-mlflow-seldon] - Failed:...
FAILED test_notebooks.py::test_notebook[katib-integration] - Failed: AssertionError: Katib Experiment was not successful.
FAILED test_notebooks.py::test_notebook[mlflow-integration] - Failed: Noteboo...
FAILED test_notebooks.py::test_notebook[mlflow-kserve] - Failed: Notebook exe...
FAILED test_notebooks.py::test_notebook[mlflow-minio-integration] - Failed: N...
FAILED test_notebooks.py::test_notebook[training-integration] - Failed: Noteb...
...
File "/home/ubuntu/shared/charmed-kubeflow-uats/driver/test_kubeflow_workloads.py", line 130, in test_kubeflow_workloads
    pytest.fail(
  File "/home/ubuntu/shared/charmed-kubeflow-uats/.tox/uats-remote/lib/python3.10/site-packages/_pytest/outcomes.py", line 198, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: Something went wrong while running Job test-kubeflow/test-kubeflow. Please inspect the attached logs for more info...
...
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0

The message does not seem to be related to the controller; although we should look into it, we can rule this error out as a blocker for bumping the juju version. I will create an issue on canonical/charmed-kubeflow-uats to follow up.

DnPlas commented 5 months ago

One limitation I have found is that juju 3.4 does not seem to handle pod spec charms' unit statuses correctly. While most CKF charms follow the sidecar pattern, some are still pod spec (like oidc-gatekeeper and kubeflow-volumes). The behaviour I am observing is shown here (with kubeflow-volumes):

$ juju status
Model       Controller  Cloud/Region        Version  SLA          Timestamp
test-istio  uk8s        microk8s/localhost  3.4.1    unsupported  12:26:36Z

App                     Version                Status   Scale  Charm                   Channel      Rev  Address         Exposed  Message
istio-ingressgateway                           active       1  istio-gateway                          0  10.152.183.62   no
istio-pilot                                    active       1  istio-pilot                            0  10.152.183.246  no
kubeflow-volumes        res:oci-image@2261827  waiting      1  kubeflow-volumes        1.8/stable   260                  no       waiting for container
tensorboard-controller                         active       1  tensorboard-controller  latest/edge  266  10.152.183.108  no

Unit                       Workload  Agent  Address      Ports  Message
istio-ingressgateway/0*    active    idle   10.1.60.139
istio-pilot/0*             active    idle   10.1.60.138
kubeflow-volumes/0*        waiting   idle                       waiting for container # <--- waiting for container, but container is running
tensorboard-controller/0*  active    idle   10.1.60.143

$ kubectl get pods -A | grep volumes
test-istio        kubeflow-volumes-operator-0                      1/1     Running   0          23m

Interestingly enough, this is not the case for oidc-gatekeeper when deploying with juju deploy, but it is when deploying with model.deploy (e.g. from a test case).

This affects integration tests, as they time out waiting for all units to reach Active status.
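To spot this mismatch quickly when triaging CI logs, units stuck in waiting can be filtered out of a captured juju status unit table; a minimal awk sketch, using unit lines copied from the output above:

```shell
# Sketch: print units whose workload status is "waiting" from a captured
# `juju status` unit table (sample lines copied from the output above).
cat > status.txt <<'EOF'
istio-pilot/0*             active    idle   10.1.60.138
kubeflow-volumes/0*        waiting   idle                       waiting for container
EOF
awk '$2 == "waiting" { print $1 }' status.txt  # prints: kubeflow-volumes/0*
```

Any unit printed here can then be cross-checked against kubectl get pods, as done above.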

DnPlas commented 5 months ago

I did a couple more tests and here are my findings:

2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "ingress-auth-relation-changed" hook (via hook dispatching script: dispatch)
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:11 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:20 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
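The repeated resolver messages above are easier to quantify than to read when triaging a long uniter log; a small grep sketch over an excerpt of the lines shown above:

```shell
# Sketch: count the "actions are blocked" resolver messages in a captured
# uniter log excerpt (lines copied from the output above).
cat > uniter.log <<'EOF'
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
EOF
grep -c 'actions are blocked=true' uniter.log  # prints: 1
```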

I am currently talking to the juju team about it.

DnPlas commented 5 months ago

Update about oidc-gatekeeper and kubeflow-volumes:

When bumping each charm's CI in both main and track/<version>, all the tests pass, which now makes it look like the istio-operators integration tests are the actual source of the issue. To unblock that CI, I have tried swapping kubeflow-volumes with tensorboards-web-app as the ingress requirer application, just to see if there is a difference. In the long run, and to avoid having to deploy charms that change so often, we should have a generic ingress requirer charm that assists in checking the ingress relation but does not actually do anything else.

DnPlas commented 5 months ago

After closer inspection of each of the failing CIs, it looks like the charms deployed in the istio-operators CI were quite outdated and had an ops version < 2.x, causing some collisions with juju 3.4. https://github.com/canonical/istio-operators/pull/405 should fix the issues in that repo's CI. For the rest of the repositories, I have not found issues so far, but I'll keep an eye out for places where the charm version is outdated.

Related to: https://github.com/canonical/bundle-kubeflow/issues/857

DnPlas commented 5 months ago

I think this effort is big enough to be split into smaller tasks, and we should definitely involve more people from the team, since at the moment some of the changes have to happen manually. I think this task can be completed by doing the following:

  1. Bump all versions - Do a cannon run to bump all instances of juju-channel in .github/workflows/integrate.yaml across repositories. At the same time, bump the versions of ops, pytest-operator, and python-libjuju.

  2. Pin charm dependencies (kind of optional) - All charms that get deployed as dependencies in integration tests must be pinned to their corresponding 1.8 stable channels in the gh:track/ branches. For instance, istio-operators deploys kubeflow-volumes in its integration tests; we must ensure that the last supported stable version of kubeflow-volumes gets deployed alongside the last supported stable istio-operators.

    • The resulting PR should be something like https://github.com/canonical/istio-operators/pull/405

    • This is kind of optional because, while some of the integration tests may pass anyway, the correct way to keep integration test runs repeatable is to always test with the same versions.

    • This change affects gh:track/, but we may want to test with latest/edge in main, so some changes must be made there as well.

    • Related issue: https://github.com/canonical/bundle-kubeflow/issues/857

    • NOTE: this effort is huge if done manually and could be automated. There is a proposal (https://warthogs.atlassian.net/browse/PF-4580) for automating it, but if we opt for the manual process because of customer timelines, we may need to assign ~5 charms to each engineer on the team.

  3. Promote to stable - Once all of the necessary changes are merged, all of our 30+ charms have to be promoted from /edge to /stable.

    • This is a manual process: go to the repo → go to Actions → run the promote action for each charm in the repo. If we opt for the manual process because of customer timelines, we may need to assign ~5 charms to each engineer on the team.

    • We could add a workflow dispatch that promotes all charms, but this will add more work to the task. I suggest we do so, as we will need it in the future.

  4. Manual testing - ensure that every charm can be deployed, both individually and as part of the bundle, with juju 3.4.
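The per-repo promote runs in step 3 could be scripted with the GitHub CLI instead of clicking through the Actions tab. A hedged sketch; the repo list, workflow file name, and -f input names are all assumptions, and the commands are emitted as a dry-run script rather than executed:

```shell
# Hypothetical sketch: generate `gh workflow run` commands to trigger each
# repo's promote workflow. Repo list, workflow name, and inputs are assumptions.
for repo in istio-operators kubeflow-volumes oidc-gatekeeper; do
  echo gh workflow run promote.yaml -R "canonical/$repo" \
    -f origin-channel=latest/edge -f destination-channel=1.8/stable
done > promote-commands.sh
cat promote-commands.sh  # review before running: sh promote-commands.sh
```

A dry run like this makes it easy to review the full promotion batch before any charm actually moves channels.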

DnPlas commented 5 months ago

While canonical/istio-operators#405 fixed the problem for some charms in the istio-operators CI, it did not solve the problem entirely. At first glance it looks like podspec charms deployed from Charmhub are having some trouble; this is the case for kubeflow-volumes. There is an ongoing conversation with the juju team here.

DnPlas commented 4 months ago

Because juju 3.5 became available before the bump to 3.4 was completed, the team has decided to go with that version instead. The work for this effort is not affected.

DnPlas commented 3 months ago

Since all of the GitHub CIs are now running juju 3.5, we can close this issue.