Closed: DnPlas closed this issue 3 months ago
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5503.
This message was autogenerated
Initial tests show that:
$ juju controllers
Use --refresh option with this command to see the latest information.
Controller Model User Access Cloud/Region Models Nodes HA Version
uk8s* kubeflow admin superuser microk8s/localhost 2 1 - 3.4.1
$ juju --version
3.4.1-genericlinux-amd64
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kubeflow-katib 0.15.0 requires grpcio==1.41.1, but you have grpcio 1.51.3 which is incompatible.
kubeflow-katib 0.15.0 requires protobuf==3.19.5, but you have protobuf 3.20.3 which is incompatible.
kfp 2.4.0 requires kubernetes<27,>=8.0.0, but you have kubernetes 28.1.0 which is incompatible.
------------------------------ Captured log call -------------------------------
INFO test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
=========================== short test summary info ============================
FAILED test_notebooks.py::test_notebook[e2e-wine-kfp-mlflow-seldon] - Failed:...
FAILED test_notebooks.py::test_notebook[katib-integration] - Failed: AssertionError: Katib Experiment was not successful.
FAILED test_notebooks.py::test_notebook[mlflow-integration] - Failed: Noteboo...
FAILED test_notebooks.py::test_notebook[mlflow-kserve] - Failed: Notebook exe...
FAILED test_notebooks.py::test_notebook[mlflow-minio-integration] - Failed: N...
FAILED test_notebooks.py::test_notebook[training-integration] - Failed: Noteb...
...
File "/home/ubuntu/shared/charmed-kubeflow-uats/driver/test_kubeflow_workloads.py", line 130, in test_kubeflow_workloads
pytest.fail(
File "/home/ubuntu/shared/charmed-kubeflow-uats/.tox/uats-remote/lib/python3.10/site-packages/_pytest/outcomes.py", line 198, in fail
raise Failed(msg=reason, pytrace=pytrace)
Failed: Something went wrong while running Job test-kubeflow/test-kubeflow. Please inspect the attached logs for more info...
...
E RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0
E RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0
The message does not seem to be related to the controller, and though we should look into it, we can discard this error as a blocker for bumping the juju version. I will create an issue on canonical/charmed-kubeflow-uats to follow up.
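For the follow-up, one possible mitigation is to retry reading pod logs instead of failing on the first attempt. A minimal sketch with the kubernetes Python client (the helper name and retry parameters are hypothetical; the actual UATs driver may use a different client):
import time
from kubernetes import client, config

def read_pod_logs_with_retry(name: str, namespace: str, attempts: int = 5, delay: float = 3.0) -> str:
    # Hypothetical helper: try a few times before giving up, since the pod
    # may not be ready to serve logs immediately after it is created.
    config.load_kube_config()
    core = client.CoreV1Api()
    for _ in range(attempts):
        try:
            return core.read_namespaced_pod_log(name=name, namespace=namespace)
        except client.ApiException:
            time.sleep(delay)
    raise RuntimeError(f"Failed to read logs for pod {namespace}/{name}")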
One of the limitations I have found is that juju 3.4 seems to not handle the unit statuses of podspec charms correctly. While most CKF charms follow the sidecar pattern, some of them are still podspec (like oidc-gatekeeper and kubeflow-volumes). The behaviour I am observing is shown here (with kubeflow-volumes):
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
test-istio uk8s microk8s/localhost 3.4.1 unsupported 12:26:36Z
App Version Status Scale Charm Channel Rev Address Exposed Message
istio-ingressgateway active 1 istio-gateway 0 10.152.183.62 no
istio-pilot active 1 istio-pilot 0 10.152.183.246 no
kubeflow-volumes res:oci-image@2261827 waiting 1 kubeflow-volumes 1.8/stable 260 no waiting for container
tensorboard-controller active 1 tensorboard-controller latest/edge 266 10.152.183.108 no
Unit Workload Agent Address Ports Message
istio-ingressgateway/0* active idle 10.1.60.139
istio-pilot/0* active idle 10.1.60.138
kubeflow-volumes/0* waiting idle waiting for container # <--- waiting for container, but container is running
tensorboard-controller/0* active idle 10.1.60.143
$ kubectl get pods -A | grep volumes
test-istio kubeflow-volumes-operator-0 1/1 Running 0 23m
Interestingly enough, this is not the case for oidc-gatekeeper when deploying with juju deploy, but it is the case when deploying with model.deploy (e.g. from a test case). This affects integration tests, as they time out waiting for all units to go to Active status.
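The failure mode is reproducible with a test along these lines; deploying the same charm with juju deploy reaches active, while the programmatic deploy below stalls in waiting (a minimal sketch using pytest-operator's ops_test fixture; the channel and timeout are illustrative):
import pytest

@pytest.mark.abort_on_fail
async def test_podspec_charm_goes_active(ops_test):
    # Deploy a podspec charm from Charmhub, as the integration tests do.
    await ops_test.model.deploy("kubeflow-volumes", channel="1.8/stable", trust=True)
    # With juju 3.4 the unit stays in "waiting for container" even though the
    # workload pod is Running, so this wait times out instead of succeeding.
    await ops_test.model.wait_for_idle(apps=["kubeflow-volumes"], status="active", timeout=600)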
I did a couple more tests and here are my findings:
- oidc-gatekeeper: did not present any trouble after upgrading juju and other deps. In https://github.com/canonical/oidc-gatekeeper-operator/pull/141 we can observe that all the tests passed and the unit was active and idle.
- kubeflow-volumes: I have sent two PRs for testing the behaviour: https://github.com/canonical/kubeflow-volumes-operator/pull/130 and https://github.com/canonical/kubeflow-volumes-operator/pull/129
- istio-operators: in this CI we are deploying kubeflow-volumes and oidc-gatekeeper. Here is the last run. I tried reproducing this issue on my local machine and was able to. After inspecting the logs from each of the failing charms, I get the following:
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "ingress-auth-relation-changed" hook (via hook dispatching script: dispatch)
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:11 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:20 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
I am currently talking to the juju team about it.
Update about oidc-gatekeeper and kubeflow-volumes:
When bumping each charm's CIs in both main and track/<version>, all the tests pass and succeed, which now makes it look like the istio-operators integration tests are the actual cause of the issue. To unblock that CI, I have tried swapping kubeflow-volumes with tensorboards-web-app as the ingress requirer application, just to see if there is a difference. In the long run, and to avoid having to deploy charms that change so often, we should have a generic ingress requirer charm that assists in checking the ingress relation, but doesn't actually perform anything.
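A minimal sketch of what such a generic requirer could look like, using plain ops and no workload; the relation name "ingress" is an assumption matching the provider side (it would need a matching requires entry in metadata.yaml), and the charm only records what it receives:
import logging

from ops.charm import CharmBase
from ops.main import main
from ops.model import ActiveStatus

logger = logging.getLogger(__name__)

class IngressRequirerMockCharm(CharmBase):
    """Hypothetical no-op charm that only exercises the ingress relation."""

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.start, self._on_start)
        # Assumes metadata.yaml declares a "requires: ingress" relation.
        self.framework.observe(self.on.ingress_relation_changed, self._on_ingress_changed)

    def _on_start(self, event):
        # Nothing to set up: go active immediately so tests only wait on the relation.
        self.unit.status = ActiveStatus()

    def _on_ingress_changed(self, event):
        # Log the provider's data so tests can assert on it; perform nothing else.
        logger.info("ingress relation data: %s", dict(event.relation.data[event.app]))

if __name__ == "__main__":
    main(IngressRequirerMockCharm)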
After closer inspection of each of the failing CIs, it looks like the charms that were deployed in the istio-operators CI were really outdated and had an ops version < 2.x, causing some collisions with juju 3.4. https://github.com/canonical/istio-operators/pull/405 should fix the issues in that repo's CI. For the rest of the repositories, I don't seem to be finding issues, but I'll keep an eye out for places where the charm version is outdated.
Related to: https://github.com/canonical/bundle-kubeflow/issues/857
I think this effort is big enough to be split into smaller tasks, and we should definitely involve more people from the team, as at the moment some changes have to happen manually. The way I think this task can be completed is by doing the following:
Bump all versions - Do a cannon run to bump all the instances of juju-channel in all .github/workflows/integrate.yaml files across repositories. At the same time, bump the versions of ops, pytest-operator, and python-libjuju (a rough sketch of this edit follows below).
The resulting PR should look something like https://github.com/canonical/oidc-gatekeeper-operator/pull/142
This change has to be made both in gh:track/<version> and main.
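A rough sketch of what the scripted bump could look like, assuming the repositories are checked out locally (the repo list, paths, and channel are illustrative, not the actual cannon configuration):
import re
from pathlib import Path

REPOS = ["oidc-gatekeeper-operator", "kubeflow-volumes-operator"]  # illustrative subset

def bump_juju_channel(repo_root: Path, new_channel: str = "3.4/stable") -> None:
    # Rewrite any existing juju-channel pin, e.g. "juju-channel: 3.1/stable".
    workflow = repo_root / ".github" / "workflows" / "integrate.yaml"
    text = workflow.read_text()
    workflow.write_text(re.sub(r"juju-channel:\s*\S+", f"juju-channel: {new_channel}", text))

for repo in REPOS:
    bump_juju_channel(Path.home() / "repos" / repo)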
Pin charm dependencies (kind of optional) - All the charms that get deployed as dependencies in integration tests must be pinned to their corresponding 1.8 stable channels in the gh:track/<version> branches (see the sketch after this item).
The resulting PR should look something like https://github.com/canonical/istio-operators/pull/405
This is kind of optional because, while some of the integration tests may pass without the pinning, the correct way to keep integration test runs repeatable is to always test with the same versions.
This change affects gh:track/<version>.
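In a test, the pinning amounts to deploying dependencies from fixed channels instead of latest/edge; a minimal sketch with pytest-operator's ops_test fixture (application names and channels are examples):
async def deploy_pinned_dependencies(ops_test):
    # Deploy dependencies from their 1.8 stable channels so that
    # track-branch CI runs always exercise the same charm versions.
    await ops_test.model.deploy("oidc-gatekeeper", channel="1.8/stable", trust=True)
    await ops_test.model.deploy("kubeflow-volumes", channel="1.8/stable", trust=True)
    await ops_test.model.wait_for_idle(status="active", timeout=1800)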
Related issue https://github.com/canonical/bundle-kubeflow/issues/857
NOTE: this effort is huge if done manually and could be automated. There is a proposal (https://warthogs.atlassian.net/browse/PF-4580) for automating it, but if we opt to go with the manual process because of customer timelines, we may need to assign ~5 charms to each engineer on the team and work on it.
Promote to stable - once all of the necessary changes are merged, all of our 30+ charms have to be promoted to stable.
This is a manual process: go to the repo → go to Actions → run the promote action for each charm in the repo. If we opt to go with the manual process because of customer timelines, we may need to assign ~5 charms to each engineer on the team and work on it.
We could add a workflow dispatch that promotes all charms, but this will add more work to the task. I suggest we do it, as we will need this in the future.
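If we add such a dispatch, kicking it off for every repo could itself be scripted; a sketch using the gh CLI via Python (the workflow file name, its inputs, and the repo list are assumptions about how each repo's promote action is set up):
import subprocess

REPOS = ["canonical/oidc-gatekeeper-operator", "canonical/kubeflow-volumes-operator"]  # illustrative

for repo in REPOS:
    # "promote.yaml" and its inputs are assumed names for each repo's promote action.
    subprocess.run(
        ["gh", "workflow", "run", "promote.yaml", "-R", repo,
         "-f", "origin-channel=1.8/edge", "-f", "destination-channel=1.8/stable"],
        check=True,
    )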
Manual testing - to ensure every charm can be deployed individually and as a bundle with juju 3.4.
After closer inspection of each of the failing CIs, it looks like the charms that were deployed in the istio-operators CI were really outdated and had an ops version < 2.x, causing some collisions with juju 3.4. canonical/istio-operators#405 should fix the issues in that repo's CI. For the rest of the repositories, I don't seem to be finding issues, but I'll keep an eye out for places where the charm version is outdated.
Related to: #857
While this change fixed the problem for some charms in the istio-operators CI, it did not solve the problem entirely. At first glance it looks like podspec charms deployed from Charmhub are having some trouble; this is the case for kubeflow-volumes.
There is an ongoing conversation with the juju team here.
Because juju 3.5 became available sooner than expected, the team has decided to go with that version instead of 3.4. The work for this is not affected.
Since all of the github CIs are now running juju 3.5, we can close this issue.
Context
According to the Juju roadmap & releases page, juju 3.1 support stops at the end of April 2024. The next supported version is 3.4, for which bug-fix support ends in April 2024 and security-fix support ends in July 2024. Because of this, and to provide better support for features in CKF, charms have to be tested with this version.
NOTE: while the juju 3.5 release is close, there are some features and user stories that need this bump, for instance canonical/istio-operators#398. After 3.5 is released, the team will have to go through this process again.
What needs to get done
Merge all of:
Definition of Done