canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Define supported Juju version for CKF 1.8 and 1.9 #940

Closed DnPlas closed 1 week ago

DnPlas commented 2 weeks ago

Context

Over the past few months, the team has hit several issues with certain juju versions. As we approach the CKF 1.9 release, and to provide better support for users and customers, we have to define which version of juju we'll use for testing and eventually communicate it as the supported one.

What needs to get done

  1. Look into issues that are affecting the bundle, both on 3.5 and 3.4
  2. Perform some testing to identify more potential issues
  3. Decide which version of juju to go with
  4. Based on the decision in step 3, we may need to change the CI of all repositories if we go with juju 3.4 (we are using 3.5 now)
  5. Create a discourse post or some sort of communication to let everybody know which version is supported
  6. Update the Supported versions entry in public docs

Definition of Done

A communication has been published stating which juju version to use, and that version is confirmed to work.

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5887.

This message was autogenerated

DnPlas commented 2 weeks ago

Identified issues

Juju 3.4.2

Juju 3.5.0

Juju 3.5.1

Workarounds

  1. Because our CI is using juju 3.5/stable, we had to pin the agent version to 3.5.0. This workaround works as long as we DO NOT integrate rocks. See https://github.com/canonical/notebook-operators/pull/374 and https://github.com/canonical/kfp-operators/pull/499 for reference.
  2. We had to revert the oci-image of the kubeflow-dashboard to use the upstream image instead of the rock. See https://github.com/canonical/kubeflow-dashboard-operator/pull/191 and https://github.com/canonical/kubeflow-dashboard-operator/pull/192 for reference.
  3. The mysql-k8s charm has applied stricter assumes rules to avoid the buggy versions. See https://github.com/canonical/mysql-k8s-operator/pull/431 for reference.
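
The idea behind workaround 3, refusing to run on juju versions with known issues, could be sketched in Python roughly as follows. This is not the charm's actual assumes expression (that lives in the linked PR). The version set is only taken from the "Identified issues" headings in this thread, the function names are hypothetical, and the suffixed version string shape is an assumption about what the juju client prints.

```python
import re

# Versions named under "Identified issues" in this thread.
# Assumption: illustrative only, not an exhaustive or official list.
KNOWN_AFFECTED = {(3, 4, 2), (3, 5, 0), (3, 5, 1)}


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '3.5.1'.

    A trailing suffix such as '-genericlinux-amd64' is ignored
    (assumed shape of the client's version output).
    """
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(text):
    """Return True if the version is one of the known-affected releases."""
    return parse_version(text) in KNOWN_AFFECTED
```

A CI job could call is_affected() on the reported client or agent version and fail early instead of deploying on a release that is known to break the bundle.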

DnPlas commented 2 weeks ago

Tests

Juju 3.4.3

I have deployed CKF 1.8/stable and I can confirm that neither https://bugs.launchpad.net/juju/+bug/2060943 nor any of the issues present in 3.5.x are happening.

Likewise, I deployed CKF latest/edge and it seems like there are no issues.

Juju 3.5.1

I have deployed CKF 1.8/stable and, as expected, some charms are failing with the mentioned errors. I could only test on this version because 3.5.2 hasn't been released yet.

In this case latest/edge will also fail.

Resolution options

The majority of the issues should be solved once juju 3.5.2 is released, which may happen in the week of June 24th. Historically, though, releases have been delayed on a few occasions (e.g. 3.4.3 was delayed for more than a month). That being said, we have two options:

Option 1: use juju 3.4.3 in all CI

With this option we could guarantee some stability as juju 3.4 has been around for a while and we know it works for deploying CKF latest/edge and 1.8/stable.

Pros:

Cons:

Option 2: use juju 3.5/stable

With this option we guarantee the juju version is newer and we are constantly testing it.

Pros:

Cons:

DnPlas commented 1 week ago

Conclusion

  1. If mysql-k8s is bumped before juju 3.5.2 is released, CKF is not going to work. This is going to be a release blocker.
  2. We do not know if 3.5.2 is going to be introducing new issues.
  3. We should avoid using juju "edge".
  4. It looks like we only have option 1
  5. We should ask the juju team to test our product before releasing, to avoid breakage.
  6. The team will be using 3.4/stable instead of 3.5/stable

Work from here

  1. Change the juju version to 3.4/stable in all CIs
  2. Let's have a testing plan with SQA for testing future versions of juju
  3. https://github.com/canonical/bundle-kubeflow/issues/942 and https://github.com/canonical/bundle-kubeflow/issues/927: we can close these issues, and we have to update those scripts to download the 3.4 binary.
  4. Generate a list of files that we have to update (for future reference)
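
For steps 1 and 4, one possible way to build the list of files is a small Python scan of each repository checkout for the channel string. This is a minimal sketch: the function name, the set of file suffixes, and the assumption that the channel appears literally as "3.5/stable" are all hypothetical. Matches still need manual review, since the same string could also be the channel of an unrelated charm.

```python
from pathlib import Path

# Assumption: CI and script files that may pin the juju channel use these suffixes.
SUFFIXES = {".yaml", ".yml", ".sh", ".ini", ".toml", ".tf"}


def find_channel_references(root, needle="3.5/stable"):
    """Return (relative_path, line_number, line) tuples for lines containing needle."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip git internals, directories, and files with other suffixes.
        if ".git" in rel.parts or not path.is_file() or path.suffix not in SUFFIXES:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((rel.as_posix(), number, line.strip()))
    return hits
```

Calling find_channel_references(".") from the root of a checkout and printing each tuple gives a per-repository list that can be pasted into the tracking issue.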

DnPlas commented 1 week ago

Closing this issue as the decision has been made, #944 will be used for tracking changes.