canonical / training-operator

Kubeflow Training Operator

training-operator failed to upgrade 1.6 to 1.7 #75

Closed i-chvets closed 1 year ago

i-chvets commented 1 year ago

Failed to reach active/idle:

training-operator blocked: K8S resources creation failed

DnPlas commented 1 year ago

Hi @i-chvets, do you mind adding steps to reproduce and some logs? Is this only failing when upgrading or on install as well?

DnPlas commented 1 year ago

I was able to get the same message: training-operator/0* blocked idle 10.1.216.35 K8S resources creation failed

Steps to reproduce

  1. juju deploy training-operator --channel 1.5/stable --trust #wait until it's active and idle
  2. juju refresh training-operator --channel latest/edge
  3. training-operator should be in BlockedStatus

Let's dig a bit deeper into the charm code to figure this one out.

beliaev-maksim commented 1 year ago

@DnPlas any issues on amd64?

juju deploy training-operator --channel 1.6/stable --trust       
ERROR selecting releases: charm or bundle not found for channel "1.6/stable", platform "amd64"
available releases are:
  channel "stable": available series are: focal
  channel "candidate": available series are: focal
  channel "beta": available series are: focal
  channel "edge": available series are: focal
  channel "1.3/stable": available series are: focal
  channel "1.5/beta": available series are: focal
  channel "1.5/edge": available series are: focal
  channel "1.3/candidate": available series are: focal
  channel "1.3/beta": available series are: focal
  channel "1.3/edge": available series are: focal
  channel "1.5/stable": available series are: focal
  channel "1.5/candidate": available series are: focal

beliaev-maksim commented 1 year ago

also

juju refresh training-operator --channel latest/edge --trust
ERROR option provided but not defined: --trust

What Juju version are you using?

i-chvets commented 1 year ago

Upgrading training-operator from stable (Rev 6) to latest/edge (Rev 135):

juju deploy training-operator --channel stable --trust
juju refresh training-operator --channel latest/edge

i-chvets commented 1 year ago

In the KF 1.6 bundle, training-operator is in the 1.5/stable channel (Rev 65). The proper testing procedure:

juju deploy training-operator --channel 1.5/stable --trust
juju refresh training-operator --channel latest/edge

i-chvets commented 1 year ago

To reproduce using local charm:

juju deploy ./training-operator_ubuntu-20.04-amd64.charm.old --resource="training-operator-image=kubeflow/training-operator:v1-8c6eab2" --trust
juju refresh training-operator --path=./training-operator_ubuntu-20.04-amd64.charm --resource="training-operator-image=kubeflow/training-operator:v1-27e5499"
Unit                  Workload  Agent  Address     Ports  Message
training-operator/0*  blocked   idle   10.1.59.79         ...: Apply failed with 1 conflict: conflict with "python-httpx" using rbac.authorization.k8s.io/v1: .rules

beliaev-maksim commented 1 year ago

@i-chvets it might be that the podspec charm configs conflict with those that are pushed by Juju. Maybe during the upgrade we need to clean them up before installing the sidecar?

DnPlas commented 1 year ago

@i-chvets I am confused, these steps to reproduce, the error there, and the original description of this bug do not match. Could you please confirm what exactly is the issue you are running into and add steps to reproduce?

Also, please note that v1.5 and v1.6 of the training-operator are both sidecar charms, no need to involve any podspec charm.

i-chvets commented 1 year ago

@i-chvets it might be that the podspec charm configs conflict with those that are pushed by Juju. Maybe during the upgrade we need to clean them up before installing the sidecar?

~~It looks like it is due to the order of the K8S manifests being applied during refresh. In some charms (Seldon, Training Operator) we separated the CRD and K8S resources deployment, and in some cases the CRDs must be installed before the other manifests. In other charms the CRDs and other manifests are applied by the same resource handler, which sorts the resources before it applies them. To solve the issue, we either handle the ordering in the charm, or we revert to using a single K8S handler to do the ordering for us. I am leaning towards putting everything back into a single handler. I need to confirm this first.~~

The above was not the reason.

i-chvets commented 1 year ago

@i-chvets I am confused, these steps to reproduce, the error there, and the original description of this bug do not match. Could you please confirm what exactly is the issue you are running into and add steps to reproduce?

Also, please note that v1.5 and v1.6 of the training-operator are both sidecar charms, no need to involve any podspec charm.

I just reproduced this using local charms. I don't think the versions really matter in this case; I will confirm and re-create the steps with the proper versions 1.5 and 1.6. I took the PodSpec version to ensure the charm is old enough. It also fails with the sidecar rev93 charm: a refresh of local charms from rev93 to latest fails.

DnPlas commented 1 year ago

It looks like it is due to the order of the K8S manifests being applied during refresh. In some charms (Seldon, Training Operator) we separated the CRD and K8S resources deployment, and in some cases the CRDs must be installed before the other manifests. In other charms the CRDs and other manifests are applied by the same resource handler, which sorts the resources before it applies them. To solve the issue, we either handle the ordering in the charm, or we revert to using a single K8S handler to do the ordering for us. I am leaning towards putting everything back into a single handler. I need to confirm this first.

@i-chvets before changing anything, we need to understand what issue you are running into. Please update the description of this issue with logs, error messages, and steps to reproduce. You have referenced two different error messages and two different sets of steps to reproduce, and that makes it hard to understand.

i-chvets commented 1 year ago

It looks like it is due to the order of the K8S manifests being applied during refresh. In some charms (Seldon, Training Operator) we separated the CRD and K8S resources deployment, and in some cases the CRDs must be installed before the other manifests. In other charms the CRDs and other manifests are applied by the same resource handler, which sorts the resources before it applies them. To solve the issue, we either handle the ordering in the charm, or we revert to using a single K8S handler to do the ordering for us. I am leaning towards putting everything back into a single handler. I need to confirm this first.

@i-chvets before changing anything, we need to understand what issue you are running into. Please update the description of this issue with logs, error messages, and steps to reproduce. You have referenced two different error messages and two different sets of steps to reproduce, and that makes it hard to understand.

The error message you see in the local-charm steps is the one that is buried deep in the framework. I just changed the top-level generic "K8S resources creation failed" message to actually report the error the framework gives us. The instructions are the same: refresh from one version to another, given that the manifests differ between the versions.
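
For illustration only, a minimal sketch of surfacing the framework's real error in the unit status instead of the generic message; the function name, arguments, and manifest path are assumptions, not the actual charm code:

from lightkube import ApiError, Client, codecs
from ops.model import ActiveStatus, BlockedStatus

def apply_manifests(unit, client: Client, manifest_path: str) -> bool:
    # Apply every object in the manifest and report the underlying ApiError
    # in the unit status rather than a generic message.
    # Note: the client must have been created with a field_manager,
    # e.g. Client(field_manager="lightkube"), for apply() to work.
    try:
        with open(manifest_path) as f:
            for obj in codecs.load_all_yaml(f):
                client.apply(obj)
    except ApiError as err:
        unit.status = BlockedStatus(f"K8S resources creation failed: {err}")
        return False
    unit.status = ActiveStatus()
    return True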

i-chvets commented 1 year ago

In the previous training-operator charm (1.5/stable), resources were created without specifying a field_manager. After the rewrite with chisme, which uses the Server-Side Apply method, field_manager is a required parameter and is set to lightkube in our latest charm. As a result, changes to the CRDs are not allowed, because the field managers differ: the unspecified one in the previous charm versus lightkube in the latest charm. Explanation of the reason for the conflict: https://kubernetes.io/docs/reference/using-api/server-side-apply/

There is an option to force the change. It will require changes to both the charm code and the chisme code.
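
As a rough sketch of that direction (the function and argument names are assumptions, not the actual charm/chisme change), the resource-applying code would need to pass lightkube's force flag through so the new field manager can take ownership of the conflicting fields:

from lightkube import Client, codecs

def apply_resources(manifest_path: str, force: bool = False):
    # Apply manifests with the charm's field manager; force=True tells the
    # API server to take over fields currently owned by another manager
    # (e.g. the one recorded by the previous charm version).
    client = Client(field_manager="lightkube")
    with open(manifest_path) as f:
        for obj in codecs.load_all_yaml(f):
            client.apply(obj, force=force)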

i-chvets commented 1 year ago

Verified by this script lightkube-client.py

#!/usr/bin/python3
# Re-apply the training-operator CRD manifests with the "lightkube" field
# manager, forcing the apply whenever the server reports a Server-Side Apply
# conflict with the previous manager.

from lightkube import ApiError, Client, codecs
from lightkube.resources.apiextensions_v1 import CustomResourceDefinition

# Same field manager as the latest charm.
client = Client(field_manager="lightkube")

with open('crds_manifests.yaml') as f:
    for obj in codecs.load_all_yaml(f):
        try:
            # Plain Server-Side Apply; raises a 409 ApiError on conflict.
            res = client.apply(obj)
        except ApiError as err:
            print(f"Failed to apply {obj.metadata.name}: {err}")
            print("Apply with force='True'")
            # Retry, taking ownership of the conflicting fields.
            res = client.apply(obj, force=True)

After deployment of 1.5/stable:

$ kubectl get crds -l app.juju.is/created-by=training-operator -o=yaml | grep "ontroller-gen.kubebuilder.io/version"
      controller-gen.kubebuilder.io/version: v0.6.0
      controller-gen.kubebuilder.io/version: v0.6.0
      controller-gen.kubebuilder.io/version: v0.6.0
      controller-gen.kubebuilder.io/version: v0.6.0
      controller-gen.kubebuilder.io/version: v0.6.0

Use the script above to update the CRDs. Result:

$ ./lightkube-client.py 
Failed to apply xgboostjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply tfjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply pytorchjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply mxjobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
Failed to apply mpijobs.kubeflow.org: Apply failed with 2 conflicts: conflicts with "python-httpx" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Apply with force='True'
$ kubectl get crds -l app.juju.is/created-by=training-operator -o=yaml | grep "ontroller-gen.kubebuilder.io/version"
      controller-gen.kubebuilder.io/version: v0.10.0
      controller-gen.kubebuilder.io/version: v0.10.0
      controller-gen.kubebuilder.io/version: v0.10.0
      controller-gen.kubebuilder.io/version: v0.10.0
      controller-gen.kubebuilder.io/version: v0.10.0

i-chvets commented 1 year ago

Need to address this issue first.

i-chvets commented 1 year ago

Proposed fix:

i-chvets commented 1 year ago

Latest upgrade fails:

unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.juju-log Encountered a conflict: Apply failed with 1 conflict: conflict with "python-httpx" using rbac.authorization.k8s.io/v1: .rules
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready Error in sys.excepthook:
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready Traceback (most recent call last):
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     self.emit(record)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/ops/log.py", line 41, in emit
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     self.model_backend.juju_log(record.levelname, self.format(record))
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     return fmt.format(record)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     record.exc_text = self.formatException(record.exc_info)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     traceback.print_exception(ei[0], ei[1], tb, None, sio)
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     for line in TracebackException(
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/traceback.py", line 617, in format
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     yield from self.format_exception_only()
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready     stype = smod + '.' + stype
unit-training-operator-0: 15:11:04 WARNING unit.training-operator/0.training-operator-pebble-ready TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

i-chvets commented 1 year ago

Looks like it happens when the pebble-ready event is handled. It cannot resolve, and the upgrade-charm event handler never gets to run. I will add a troubleshooting section to the upgrade guide on how to recover, but we need to look into it. I think the pebble-ready event handler needs rework.
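
For illustration, a rough sketch of that kind of rework (an assumption, not the merged fix): the pebble-ready handler catches the apply conflict and sets a blocked status instead of raising, so the hook does not error out and later events such as upgrade-charm can still run. The helper `_apply_k8s_resources` is an assumed name.

from lightkube import ApiError
from ops.model import ActiveStatus, BlockedStatus, MaintenanceStatus

def _on_training_operator_pebble_ready(self, event):
    self.unit.status = MaintenanceStatus("Applying K8S resources")
    try:
        self._apply_k8s_resources()  # assumed helper that applies the manifests
    except ApiError as err:
        # Surface the conflict but keep the hook from failing.
        self.unit.status = BlockedStatus(f"K8S resources creation failed: {err}")
        return
    self.unit.status = ActiveStatus()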

i-chvets commented 1 year ago

Fix has been merged.