canonical / seldon-core-operator

Seldon Core Operator
Apache License 2.0

upgrade from 1.14 to 1.15 fails due to 409 conflict during k8s resource creation #147

Closed ca-scribner closed 1 year ago

ca-scribner commented 1 year ago

During upgrade the charm gets stuck with 409 conflict errors during k8s resource creation.

Reproduction steps:

juju deploy seldon-core seldon-controller-manager --channel 1.14/stable
juju trust seldon-controller-manager --scope=cluster
juju refresh seldon-controller-manager --channel 1.15/edge

Which yields logs of:

unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log Rendering manifests
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/namespaces/kubeflow/roles/leader-election-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/manager-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/manager-sas-role?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-edit-seldon?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/namespaces/kubeflow/rolebindings/leader-election-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/manager-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/manager-sas-rolebinding?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/seldon-validating-webhook-configuration?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/seldon-webhook-service?fieldManager=lightkube "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log Reconcile completed successfully
unit-seldon-controller-manager-0: 15:21:15 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
unit-seldon-controller-manager-0: 15:21:16 INFO unit.seldon-controller-manager/0.juju-log Rendering manifests
unit-seldon-controller-manager-0: 15:21:17 INFO unit.seldon-controller-manager/0.juju-log HTTP Request: PATCH https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/seldondeployments.machinelearning.seldon.io?fieldManager=lightkube "HTTP/1.1 409 Conflict"
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.juju-log Encountered a conflict: Apply failed with 1 conflict: conflict with "manager" using apiextensions.k8s.io/v1: .spec.versions
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Error in sys.excepthook:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.emit(record)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/log.py", line 41, in emit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.model_backend.juju_log(record.levelname, self.format(record))
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return fmt.format(record)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     record.exc_text = self.formatException(record.exc_info)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     traceback.print_exception(ei[0], ei[1], tb, None, sio)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     for line in TracebackException(
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/traceback.py", line 617, in format
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     yield from self.format_exception_only()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     stype = smod + '.' + stype
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Original exception was:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     resp.raise_for_status()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise HTTPStatusError(message, request=request, response=self)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/seldondeployments.machinelearning.seldon.io?fieldManager=lightkube'
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install For more information check: https://httpstatuses.com/409
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install During handling of the above exception, another exception occurred:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 331, in _apply_k8s_resources
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.crd_resource_handler.apply()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 351, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise e
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 336, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     apply_many(client=self.lightkube_client, objs=resources, force=force)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/charmed_kubeflow_chisme/lightkube/batch/_many.py", line 64, in apply_many
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     returns[i] = client.apply(
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/client.py", line 457, in apply
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return self.patch(type(obj), name, obj, namespace=namespace,
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/client.py", line 325, in patch
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 245, in request
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     return self.handle_response(method, resp, br)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self.raise_for_status(resp)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise transform_exception(e)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install lightkube.core.exceptions.ApiError: Apply failed with 1 conflict: conflict with "manager" using apiextensions.k8s.io/v1: .spec.versions
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install The above exception was the direct cause of the following exception:
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install 
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install Traceback (most recent call last):
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 523, in <module>
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     main(SeldonCoreOperator)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/main.py", line 439, in main
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     framework.reemit()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 840, in reemit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self._reemit()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "/var/lib/juju/agents/unit-seldon-controller-manager-0/charm/venv/ops/framework.py", line 919, in _reemit
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     custom_handler(event)
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 357, in _on_install
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     self._apply_k8s_resources()
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install   File "./src/charm.py", line 340, in _apply_k8s_resources
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install     raise GenericCharmRuntimeError("CRD resources creation failed") from error
unit-seldon-controller-manager-0: 15:21:17 WARNING unit.seldon-controller-manager/0.install <unknown>GenericCharmRuntimeError: CRD resources creation failed
unit-seldon-controller-manager-0: 15:21:17 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1
unit-seldon-controller-manager-0: 15:21:17 INFO juju.worker.uniter awaiting error resolution for "install" hook

where we see 409 conflict errors when creating the CRDs.

(feels similar to canonical/training-operator#104, but that issue was going between sidecar charms whereas this is going from podspec to sidecar)

ca-scribner commented 1 year ago

Looking through the logs, it looks like this happened during an install event. _on_install does not force-apply the Kubernetes resources, which explains why it fails with a 409 error.

Not sure how we reach an install event at all during an upgrade - maybe this is a quirk about going from podspec to sidecar?

DnPlas commented 1 year ago

I have read about this kind of conflict and about SSA and CSA, and I have also tried reproducing the issue myself. Here are some important notes:

Understanding Server-Side Apply (SSA) and Client-Side Apply (CSA)

Podspec charms use the CSA method for applying Kubernetes resources. With this method, the client (kubectl, or whatever client juju uses) is responsible for diffing the desired vs current state of the resources. Lightkube, on the other hand, uses SSA, where the API server records a field manager for each field of a Kubernetes resource and uses it to track who changes that field. From [2]:

Fields are assigned a “field manager” which identifies the client that owns them. If you apply a manifest with Kubectl, then Kubectl will be the designated manager. A field’s manager could also be a controller or an external integration that updates your objects. Managers are forbidden from updating each other’s fields. You’ll be blocked from changing a field with kubectl apply if it’s currently owned by a different controller.

Links: [1] Server Side Apply [2] What Is Kubernetes Server-Side Apply (SSA)?
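
To make the field-manager idea concrete, here is a toy model in plain Python. It is only a sketch under simplifying assumptions: ToyObject and ConflictError are made-up names, each field has exactly one owner (real SSA also allows shared ownership), and the "old"/"new" values are placeholders rather than the real CRD versions. It mimics the situation in the logs above, where "lightkube" tries to change .spec.versions while "manager" owns it.

```python
class ConflictError(Exception):
    """Raised when an apply would change a field owned by another manager."""


class ToyObject:
    """Toy stand-in for an API object that tracks one owner per field."""

    def __init__(self):
        self.values = {}  # field path -> value
        self.owners = {}  # field path -> field manager name

    def apply(self, manager, fields, force=False):
        # A conflict is a field owned by someone else whose value would change.
        conflicts = [
            path
            for path, value in fields.items()
            if path in self.owners
            and self.owners[path] != manager
            and self.values[path] != value
        ]
        if conflicts and not force:
            raise ConflictError(f"conflict on: {sorted(conflicts)}")
        # No conflict, or the caller forced the apply: write and take ownership.
        for path, value in fields.items():
            self.values[path] = value
            self.owners[path] = manager


crd = ToyObject()
crd.apply("manager", {".spec.versions": ["old"]})  # placeholder value

try:
    crd.apply("lightkube", {".spec.versions": ["new"]})  # placeholder value
    outcome = "applied"
except ConflictError:
    outcome = "conflict"

# Forcing the apply overwrites the value and transfers ownership.
crd.apply("lightkube", {".spec.versions": ["new"]}, force=True)
```

In this model the unforced apply ends with outcome == "conflict", and only the forced apply hands .spec.versions over to "lightkube".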

Conflict resolution

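With SSA, "A conflict is a special status error that occurs when an Apply operation tries to change a field, which another user also claims to manage." From the official docs, the options for resolving conflicts are: overwrite the value and become the sole manager (by forcing the apply), don't overwrite the value and give up the management claim, or don't overwrite the value and become a shared manager.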

Identify SSA and CSA managed kubernetes resources

To identify which method was used for applying a resource, it is enough to look at the YAML of the object. If it has a last-applied-configuration annotation, the resource is managed by CSA; it's SSA managed if metadata.managedFields is present.

For example:

---- CSA ----

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"nginx","namespace":"default"},"spec":{"containers":[{"image":"nginx:latest","name":"nginx"}]}}
  creationTimestamp: "2022-11-24T14:20:07Z"
  name: nginx
  namespace: default

---- SSA ----

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-11-24T16:02:29Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:containers:
          k:{"name":"nginx"}:
            .: {}
            f:image: {}
            f:name: {}
    manager: kubectl
    operation: Apply
    time: "2022-11-24T16:02:29Z"
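
The rule of thumb above can be sketched as a small helper that works on a manifest already loaded into a dict. This is an illustrative sketch, not part of the charm: apply_method and the two sample dicts are made up for the example, and the samples are trimmed versions of the YAML above. The annotation is checked first so that an object carrying both markers is reported as CSA.

```python
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def apply_method(obj):
    """Return 'CSA', 'SSA', or 'unknown' for a manifest loaded as a dict."""
    metadata = obj.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    if LAST_APPLIED in annotations:
        return "CSA"
    if metadata.get("managedFields"):
        return "SSA"
    return "unknown"


# Trimmed versions of the two Pod examples above.
csa_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "nginx",
        "annotations": {LAST_APPLIED: '{"apiVersion":"v1","kind":"Pod"}'},
    },
}
ssa_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "managedFields": [{"manager": "kubectl", "operation": "Apply"}],
    },
}

print(apply_method(csa_pod), apply_method(ssa_pod))
```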

Upgrading from CSA to SSA

Look here

What's wrong with seldon-core-operator then?

Version 1.14 of seldon-core-operator is a podspec charm, which means it was created and managed by the CSA method. When we try to juju refresh to 1.15, a sidecar charm managed by lightkube which uses SSA, there appears to be a conflict. According to the documentation, upgrading from CSA to SSA is fairly easy, but conflicts may be raised:

Keep the last-applied-configuration annotation up to date. The annotation infers client-side apply's managed fields. Any fields not managed by client-side apply raise conflicts. For example, if you used kubectl scale to update the replicas field after client-side apply, then this field is not owned by client-side apply and creates conflicts on kubectl apply --server-side.

This is likely the case for seldon-core-operator, though it may need more investigation.
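
As a rough illustration of the kubectl scale example in the quote, the sketch below lists spec fields that are present on a live object but absent from its last-applied-configuration annotation. It is a simplification with made-up names (leaf_paths, fields_outside_last_applied, and the sample Deployment), and on a real cluster it would over-report, because the server also fills in defaulted fields. Treat the result as candidates for conflicts, not as what the API server would actually reject.

```python
import json

LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def leaf_paths(node, prefix):
    """Yield dotted paths to every leaf under node (lists count as leaves)."""
    if isinstance(node, dict) and node:
        for key, value in node.items():
            yield from leaf_paths(value, f"{prefix}.{key}")
    else:
        yield prefix


def fields_outside_last_applied(live):
    """Return spec paths set on the live object but not recorded by CSA."""
    recorded = json.loads(live["metadata"]["annotations"][LAST_APPLIED])
    recorded_paths = set(leaf_paths(recorded.get("spec", {}), ".spec"))
    live_paths = set(leaf_paths(live.get("spec", {}), ".spec"))
    return sorted(live_paths - recorded_paths)


# Hypothetical Deployment: applied client-side without replicas, then scaled.
applied = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo"},
    "spec": {"selector": {"matchLabels": {"app": "demo"}}},
}
live = {
    "metadata": {"name": "demo", "annotations": {LAST_APPLIED: json.dumps(applied)}},
    "spec": {"replicas": 3, "selector": {"matchLabels": {"app": "demo"}}},
}

print(fields_outside_last_applied(live))
```

Here the only field reported is .spec.replicas, the one that was changed outside of client-side apply.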

Reproducing the issue

I was able to reproduce the issue, but in the end the charm seems to resolve its own conflicts and go active after a couple minutes.

Should we use force=True every time?

SSA was introduced to make conflicts detectable, to allow flexible resolution strategies, and to prevent unintentional or accidental overwrites from happening without warning. Always setting this option to True may be okay for most charm use cases, but it's important to understand why. To me, it seems like we want the charm to be solely responsible for patching and updating the Kubernetes resources tied to it, and therefore the only fieldManager making changes to them, so always using force=True is a way of ensuring this. With it, we declare that the fieldManager for all fields is whatever the charm dictates and that field values will be OVERWRITTEN whenever the charm calls the apply() method.
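
A minimal sketch of the "always force" policy follows. It assumes a client whose apply() method accepts field_manager and force keyword arguments, as the lightkube call in the traceback above suggests. RecordingClient and apply_all are hypothetical names used so the example runs without a cluster; this is not the charm's actual code.

```python
class RecordingClient:
    """Hypothetical stand-in for a Kubernetes client; it only records calls."""

    def __init__(self):
        self.calls = []

    def apply(self, obj, field_manager=None, force=False):
        self.calls.append((obj["metadata"]["name"], field_manager, force))
        return obj


def apply_all(client, resources, field_manager="lightkube", force=True):
    """Apply every resource, taking ownership of conflicting fields by default."""
    return [
        client.apply(obj, field_manager=field_manager, force=force)
        for obj in resources
    ]


client = RecordingClient()
crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "seldondeployments.machinelearning.seldon.io"},
}
apply_all(client, [crd])
print(client.calls)
```

Keeping force as a parameter with a default of True means a caller can still opt out for a resource the charm is meant to share with another manager.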

NohaIhab commented 1 year ago

fixed by #148