canonical / kserve-operators

Charmed KServe

Kserve controller fails to install in complete Kubeflow deployment #96

Closed by i-chvets 1 year ago

i-chvets commented 1 year ago

Description

Kserve controller fails to install in complete Kubeflow deployment

Found during an upgrade: kserve-controller was deployed last, after the rest of Kubeflow had been upgraded. The failure looks like a conflict when updating a Kubernetes resource, most likely related to field management.

Deployment status when KServe was deployed last (juju deploy kserve-controller --channel 0.10/stable --trust):

Model     Controller          Cloud/Region        Version  SLA          Timestamp
kubeflow  microk8s-localhost  microk8s/localhost  2.9.42   unsupported  15:28:41Z

App                        Version                    Status   Scale  Charm                    Channel         Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b      active       1  admission-webhook        1.7/stable      134  10.152.183.103  no       
argo-controller            res:oci-image@669ebd5      active       1  argo-controller          3.3/stable      236                  no       
argo-server                res:oci-image@576d038      active       1  argo-server              3.3/stable      185                  no       
dex-auth                                              active       1  dex-auth                 2.31/stable     224  10.152.183.109  no       
istio-pilot                                           active       1  istio-pilot              1.16/stable     387  10.152.183.67   no       
jupyter-controller         res:oci-image@1167186      active       1  jupyter-controller       1.7/stable      607                  no       
jupyter-ui                 res:oci-image@d55c600      active       1  jupyter-ui               1.7/stable      534  10.152.183.209  no       
katib-controller           res:oci-image@111495a      active       1  katib-controller         0.15/stable     206  10.152.183.99   no       
katib-db                   mariadb/server:10.3        active       1  charmed-osm-mariadb-k8s  latest/stable    35  10.152.183.192  no       ready
katib-db-manager           res:oci-image@2fd18aa      active       1  katib-db-manager         0.15/stable     180  10.152.183.112  no       
katib-ui                   res:oci-image@c7dc04a      active       1  katib-ui                 0.15/stable     194  10.152.183.44   no       
kfp-api                    res:oci-image@e08e41d      active       1  kfp-api                  2.0/stable      298  10.152.183.182  no       
kfp-db                     mariadb/server:10.3        active       1  charmed-osm-mariadb-k8s  latest/stable    35  10.152.183.126  no       ready
kfp-persistence            res:oci-image@516e6b8      active       1  kfp-persistence          2.0/stable      294                  no       
kfp-profile-controller     res:oci-image@6278f3e      active       1  kfp-profile-controller   2.0/stable      274  10.152.183.204  no       
kfp-schedwf                res:oci-image@1f6d4b5      active       1  kfp-schedwf              2.0/stable      312                  no       
kfp-ui                     res:oci-image@ae72602      active       1  kfp-ui                   2.0/stable      297  10.152.183.168  no       
kfp-viewer                 res:oci-image@c2f2ee1      active       1  kfp-viewer               2.0/stable      310                  no       
kfp-viz                    res:oci-image@3de6f3c      active       1  kfp-viz                  2.0/stable      281  10.152.183.226  no       
knative-eventing                                      active       1  knative-eventing         1.8/stable      165  10.152.183.151  no       
knative-operator                                      active       1  knative-operator         1.8/stable      142  10.152.183.156  no       
knative-serving                                       active       1  knative-serving          1.8/stable      164  10.152.183.34   no       
kserve-controller                                     waiting      1  kserve-controller        0.10/stable      86  10.152.183.102  no       installing agent
kubeflow-dashboard         res:oci-image@6fe6eec      active       1  kubeflow-dashboard       1.7/stable      307  10.152.183.79   no       
kubeflow-profiles          res:profile-image@cfd6935  active       1  kubeflow-profiles        1.7/stable      269  10.152.183.48   no       
kubeflow-roles                                        active       1  kubeflow-roles           1.7/stable      113  10.152.183.33   no       
kubeflow-volumes           res:oci-image@d261609      active       1  kubeflow-volumes         1.7/stable      178  10.152.183.11   no       
metacontroller-operator                               active       1  metacontroller-operator  2.0/stable      117  10.152.183.224  no       
minio                      res:oci-image@1755999      active       1  minio                    ckf-1.7/stable  186  10.152.183.128  no       
oidc-gatekeeper            res:oci-image@6b720b8      active       1  oidc-gatekeeper          ckf-1.7/stable  176  10.152.183.154  no       
seldon-controller-manager  res:oci-image@eb811b6      active       1  seldon-core              1.15/stable     298  10.152.183.31   no       
tensorboard-controller     res:oci-image@c52f7c2      active       1  tensorboard-controller   1.7/stable      156  10.152.183.77   no       
tensorboards-web-app       res:oci-image@929f55b      active       1  tensorboards-web-app     1.7/stable      158  10.152.183.251  no       
training-operator                                     active       1  training-operator        1.6/stable      190  10.152.183.12   no       

Unit                          Workload  Agent  Address      Ports              Message
admission-webhook/1*          active    idle   10.1.27.89   4443/TCP           
argo-controller/0*            active    idle   10.1.27.95                      
argo-server/0*                active    idle   10.1.27.82   2746/TCP           
dex-auth/0*                   active    idle   10.1.27.77                      
istio-pilot/0*                active    idle   10.1.27.85                      
jupyter-controller/1*         active    idle   10.1.27.83                      
jupyter-ui/0*                 active    idle   10.1.27.80                      
katib-controller/1*           active    idle   10.1.27.79   443/TCP,8080/TCP   
katib-db-manager/1*           active    idle   10.1.27.86   6789/TCP           
katib-db/0*                   active    idle   10.1.27.99   3306/TCP           ready
katib-ui/0*                   active    idle   10.1.27.93                      
kfp-api/0*                    active    idle   10.1.27.103  8888/TCP,8887/TCP  
kfp-db/0*                     active    idle   10.1.27.108  3306/TCP           ready
kfp-persistence/0*            active    idle   10.1.27.126                     
kfp-profile-controller/0*     active    idle   10.1.27.94   80/TCP             
kfp-schedwf/0*                active    idle   10.1.27.119                     
kfp-ui/0*                     active    idle   10.1.27.128  3000/TCP           
kfp-viewer/0*                 active    idle   10.1.27.120                     
kfp-viz/0*                    active    idle   10.1.27.122  8888/TCP           
knative-eventing/0*           active    idle   10.1.27.154                     
knative-operator/0*           active    idle   10.1.27.98                      
knative-serving/0*            active    idle   10.1.27.125                     
kserve-controller/0*          error     idle   10.1.27.163                     hook failed: "install"
kubeflow-dashboard/0*         active    idle   10.1.27.104                     
kubeflow-profiles/0*          active    idle   10.1.27.121                     
kubeflow-roles/0*             active    idle   10.1.27.101                     
kubeflow-volumes/1*           active    idle   10.1.27.90   5000/TCP           
metacontroller-operator/0*    active    idle   10.1.27.91                      
minio/0*                      active    idle   10.1.27.111  9000/TCP,9001/TCP  
oidc-gatekeeper/1*            active    idle   10.1.27.124  8080/TCP           
seldon-controller-manager/0*  active    idle   10.1.27.114                     
tensorboard-controller/1*     active    idle   10.1.27.88   9443/TCP           
tensorboards-web-app/1*       active    idle   10.1.27.70   5000/TCP           
training-operator/0*          active    idle   10.1.27.115      

Juju debug log:

unit-kserve-controller-0: 15:24:40 ERROR unit.kserve-controller/0.juju-log Apply failed with 2 conflicts: conflicts with "python-httpx" using rbac.authorization.k8s.io/v1:
- .metadata.labels.app
- .metadata.labels.app.kubernetes.io/name
unit-kserve-controller-0: 15:24:40 ERROR unit.kserve-controller/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/lightkube/core/generic_client.py", line 188, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/httpx/_models.py", line 749, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-kserve-admin?fieldManager=lightkube'
For more information check: https://httpstatuses.com/409

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 308, in <module>
    main(KServeControllerCharm)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 435, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event)  # noqa
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 824, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 899, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 187, in _on_install
    self.k8s_resource_handler.apply()
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 234, in apply
    raise e
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/charmed_kubeflow_chisme/kubernetes/_kubernetes_resource_handler.py", line 219, in apply
    apply_many(client=self.lightkube_client, objs=resources, force=force)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/charmed_kubeflow_chisme/lightkube/batch/_many.py", line 64, in apply_many
    returns[i] = client.apply(
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/lightkube/core/client.py", line 456, in apply
    return self.patch(type(obj), name, obj, namespace=namespace,
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/lightkube/core/client.py", line 325, in patch
    return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/lightkube/core/generic_client.py", line 245, in request
    return self.handle_response(method, resp, br)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/lightkube/core/generic_client.py", line 196, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/lightkube/core/generic_client.py", line 190, in raise_for_status
    raise transform_exception(e)
lightkube.core.exceptions.ApiError: Apply failed with 2 conflicts: conflicts with "python-httpx" using rbac.authorization.k8s.io/v1:
- .metadata.labels.app
- .metadata.labels.app.kubernetes.io/name
unit-istio-pilot-0: 15:24:41 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kserve-controller-0: 15:24:41 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1

DnPlas commented 1 year ago

There is a conflict between kubeflow-roles 1.6/stable and kserve-controller 0.10/stable: both charms create, or attempt to create, the same resources.

Error

So here's what happens:

  1. kubeflow-roles 1.6/stable is deployed and creates the following ClusterRoles: kubeflow-kserve-edit, kubeflow-kserve-view, kubeflow-kserve-admin. This happens because in 1.6, kubeflow-roles was in charge of all the aggregation roles for KServe.

  2. When you try to install kserve-controller 0.10/stable in a model where kubeflow-roles 1.6/stable is present, or was deployed and then refreshed to 1.7/stable, the conflict occurs because kserve-controller also wants to create those ClusterRoles.

Reason:

kserve-controller was recently changed: the latest version renders and applies the charm's own Roles and ClusterRoles, including the aggregation roles, so it no longer needs kubeflow-roles to create them.
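
For reference, the debug log already says which fields are contested and who owns them: the request was sent with fieldManager=lightkube, while the labels are owned by a manager recorded as "python-httpx". Below is a minimal sketch of pulling that information out of the error text. `parse_apply_conflict` is a hypothetical helper, not part of lightkube or the charm, and it assumes the message has exactly the shape shown in the log.

```python
import re

# Message text copied from the debug log earlier in this issue.
MESSAGE = (
    'Apply failed with 2 conflicts: conflicts with "python-httpx" '
    "using rbac.authorization.k8s.io/v1:\n"
    "- .metadata.labels.app\n"
    "- .metadata.labels.app.kubernetes.io/name"
)


def parse_apply_conflict(message):
    """Return (manager, api_version, fields), or None if nothing matches.

    Hypothetical helper: it only understands the message shape seen in
    the log above.
    """
    header = re.search(r'conflicts with "([^"]+)" using (\S+?):', message)
    if header is None:
        return None
    # Each contested field is listed on its own "- <path>" line.
    fields = re.findall(r"^- (\S+)$", message, flags=re.MULTILINE)
    return header.group(1), header.group(2), fields


if __name__ == "__main__":
    print(parse_apply_conflict(MESSAGE))
```

Seeing a manager other than lightkube on the labels fits the explanation above: the ClusterRole already existed before kserve-controller tried to apply it.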

Proposed solution

Option 1: If you need Charmed Kubeflow 1.6/stable alongside kserve 0.10/stable, you could:

  1. Deploy Charmed Kubeflow 1.6/stable as normal
  2. kubectl delete clusterroles kubeflow-kserve-edit kubeflow-kserve-view kubeflow-kserve-admin
  3. juju deploy kserve-controller --channel 0.10/stable
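
Steps 2 and 3 can be sketched as a small dry-run script that only builds the commands and prints them, so they can be reviewed before anything is deleted. The ClusterRole names are the three from the explanation above. `option_1_commands` and its `trust` parameter are hypothetical names introduced for illustration.

```python
# ClusterRole names taken from the explanation earlier in this comment.
KSERVE_CLUSTER_ROLES = [
    "kubeflow-kserve-edit",
    "kubeflow-kserve-view",
    "kubeflow-kserve-admin",
]


def option_1_commands(channel="0.10/stable", trust=False):
    """Build the argv lists for steps 2 and 3 without running them."""
    delete = ["kubectl", "delete", "clusterroles", *KSERVE_CLUSTER_ROLES]
    deploy = ["juju", "deploy", "kserve-controller", "--channel", channel]
    if trust:
        # The deploy command quoted in the issue description used --trust.
        deploy.append("--trust")
    return [delete, deploy]


if __name__ == "__main__":
    for argv in option_1_commands():
        print(" ".join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` once the output looks right; that step is left out here on purpose.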

Option 2: If you are upgrading from Charmed Kubeflow 1.6 to 1.7:

  1. Deploy Charmed Kubeflow 1.6/stable as normal
  2. juju remove-application kubeflow-roles
  3. juju deploy kubeflow-roles --channel 1.7/stable <- this version won't create the conflicting clusterRoles
  4. juju deploy kserve-controller --channel 0.10/stable <- this charm will create its own clusterRoles
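
Before step 4 it may be worth checking that none of the three ClusterRoles is still present, assuming that removing the 1.6 kubeflow-roles charm is what clears them. A minimal sketch: `leftover_roles` is a hypothetical helper that takes the text printed by `kubectl get clusterroles -o name`, and the sample input below is illustrative, not output from a real cluster.

```python
# The ClusterRoles that kserve-controller 0.10/stable wants to create itself.
CONFLICTING = {
    "kubeflow-kserve-edit",
    "kubeflow-kserve-view",
    "kubeflow-kserve-admin",
}


def leftover_roles(get_output):
    """Return the conflicting ClusterRole names found in `get_output`.

    `get_output` is expected to hold one
    `clusterrole.rbac.authorization.k8s.io/<name>` entry per line.
    """
    names = {
        line.strip().rpartition("/")[2]
        for line in get_output.splitlines()
        if line.strip()
    }
    return sorted(names & CONFLICTING)


# Illustrative sample only; "example-role" is a made-up name.
SAMPLE = """\
clusterrole.rbac.authorization.k8s.io/example-role
clusterrole.rbac.authorization.k8s.io/kubeflow-kserve-admin
"""

if __name__ == "__main__":
    print(leftover_roles(SAMPLE))
```

An empty result suggests the kserve-controller install should not hit the 409 conflict on these roles; a non-empty one points at roles that still need to be removed.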

i-chvets commented 1 year ago

@DnPlas Do you want to proceed with Option 2 for the upgrade guide?

DnPlas commented 1 year ago

I think it's fair to proceed with Option 2. Let's just mention that some applications may experience downtime until the new kubeflow-roles charm is up.

i-chvets commented 1 year ago

The guide has been updated. I will publish it to Discourse next.

i-chvets commented 1 year ago

Published with workaround Option 2: https://discourse.charmhub.io/t/how-to-upgrade-kubeflow-from-1-6-to-1-7/9367