canonical / seldon-core-operator

Seldon Core Operator
Apache License 2.0

Deployment stuck with ReplicaSet "X" is progressing #218

Open orfeas-k opened 9 months ago

orfeas-k commented 9 months ago

Bug Description

While updating the seldon charm to 1.17.1 in https://github.com/canonical/seldon-core-operator/pull/216, we ran into the following bug when running the integration tests.

The deployment created by the applied seldondeployment gets stuck with the condition ReplicaSet "X" is progressing (as a result, it never becomes ready). The odd part is that the underlying ReplicaSet creates a pod successfully and its status shows readyReplicas: 1. At the same time, the deployment has an observedGeneration of 7 (or thereabouts) while the ReplicaSet's observedGeneration is 1.

This looks like the same issue we've hit when trying to update Seldon ROCKs, described in this issue's comments: https://github.com/canonical/seldonio-rocks/issues/37#issuecomment-1716758029.

Debugging

During debugging we tried to apply the aforementioned seldondeployment manually with kubectl apply -f yaml and observed the same behavior.

Here are the YAML outputs of the two namespaces:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    controller.juju.is/id: ec8f226d-bdc2-45de-891d-7cc8b8f501ff
    model.juju.is/id: 73fb3200-e151-46b2-8e87-f814f48f1715
  creationTimestamp: "2023-10-03T08:46:52Z"
  labels:
    app.kubernetes.io/managed-by: juju
    kubernetes.io/metadata.name: test-charm-vmxo
    model.juju.is/name: test-charm-vmxo
    serving.kubeflow.org/inferenceservice: enabled
  name: test-charm-vmxo
  resourceVersion: "439145"
  uid: 82929f14-80fc-4049-b4ab-c74c93ec0e30
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

and

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2023-10-02T08:39:37Z"
  labels:
    kubernetes.io/metadata.name: default
  name: default
  resourceVersion: "482729"
  uid: 0cf53d5e-b0f5-482e-88a2-6be71e24fe02
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
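A quick way to see what the juju model namespace adds on top of a vanilla one: the metadata above differs only in the juju labels and annotations. A small sketch (plain Python dicts transcribed from the YAML above; this is a key-level diff, so the shared kubernetes.io/metadata.name key does not show up even though its value differs):

```python
# Metadata of the two namespaces, transcribed from the YAML output above.
juju_ns = {
    "annotations": {
        "controller.juju.is/id": "ec8f226d-bdc2-45de-891d-7cc8b8f501ff",
        "model.juju.is/id": "73fb3200-e151-46b2-8e87-f814f48f1715",
    },
    "labels": {
        "app.kubernetes.io/managed-by": "juju",
        "kubernetes.io/metadata.name": "test-charm-vmxo",
        "model.juju.is/name": "test-charm-vmxo",
        "serving.kubeflow.org/inferenceservice": "enabled",
    },
}
default_ns = {
    "annotations": {},
    "labels": {"kubernetes.io/metadata.name": "default"},
}

def metadata_diff(a, b):
    """Return the label/annotation keys present in a but not in b."""
    return {
        field: sorted(set(a[field]) - set(b[field]))
        for field in ("annotations", "labels")
    }

print(metadata_diff(juju_ns, default_ns))
```

Everything the diff reports is juju-managed metadata, which is consistent with the observation below that only the juju-created namespace misbehaves.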

Tests run in the namespace created by a test juju model. In that namespace, they apply a custom resource, which in turn creates a deployment. The issue is that, while the ReplicaSet creates a pod successfully and its status shows readyReplicas: 1, the deployment gets stuck with the condition ReplicaSet "X" is progressing (as a result, it is never ready). The same thing happens if I apply the custom resource manually in the testing namespace. However, if I apply it to the default namespace, the Deployment goes to Ready as expected.

To Reproduce

  1. Checkout the PR's branch
  2. Set up cluster with juju 2.9.45 and Microk8s 1.24 (tried also with 1.26 but issue persists)
  3. Run integration tests with tox -e charm-integration or tox -e seldon-servers-integration

Environment

Relevant log output

Doing `microk8s inspect` and looking at `snap.microk8s.daemon-kubelite/journal.log` we noticed this error during the deployment creation

Oct 03 08:51:43 ip-172-31-36-226 microk8s.daemon-kubelite[39625]: E1003 08:51:43.873775   39625 fieldmanager.go:211] "[SHOULD NOT HAPPEN] failed to update managedFields" VersionKind="/, Kind=" namespace="test-charm-vmxo" name="seldon-model-1-example-0-classifier"
[..]
(also a bunch of those)
Oct 03 08:51:47 ip-172-31-36-226 microk8s.daemon-kubelite[39625]: E1003 08:51:47.585896   39625 deployment_controller.go:495] Operation cannot be fulfilled on replicasets.apps "seldon-model-1-example-0-classifier-9df54f658": the object has been modified; please apply your changes to the latest version and try again

`$ kubectl get deployments -n kubeflow seldon-model-1-example-0-classifier -o yaml`
[...]
status:
  conditions:
  - lastTransitionTime: "2023-09-29T13:15:00Z"
    lastUpdateTime: "2023-09-29T13:15:00Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2023-09-29T13:15:00Z"
    lastUpdateTime: "2023-09-29T13:15:00Z"
    message: ReplicaSet "seldon-model-1-example-0-classifier-b7dfcbcb5" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  observedGeneration: 7
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1

`$ kubectl get replicaset -n kubeflow seldon-model-1-example-0-classifier-b7dfcbcb5 -o yaml`
[..]
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1

Additional context

No response

orfeas-k commented 9 months ago

Tried running the tests on EKS too. The situation was a bit (but not much) different there. The deployment goes to Ready 1/1 with the following conditions:

status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-10-04T11:12:04Z"
    lastUpdateTime: "2023-10-04T11:12:04Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2023-10-04T11:10:24Z"
    lastUpdateTime: "2023-10-04T11:12:04Z"
    message: ReplicaSet "seldon-model-1-example-0-classifier-5b844bbc69" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 2249
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

However, observedGeneration keeps increasing forever. This presumably results in the SeldonDeployment (the custom resource that created the aforementioned deployment) getting stuck at state: Creating.

I can see a bunch of these errors here too:

2023-10-04T12:48:53.557474818Z stdout F 2023-10-04T12:48:53.557Z [seldon-core] {"level":"error","ts":1696423733.5572436,"logger":"controller.seldon-controller-manager","msg":"Reconciler error","reconciler group":"machinelearning.seldon.io","reconciler kind":"SeldonDeployment","name":"seldon-model-1","namespace":"kubeflow","error":"Operation cannot be fulfilled on deployments.apps \"seldon-model-1-example-0-classifier\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
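The "object has been modified" message is the standard Kubernetes optimistic-concurrency conflict: the writer submitted an update based on a stale resourceVersion. Controllers normally resolve this by re-reading the latest object and retrying the write. A minimal, self-contained sketch of that pattern (toy in-memory store standing in for the API server; this is not the seldon-core controller's actual code):

```python
import time

class ConflictError(Exception):
    """Stand-in for a Kubernetes 409 'object has been modified' response."""

def update_with_retry(read, mutate, write, attempts=5):
    """Generic optimistic-concurrency loop: re-read the latest object
    and retry when the server rejects a stale resourceVersion."""
    for attempt in range(attempts):
        obj = read()  # fetch the latest resourceVersion
        try:
            return write(mutate(obj))
        except ConflictError:
            time.sleep(0.01 * 2 ** attempt)  # brief backoff between retries
    raise RuntimeError("still conflicting after retries")

# Toy in-memory "API server" to demonstrate the pattern.
store = {"resourceVersion": 1, "replicas": 1}
interference = {"left": 1}

def read():
    obj = dict(store)
    if interference["left"]:  # simulate one concurrent writer racing us
        interference["left"] -= 1
        store["resourceVersion"] += 1
    return obj

def write(obj):
    if obj["resourceVersion"] != store["resourceVersion"]:
        raise ConflictError  # stale resourceVersion -> 409
    store.update(obj)
    store["resourceVersion"] += 1
    return store

updated = update_with_retry(read, lambda o: {**o, "replicas": 2}, write)
print(updated)
```

A conflict now and then is therefore normal; the problem in this issue is that the conflicts (and generation bumps) never stop, which suggests two writers repeatedly undoing each other's changes.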
orfeas-k commented 9 months ago

Namespaces

Tried applying the same seldondeployment in a newly created namespace and the deployment goes to Ready there too. Thus, we conclude that the deployment works in any namespace other than the model namespace.

orfeas-k commented 9 months ago

For now, we've implemented a workaround in our tests in #220 where we apply seldondeployments to the default namespace.

orfeas-k commented 9 months ago

Workaround

To summarize the above: if seldondeployments applied to the namespace created by the juju model fail to reach state: Ready, one should create and apply them in a different namespace than the one created by juju (kubeflow for CKF deployments).

lpfann commented 9 months ago

We ran into this today with seldon-core 1.16.0. The Deployment is also stuck in Progressing state and keeping the old replica set with a pod running.

  conditions:
    - type: Available
      status: 'True'
      lastUpdateTime: '2023-10-17T14:12:11Z'
      lastTransitionTime: '2023-10-17T14:12:11Z'
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.
    - type: Progressing
      status: 'True'
      lastUpdateTime: '2023-10-25T13:42:27Z'
      lastTransitionTime: '2023-10-17T12:52:30Z'
      reason: NewReplicaSetCreated
      message: >-
        Created new replica set
        "model-name-predictor-0-model-5dbd4c788d"

For us the reason is a bit different: NewReplicaSetCreated. Deleting the old ReplicaSet does not get it unstuck.