SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

V1: Error syncing deployment - Operation cannot be fulfilled on replicasets.apps #5082

Open stephen37 opened 1 year ago

stephen37 commented 1 year ago

Describe the bug

When deploying a new SeldonDeployment and expecting the pods to be restarted with the new version, not all pods are recreated with the new version, and kube-controller-manager logs the following error:

"Error syncing deployment" deployment="analytics/name_of_the_sdep" err="Operation cannot be fulfilled on replicasets.apps \"pod-name-7894b6f9dd\": the object has been modified; please apply your changes to the latest version and try again"

I haven't checked with Seldon Core V2. This is on Seldon Core V1, and it has been happening since Seldon Core 1.16.0.

To reproduce

  1. Deploy a model in K8s with multiple pods running
  2. Redeploy the model with a change, for example another Docker image tag
  3. Observe that not all pods are updated
  4. kube-controller-manager logs the error mentioned above

Expected behaviour

All pods should be updated to the latest version of the Docker image, but that is not the case:

> kubectl get pods | grep <name>
name-0-main-7894b6f9dd-2nqs7            2/2     Running       0              31h
name-0-main-7894b6f9dd-8knxc            2/2     Running       0              21m
name-0-main-7894b6f9dd-b7d9m            2/2     Running       0              31h
name-0-main-7894b6f9dd-b9tlr            2/2     Running       0              31h
name-0-main-7894b6f9dd-ddsp9            2/2     Running       0              21m
name-0-main-7894b6f9dd-dlq9j            2/2     Running       0              21m

They should all be 21m old, but for some of them the deployment sync errored and the old pods (31h) were never replaced.
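To spot the leftover pods without reading the listing by eye, the AGE column can be parsed and compared against the time since the rollout. This is only a rough sketch: the helper names `age_to_seconds` and `stale_pods` are made up for illustration, and it assumes the default `kubectl get pods` column layout (AGE as the last column, ages built from `d`, `h`, `m`, `s` units).

```python
import re

# Seconds per unit for the AGE values handled here (assumption: d/h/m/s only).
_UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert a kubectl AGE value such as '31h' or '3m47s' to seconds."""
    if not re.fullmatch(r"(?:\d+[dhms])+", age):
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(n) * _UNITS[u] for n, u in re.findall(r"(\d+)([dhms])", age))


def stale_pods(kubectl_output: str, max_age_seconds: int) -> list[str]:
    """Return the names of pods whose AGE exceeds max_age_seconds."""
    stale = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row if present.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, age = fields[0], fields[-1]
        if age_to_seconds(age) > max_age_seconds:
            stale.append(name)
    return stale


# The listing from this issue, used as sample input.
SAMPLE = """\
name-0-main-7894b6f9dd-2nqs7            2/2     Running       0              31h
name-0-main-7894b6f9dd-8knxc            2/2     Running       0              21m
name-0-main-7894b6f9dd-b7d9m            2/2     Running       0              31h
name-0-main-7894b6f9dd-b9tlr            2/2     Running       0              31h
name-0-main-7894b6f9dd-ddsp9            2/2     Running       0              21m
name-0-main-7894b6f9dd-dlq9j            2/2     Running       0              21m
"""

if __name__ == "__main__":
    # The rollout happened about 21 minutes ago, so anything older than an hour is suspect.
    for pod in stale_pods(SAMPLE, max_age_seconds=3600):
        print(pod)
```

With the listing above and a one-hour threshold, this flags the three pods that are still 31h old.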

Environment

Vavinash-github commented 1 year ago

Hi @stephen37 @cliveseldon, I'm facing a similar issue: old pods are not deleted once the new pods are up. Has this been resolved in later versions of Seldon?

stephen37 commented 1 year ago

Hey,

The solution we found is to always define the number of replicas explicitly in the SeldonDeployment. That way the pods are always updated.
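If you want to enforce this workaround before applying a manifest, a small check like the one below could run in CI. It is a sketch under assumptions: the function name `predictors_without_replicas` is made up, the manifest is assumed to be already parsed into a dict (for example from JSON), and the field paths `spec.replicas` and `spec.predictors[].replicas` follow the Seldon Core V1 SeldonDeployment layout as I understand it. The example manifest and image name are hypothetical.

```python
def predictors_without_replicas(manifest: dict) -> list[str]:
    """Return the names of predictors that do not set replicas explicitly.

    Assumption: a top-level spec.replicas acts as a default for all
    predictors, so its presence counts as an explicit setting.
    """
    spec = manifest.get("spec", {})
    if isinstance(spec.get("replicas"), int):
        return []
    missing = []
    for predictor in spec.get("predictors", []):
        if not isinstance(predictor.get("replicas"), int):
            missing.append(predictor.get("name", "<unnamed>"))
    return missing


# Hypothetical manifest; names and image are placeholders, not taken from this issue.
EXAMPLE = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "example-model", "namespace": "analytics"},
    "spec": {
        "predictors": [
            {
                "name": "main",
                "replicas": 6,  # set explicitly, as suggested in the comment above
                "graph": {"name": "classifier", "type": "MODEL"},
                "componentSpecs": [
                    {
                        "spec": {
                            "containers": [
                                {"name": "classifier", "image": "example/model:2.0"}
                            ]
                        }
                    }
                ]
            }
        ]
    },
}

if __name__ == "__main__":
    missing = predictors_without_replicas(EXAMPLE)
    if missing:
        raise SystemExit(f"predictors without explicit replicas: {missing}")
    print("all predictors set replicas explicitly")
```

Note that this only checks the manifest; as a later comment in this thread shows, a fixed replica count did not resolve the problem in every setup.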

bcvanmeurs commented 8 months ago

We have a very similar issue with the same logs: the controller keeps saying that the deployments are the same and then tries to reconcile anyway. Nothing actually seems to happen and the model seems fine, although Argo CD reports the deployment as stuck. I don't fully understand what is going on, since the deployment is similar to all our other deployments. We have set the number of replicas to a fixed number, but it still happens. Any thoughts on why the operator might think there are duplicate deployments and services?

When I describe the SeldonDeployment I see this:

  Type    Reason   Age                        From                       Message
  ----    ------   ----                       ----                       -------
  Normal  Updated  3m47s (x1451532 over 20h)  seldon-controller-manager  Updated SeldonDeployment "xxx"

Edit: I found the cause and reported it here: https://github.com/SeldonIO/seldon-core/issues/5435