SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/
Other
4.37k stars 831 forks source link

missing "serviceStatus" field in "status" of sdep resource #4689

Open caozhuozi opened 1 year ago

caozhuozi commented 1 year ago

Describe the bug

We want to use the serviceStatus field of status of a sdep resource to get the svc info of all predictors. But sometimes it is missing in the sdep resource. Could you please help clarify when this field will be shown and when will not?

To reproduce

just kubectl apply a simple sdep yaml file to create a sdep and then kubectl get that sdep.

logs of seldon controller pod ("Updating Service" always happens in every reconcile): image image and there is no "Found identical Service" message found in the logs.

A sample svc created by seldon core(removed some data-sensitive part):

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-03-06T07:08:38Z"
  labels:
    app.kubernetes.io/managed-by: seldon-core
    seldon-app: ytseng-test-a5f13-default
    seldon-deployment-id: ytseng-test-a5f13
  name: ytseng-test-a5f13-default
  namespace: seldon
  ownerReferences:
  - apiVersion: machinelearning.seldon.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SeldonDeployment
    name: ytseng-test-a5f13
    uid: 0b9e47ee-2abd-4e06-b315-ba85e453f8b8
  resourceVersion: "1020699785"
  uid: 348237d4-eeda-4d70-8440-a561aec74182
spec:
  clusterIP: 192.168.38.69
  clusterIPs:
  - 192.168.38.69
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 8000
    protocol: TCP
    targetPort: 8000
  - name: grpc
    port: 5001
    protocol: TCP
    targetPort: 5001
  selector:
    seldon-app: ytseng-test-a5f13-default
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

I think these fields are not required by seldon-core.

clusterIPs:
  - 192.168.38.69
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack

the corresponding sdep :

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  annotations:
    seldon.io/istio-host: aiplatform.dev51.cbf.dev.XXXX.com
  creationTimestamp: "2023-03-06T07:08:37Z"
  generation: 1
  name: ytseng-test-a5f13
  namespace: seldon
  resourceVersion: "1022247668"
  uid: 0b9e47ee-2abd-4e06-b315-ba85e453f8b8
spec:
  annotations:
    seldon.io/engine-seldon-log-messages-externally: "true"
    seldon.io/istio-host: aiplatform.dev51.cbf.dev.XXXX.com
  name: ytseng-test-a5f13
  predictors:
  - componentSpecs:
    - hpaSpec:
        maxReplicas: 1
        metrics:
        - resource:
            name: cpu
            targetAverageUtilization: 40
          type: Resource
        minReplicas: 1
      metadata:
        creationTimestamp: "2023-03-05T23:08:21Z"
      spec:
        containers:
        - env:
          - name: MLSERVER_MODEL_IMPLEMENTATION
            value: mlserver_ctranslate2.Ctranslate2Runtime
          name: ytseng-test-a5f13
          resources:
            limits:
              cpu: "8"
              memory: 8Gi
            requests:
              cpu: 100m
              memory: 1Gi
          securityContext:
            runAsUser: 0
    graph:
      implementation: CTRANSLATE2_SERVER
      modelUri: /tmp
      name: ytseng-test-a5f13
      parameters:
      - name: pretrained_model
        type: STRING
        value: /mnt/aiml_models
      - name: pretrained_tokenizer
        type: STRING
        value: /mnt/aiml_models/tokenizer
      type: MODEL
    name: default
    replicas: 1
    svcOrchSpec: {}
  protocol: v2
status:
  address:
    url: http://ytseng-test-a5f13-default.seldon.svc.cluster.local:8000/v2/models/ytseng-test-a5f13/infer
  conditions:
  - lastTransitionTime: "2023-03-07T00:51:26Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: DeploymentsReady
  - lastTransitionTime: "2023-03-06T07:08:38Z"
    reason: All HPAs resources ready
    status: "True"
    type: HpasReady
  - lastTransitionTime: "2023-03-06T07:08:38Z"
    reason: No KEDA resources defined
    status: "True"
    type: KedaReady
  - lastTransitionTime: "2023-03-06T07:08:38Z"
    reason: No PDBs defined
    status: "True"
    type: PdbsReady
  - lastTransitionTime: "2023-03-07T00:51:26Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-03-07T00:51:26Z"
    reason: All services created
    status: "True"
    type: ServicesReady
  - lastTransitionTime: "2023-03-06T07:09:43Z"
    reason: All VirtualServices created
    status: "True"
    type: istioVirtualServicesReady
  deploymentStatus:
    ytseng-test-a5f13-default-0-ytseng-test-a5f13:
      availableReplicas: 1
      replicas: 1
  replicas: 1
  state: Available

Expected behaviour

The serviceStatus field SHOULD be shown in status if the sdep is created successfully.

Environment

seldon-core version: 1.14.0

Model Details

I think it is model-independent.

ukclivecox commented 1 year ago

As discussed on community slack - this seems not to be an issue. Please reopen if its still an issue.

caozhuozi commented 1 year ago

Let me share my findings here. Please correct me if I'm wrong. (The line numbers given following are based on the master branch of seldondeployment_controller.go).

related variables:

the wanted svc(the svc that seldon-core wants to create/update. I don't use the word “desired” here as there is also a related variable called desiredSvc): first appears at line 1241

the found one(the svc may already exist, but it may have some differences from the wanted one): first appears at line 1245

desiredSvc (which is actually a deep copy of the found one): first appear sat line 1279

The case I encountered that would result in the difference between the found one and the wanted svc:

When creating a new svc, there will be also a clusterIPs field (and some other fields) set by default by my Kubernetes(v1.22.17-gke.3100). This field does not exist in the wanted svc but appears in the found one.

The above difference may lead to infinite re-updating in every reconcile:

Because of the differences between the found one and the wanted svc, the equality.Semantic.DeepEqual check at line 1262 will return false as expected.

After reassigning the wanted svc to the found one at line 1281, the found currently loses the field clusterIPs. And this will cause infinite re-update in every reconcile if there were no logic(at lines 1290-1302) to handle this situation. The logic intentionally makes ready to be false if the above situation happens. However, the problem comes at line 1290. When the above situation happens, the check statement equality.Semantic.DeepEqual(desiredSvc.Spec, found.Spec) will actually always return true. The reason is that found itself actually is also updated at line 1284(evidence: you can check the source code of the methodUpdate where there exists a Into(obj) statement. You can then check the doc of Into method, it says: Into stores the result into obj).

So the desiredSvc (which is a deep copy of found) may always be just compared to its newer counterpart. Thus, the comparison statement at line 1290 will just always return true and make the code goes to the else clause at line 1300, preventing the ready to be false. The infinite re-updating still happens in every reconcile.

caozhuozi commented 1 year ago

Can I raise a PR to fix it if the above analysis is correct?

ukclivecox commented 1 year ago

@caozhuozi Yes please raise a PR