Problem with job id - re-creating an instance won't be able to reconcile

Haishi2016 commented 1 week ago

As now the code checks for job id freshness, when an instance object is deleted and then recreated it can't be reconciled anymore because there's older summary with higher job id.

Haishi2016 commented 1 week ago

I think the key design error here is to save the job id into the instance object annotations. Instance shouldn't care about that job id and saving the job id in it causes such condition.

msftcoderdjw commented 1 week ago

@Haishi2016 , I cannot repro this issue. I have tried to delete the instance and recreate it, the reconciliation still works. My repro steps are appended in the last.

Please note, when deleting the instance, the corresponding summary will be deleted as well. So, when creating new instance, the summaryJobIdKey in newly-created instance will be set to 1, and there's no legacy instance summary, which won't block the new reconciliation.

We saved the job id to the instance object annotations, and it will be changed at each time the requeue happens, so it is the similar route as queuing deployment job. And when deletion, both the instance object and summary will be deleted.

My repro steps:

First creation - instance is created and the summaryJobIdKey is 1.

kubectl describe instance sample-prometheus-instance
Name:         sample-prometheus-instance
Namespace:    default
Labels:       solution=sample-prometheus-server-v-v1
              target=sample-k8s-target
Annotations:  SummaryJobIdKey: 1
              instance.solution.symphony/started-at: 2024-11-15T02:22:31Z
              management.azure.com/correlationId: 6869d8b5-a0f1-4f44-b2c0-62f7e1cd4d9b
              management.azure.com/runningCorrelationId: 6869d8b5-a0f1-4f44-b2c0-62f7e1cd4d9b
API Version:  solution.symphony/v1
Kind:         Instance
Metadata:
  Creation Timestamp:  2024-11-15T02:22:31Z
  Finalizers:
    instance.solution.symphony/finalizer
  Generation:        1
  Resource Version:  3345
  UID:               574a53b9-3b0f-4cd3-9a57-cd41a007a6c9
Spec:
  Display Name:  sample-prometheus-instance
  Scope:         sample-k8s-scope
  Solution:      sample-prometheus-server:v1
  Target:
    Name:  sample-k8s-target
Status:
  Last Modified:  2024-11-15T02:23:01Z
  Properties:
    Deployed:                                            1
    Expected Running Job Id:                             1
    Generation:                                          1
    Running Job Id:                                      1
    Status:                                              Succeeded
    Status - Details:
    Targets:                                             1
    targets.sample-k8s-target:                           OK -
    targets.sample-k8s-target.sample-prometheus-server:  Untouched - No error. sample-prometheus-server is untouched
  Provisioning Status:
    Error:
    Operation Id:
    Output:
      sample-k8s-target.sample-prometheus-server:  Untouched
    Percent Complete:                              100
    Status:                                        Succeeded
Events:                                            <none>

Modify the solution and the instance reconciliation happens again - instance is requeued and the summaryJobIdKey is 2.

Name:         sample-prometheus-instance
Namespace:    default
Labels:       solution=sample-prometheus-server-v-v1
          target=sample-k8s-target
Annotations:  SummaryJobIdKey: 2
          instance.solution.symphony/started-at: 2024-11-15T02:24:42Z
          management.azure.com/correlationId: c5ab4146-a1de-4839-983e-a34070a44134
          management.azure.com/runningCorrelationId: c5ab4146-a1de-4839-983e-a34070a44134
API Version:  solution.symphony/v1
Kind:         Instance
Metadata:
Creation Timestamp:  2024-11-15T02:22:31Z
Finalizers:
instance.solution.symphony/finalizer
Generation:        1
Resource Version:  3526
UID:               574a53b9-3b0f-4cd3-9a57-cd41a007a6c9
Spec:
Display Name:  sample-prometheus-instance
Scope:         sample-k8s-scope
Solution:      sample-prometheus-server:v1
Target:
Name:  sample-k8s-target
Status:
Last Modified:  2024-11-15T02:24:52Z
Properties:
Deployed:                                            0
Expected Running Job Id:                             2
Generation:                                          1
Running Job Id:                                      2
Status:                                              Reconciling
Status - Details:                                    1 total deployments on 1 targets, current completed 0 deployments.
Targets:                                             1
targets.sample-k8s-target:                           OK -
targets.sample-k8s-target.sample-prometheus-server:  Untouched
Provisioning Status:
Error:
Operation Id:
Status:        Reconciling
Events:            <none>

Delete the instance and recreate new one - instance is newly created and the summaryJobIdKey is 1.

$  kubectl delete instance sample-prometheus-instance
instance.solution.symphony "sample-prometheus-instance" deleted
$ kubectl apply -f instance.yaml
instance.solution.symphony/sample-prometheus-instance created

kubectl get instance -o yaml
apiVersion: v1
items:
- apiVersion: solution.symphony/v1
  kind: Instance
  metadata:
    annotations:
      SummaryJobIdKey: "1"
      instance.solution.symphony/started-at: "2024-11-15T02:27:23Z"
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"solution.symphony/v1","kind":"Instance","metadata":{"annotations":{},"name":"sample-prometheus-instance","namespace":"default"},"spec":{"scope":"sample-k8s-scope","solution":"sample-prometheus-server:v1","target":{"name":"sample-k8s-target"}}}
      management.azure.com/correlationId: 05be6f2d-f2f8-4026-88d8-ea4d7e82317a
      management.azure.com/runningCorrelationId: 05be6f2d-f2f8-4026-88d8-ea4d7e82317a
    creationTimestamp: "2024-11-15T02:27:23Z"
    finalizers:
    - instance.solution.symphony/finalizer
    generation: 1
    labels:
      solution: sample-prometheus-server-v-v1
      target: sample-k8s-target
    name: sample-prometheus-instance
    namespace: default
    resourceVersion: "3841"
    uid: c9080390-4a79-424f-b592-d2b8fa952403
  spec:
    displayName: sample-prometheus-instance
    scope: sample-k8s-scope
    solution: sample-prometheus-server:v1
    target:
      name: sample-k8s-target
  status:
    lastModified: "2024-11-15T02:27:33Z"
    properties:
      deployed: "1"
      expectedRunningJobId: "1"
      generation: "1"
      runningJobId: "1"
      status: Succeeded
      status-details: ""
      targets: "1"
      targets.sample-k8s-target: 'OK - '
      targets.sample-k8s-target.sample-prometheus-server: Untouched - No error. sample-prometheus-server
        is untouched
    provisioningStatus:
      error: {}
      operationId: ""
      output:
        sample-k8s-target.sample-prometheus-server: Untouched
      percentComplete: 100
      status: Succeeded
kind: List
metadata:
  resourceVersion: ""

msftcoderdjw commented 1 week ago

Also + @FireDefend , @RemindD to be aware of this.

RemindD commented 1 week ago

Hi, @Haishi2016 . As @msftcoderdjw mentioned above, deleting an instance would clear its summary and won't block the reconciliation of a new instance with the same name in the happy scenario.

However, there is another issue that may be related. For example, we have an update call and then a delete call on the same instance. These two calls are received by symphony API server around the same time, so it is possible that delete is executed first and then update. Although the delete call has a higher job id, it deletes the summary so the update call can't tell it's actually an outdated call.

The consequence of the issue is that an orphan deployment could exist when instance is already gone. And it may block reconciliation of re-created instance.

Please let me if you have any different opinion.

msftcoderdjw commented 1 week ago

Add @linyguo FYI.

linyguo commented 1 week ago

Hi @Haishi2016, I was able to repro the issue when creating and updating targets. I recorded the root cause here: https://github.com/eclipse-symphony/symphony/issues/561. Is this the same issue that you mentioned? Please let me know if I missed anything.

msftcoderdjw commented 1 week ago

I also observed another issue in current remote agent case, which will cause problem in instance recreation when this instance maps to a remote target. I feel this might be the similar pattern @Haishi2016 observed. Please correct me if I am wrong.

Briefly speaking, this is related symphony agent now also use the same logic as symphony control plane, it will also save summary in its memory but it is never cleaned up even if the instance is cleaned up in control cluster.

When instance is deleted, the summary in control cluster will be deleted. Once the new instance is created, there's no summary in control cluster, so the summaryJobId check in control cluster side will pass. But because the orphan data is leaked in agent side (memory summary), agent side will refuse to do the deployment because the new job is lower than the summaryJobId in leaked summary in agent, which caused the problem @Haishi2016 observed.

@RemindD to comment if I misunderstood current summaryJobId logic.

I think there's no need to keep the summary in agent side. In agent, it should be stateless.

eclipse-symphony / symphony

Problem with job id - re-creating an instance won't be able to reconcile #558