Open Haishi2016 opened 1 week ago
Now that the code checks job id freshness, when an instance object is deleted and then recreated it can no longer be reconciled, because an older summary with a higher job id still exists.
I think the key design error here is saving the job id into the instance object annotations. The instance shouldn't care about that job id, and storing it there is what creates this condition.
@Haishi2016, I cannot repro this issue. I tried deleting the instance and recreating it, and reconciliation still works. My repro steps are appended at the end.
Please note that when the instance is deleted, the corresponding summary is deleted as well. So when a new instance is created, the SummaryJobIdKey annotation on it is set to 1, and there is no legacy instance summary left to block the new reconciliation.
We save the job id in the instance object annotations, and it changes each time a requeue happens, following the same route as queuing a deployment job. On deletion, both the instance object and the summary are deleted.
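To make that mechanism concrete, here is a minimal Go sketch of the job-id freshness check as I understand it from this thread. This is an illustration, not the actual Symphony code: the `Summary` type, the in-memory `summaryStore`, and the `shouldReconcile` helper are all assumed names modeled on the behavior described above.

```go
package main

import (
	"fmt"
	"strconv"
)

// Summary stands in for the per-instance deployment summary that the
// control plane persists and deletes together with the instance.
type Summary struct {
	JobID int64
}

// summaryStore maps instance name -> last persisted summary. Deleting an
// instance removes its entry, so a re-created instance starts fresh.
var summaryStore = map[string]*Summary{}

// shouldReconcile compares the job id carried in the instance annotations
// (bumped on every requeue) against the persisted summary. A job id that
// does not advance past the summary's is treated as stale and skipped.
func shouldReconcile(instance string, annotations map[string]string) (bool, error) {
	jobID, err := strconv.ParseInt(annotations["SummaryJobIdKey"], 10, 64)
	if err != nil {
		return false, fmt.Errorf("invalid job id annotation: %w", err)
	}
	summary, ok := summaryStore[instance]
	if !ok {
		// No summary left (e.g. the instance was deleted and re-created):
		// nothing blocks the new reconciliation.
		return true, nil
	}
	return jobID >= summary.JobID, nil
}

func main() {
	// Freshly created instance with job id 1 and no legacy summary.
	ok, _ := shouldReconcile("sample-prometheus-instance",
		map[string]string{"SummaryJobIdKey": "1"})
	fmt.Println(ok) // true: reconciliation proceeds
}
```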
My repro steps:
```
$ kubectl describe instance sample-prometheus-instance
Name:         sample-prometheus-instance
Namespace:    default
Labels:       solution=sample-prometheus-server-v-v1
              target=sample-k8s-target
Annotations:  SummaryJobIdKey: 1
              instance.solution.symphony/started-at: 2024-11-15T02:22:31Z
              management.azure.com/correlationId: 6869d8b5-a0f1-4f44-b2c0-62f7e1cd4d9b
              management.azure.com/runningCorrelationId: 6869d8b5-a0f1-4f44-b2c0-62f7e1cd4d9b
API Version:  solution.symphony/v1
Kind:         Instance
Metadata:
  Creation Timestamp:  2024-11-15T02:22:31Z
  Finalizers:
    instance.solution.symphony/finalizer
  Generation:        1
  Resource Version:  3345
  UID:               574a53b9-3b0f-4cd3-9a57-cd41a007a6c9
Spec:
  Display Name:  sample-prometheus-instance
  Scope:         sample-k8s-scope
  Solution:      sample-prometheus-server:v1
  Target:
    Name:  sample-k8s-target
Status:
  Last Modified:  2024-11-15T02:23:01Z
  Properties:
    Deployed:                 1
    Expected Running Job Id:  1
    Generation:               1
    Running Job Id:           1
    Status:                   Succeeded
    Status - Details:
    Targets:                  1
    targets.sample-k8s-target:                           OK -
    targets.sample-k8s-target.sample-prometheus-server:  Untouched - No error. sample-prometheus-server is untouched
  Provisioning Status:
    Error:
    Operation Id:
    Output:
      sample-k8s-target.sample-prometheus-server:  Untouched
    Percent Complete:  100
    Status:            Succeeded
Events:  <none>
```
Modify the solution, and instance reconciliation happens again: the instance is requeued and SummaryJobIdKey becomes 2.
```
Name:         sample-prometheus-instance
Namespace:    default
Labels:       solution=sample-prometheus-server-v-v1
              target=sample-k8s-target
Annotations:  SummaryJobIdKey: 2
              instance.solution.symphony/started-at: 2024-11-15T02:24:42Z
              management.azure.com/correlationId: c5ab4146-a1de-4839-983e-a34070a44134
              management.azure.com/runningCorrelationId: c5ab4146-a1de-4839-983e-a34070a44134
API Version:  solution.symphony/v1
Kind:         Instance
Metadata:
  Creation Timestamp:  2024-11-15T02:22:31Z
  Finalizers:
    instance.solution.symphony/finalizer
  Generation:        1
  Resource Version:  3526
  UID:               574a53b9-3b0f-4cd3-9a57-cd41a007a6c9
Spec:
  Display Name:  sample-prometheus-instance
  Scope:         sample-k8s-scope
  Solution:      sample-prometheus-server:v1
  Target:
    Name:  sample-k8s-target
Status:
  Last Modified:  2024-11-15T02:24:52Z
  Properties:
    Deployed:                 0
    Expected Running Job Id:  2
    Generation:               1
    Running Job Id:           2
    Status:                   Reconciling
    Status - Details:         1 total deployments on 1 targets, current completed 0 deployments.
    Targets:                  1
    targets.sample-k8s-target:                           OK -
    targets.sample-k8s-target.sample-prometheus-server:  Untouched
  Provisioning Status:
    Error:
    Operation Id:
    Status:  Reconciling
Events:  <none>
```
Delete the instance and recreate a new one: the instance is newly created and SummaryJobIdKey is 1.
```
$ kubectl delete instance sample-prometheus-instance
instance.solution.symphony "sample-prometheus-instance" deleted
$ kubectl apply -f instance.yaml
instance.solution.symphony/sample-prometheus-instance created
$ kubectl get instance -o yaml
apiVersion: v1
items:
- apiVersion: solution.symphony/v1
  kind: Instance
  metadata:
    annotations:
      SummaryJobIdKey: "1"
      instance.solution.symphony/started-at: "2024-11-15T02:27:23Z"
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"solution.symphony/v1","kind":"Instance","metadata":{"annotations":{},"name":"sample-prometheus-instance","namespace":"default"},"spec":{"scope":"sample-k8s-scope","solution":"sample-prometheus-server:v1","target":{"name":"sample-k8s-target"}}}
      management.azure.com/correlationId: 05be6f2d-f2f8-4026-88d8-ea4d7e82317a
      management.azure.com/runningCorrelationId: 05be6f2d-f2f8-4026-88d8-ea4d7e82317a
    creationTimestamp: "2024-11-15T02:27:23Z"
    finalizers:
    - instance.solution.symphony/finalizer
    generation: 1
    labels:
      solution: sample-prometheus-server-v-v1
      target: sample-k8s-target
    name: sample-prometheus-instance
    namespace: default
    resourceVersion: "3841"
    uid: c9080390-4a79-424f-b592-d2b8fa952403
  spec:
    displayName: sample-prometheus-instance
    scope: sample-k8s-scope
    solution: sample-prometheus-server:v1
    target:
      name: sample-k8s-target
  status:
    lastModified: "2024-11-15T02:27:33Z"
    properties:
      deployed: "1"
      expectedRunningJobId: "1"
      generation: "1"
      runningJobId: "1"
      status: Succeeded
      status-details: ""
      targets: "1"
      targets.sample-k8s-target: 'OK - '
      targets.sample-k8s-target.sample-prometheus-server: Untouched - No error. sample-prometheus-server
        is untouched
    provisioningStatus:
      error: {}
      operationId: ""
      output:
        sample-k8s-target.sample-prometheus-server: Untouched
      percentComplete: 100
      status: Succeeded
kind: List
metadata:
  resourceVersion: ""
```
Also adding @FireDefend and @RemindD to be aware of this.
Hi @Haishi2016. As @msftcoderdjw mentioned above, deleting an instance clears its summary and won't block reconciliation of a new instance with the same name in the happy path.
However, there is another issue that may be related. Suppose we have an update call and then a delete call on the same instance, and the two calls reach the Symphony API server at around the same time. It is then possible that the delete executes first and the update second. Although the delete call has a higher job id, it deletes the summary, so the update call has nothing to compare against and can't tell that it is actually outdated.
The consequence is that an orphan deployment can exist after the instance is already gone, and it may block reconciliation of a re-created instance. The sketch below illustrates the interleaving.
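Here is the interleaving in code form, reusing the toy `shouldReconcile` and `summaryStore` from the earlier Go sketch in this thread (again, assumed names, not Symphony's actual implementation):

```go
// Hypothetical interleaving of the update/delete race described above.
func raceExample() {
	// The instance exists with a persisted summary at job id 1.
	summaryStore["sample-instance"] = &Summary{JobID: 1}

	// An update (job id 2) and a delete (job id 3) arrive almost together.
	// The delete executes first: instance and summary are both removed.
	delete(summaryStore, "sample-instance")

	// The update (job id 2) executes second. With the summary gone, the
	// freshness check has nothing to compare against and passes, even
	// though job id 2 predates the delete's job id 3. The result is an
	// orphan deployment for an instance that no longer exists.
	ok, _ := shouldReconcile("sample-instance",
		map[string]string{"SummaryJobIdKey": "2"})
	fmt.Println(ok) // true: the stale update is not detected
}
```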
Please let me know if you have a different opinion.
Adding @linyguo FYI.
Hi @Haishi2016, I was able to repro the issue when creating and updating targets. I recorded the root cause here: https://github.com/eclipse-symphony/symphony/issues/561. Is this the same issue that you mentioned? Please let me know if I missed anything.
I also observed another issue in the current remote agent case, which causes problems with instance recreation when the instance maps to a remote target. I suspect this is the pattern @Haishi2016 observed; please correct me if I am wrong.
Briefly speaking, the Symphony agent now uses the same logic as the Symphony control plane: it also saves the summary in its memory, but that in-memory summary is never cleaned up, even when the instance is cleaned up in the control cluster.
When the instance is deleted, the summary in the control cluster is deleted. Once a new instance is created there is no summary in the control cluster, so the summaryJobId check on the control cluster side passes. But because the orphan data leaks on the agent side (the in-memory summary), the agent refuses to do the deployment: the new job id is lower than the summaryJobId in the leaked summary. That is what causes the problem @Haishi2016 observed.
I don't think there is any need to keep the summary on the agent side; the agent should be stateless.
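As a sketch of that direction (an illustration using the same assumed types as the earlier sketches, not the actual agent code), the agent could either keep no summary cache at all or at least mirror the control plane's cleanup when an instance is deleted:

```go
// Agent is a toy stand-in for the remote Symphony agent. The in-memory
// summaries cache is the state that currently leaks across instance
// lifetimes; a fully stateless agent would not keep it at all.
type Agent struct {
	summaries map[string]*Summary
}

// OnInstanceDeleted mirrors the control plane's cleanup on the agent side,
// so a re-created instance starting over at job id 1 is never rejected
// against a leaked higher job id.
func (a *Agent) OnInstanceDeleted(instance string) {
	delete(a.summaries, instance)
}
```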