Open cconnert opened 2 years ago
I have tried this health check but unfortunately it keeps waiting when hitting a sync window. Unfortunately we can't detect sync windows in the application manifest.
What exactly do you mean by sync windows (https://argo-cd.readthedocs.io/en/stable/user-guide/sync_windows/)? We currently do not use sync windows. But how do the original check and the other custom resource checks (https://github.com/argoproj/argo-cd/tree/master/resource_customizations) cope with sync windows?
We noticed that the app-of-apps might get stuck in "Progressing", but decided that is better than having an unsynced rollout in our scenario. I guess we can add a termination check for this case (retryCount and the retry limit should be processable).
Most resource customizations don't have to check sync windows, because they don't check ArgoCD Applications. Even the current check doesn't break on sync windows, because it doesn't check the sync state.
This change introduces the issue that the Application keeps progressing as long as it isn't in sync. That is correct in almost all scenarios I can think of, except when someone is using a sync window.
The sync policy, which can be retrieved with the ArgoCD CLI, isn't stored in the status of the Application. This makes it impossible to exclude those Applications from the proposed sync status check. I added an extra annotation to indicate that an Application uses a sync window and can be excluded from this new check.
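For illustration, the annotation approach might look like the fragment below. The annotation key is an assumption — the actual key was not named in this thread — and `my-app` is a placeholder name:

```yaml
# Hypothetical Application fragment; the annotation key is an example,
# not the one actually used by the commenter.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
  annotations:
    # Marks this Application as governed by a sync window, so a custom
    # health check could skip the "must be Synced" requirement for it.
    example.com/uses-sync-window: "true"
```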
Thanks for the explanation.
Imho, the sync of an Application can only be aborted by a controller and not within the custom resource check.
Anyhow, I enhanced the check to respect the retry limit and count (not tested yet):
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.health.argoproj.io_Application: |
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      if obj.status.health ~= nil then
        hs.status = obj.status.health.status
        if obj.status.health.message ~= nil then
          hs.message = obj.status.health.message
        end
        local syncStatus = (obj.status.sync and obj.status.sync.status or nil)
        if hs.status == "Healthy" and syncStatus ~= "Synced" then
          hs.status = "Progressing"
        end
        -- operationState lives under obj.status, not at the top level
        if obj.status.operationState ~= nil then
          local operationState = obj.status.operationState
          local retryLimit = (operationState.operation and operationState.operation.retry and operationState.operation.retry.limit or nil)
          local retryCount = operationState.retryCount
          -- guard against both values being nil, which would compare equal
          if retryLimit ~= nil and retryLimit == retryCount and hs.status == "Progressing" then
            hs.status = "Degraded"
            hs.message = "Retry limit reached while syncing the Application"
          end
        end
      end
    end
    return hs
```
Where did you add the extra annotation?
We added the annotation to each Application that has a sync window, so those Applications can be excluded from the check.
I see that we have a similar addition to check the operationState: we check whether the operationState phase is Failed.
```lua
if hs.status == "Healthy" and obj.status.operationState ~= nil and obj.status.operationState.phase == "Failed" then
  hs.status = "Degraded"
  hs.message = obj.status.operationState.message
end
```
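To make the fields these checks read more concrete, here is an illustrative fragment of what an Application's status can look like after a failed sync. The field names follow the ArgoCD Application status schema; the values are made-up examples:

```yaml
# Illustrative Application status fragment (values are examples only).
status:
  health:
    status: Healthy         # read by hs.status = obj.status.health.status
  sync:
    status: OutOfSync       # read by the syncStatus check
  operationState:
    phase: Failed           # read by the phase == "Failed" check
    message: "one or more objects failed to apply"
    retryCount: 5           # compared against operation.retry.limit
    operation:
      retry:
        limit: 5
```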
I am not sure if it is better to check the retry limit; I never tested that.
Thanks for the suggestion, I'll try out both approaches and report my results.
Hi,
so it turned out that we actually did not need to look at the retry count / operation state. Instead we added a check for error conditions:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.useOpenLibs.argoproj.io_Application: "true"
  resource.customizations.health.argoproj.io_Application: | # see https://argo-cd.readthedocs.io/en/stable/operator-manual/health/#argocd-app
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      local status = obj.status
      if status.conditions ~= nil then
        for _, condition in ipairs(status.conditions) do
          if condition.type ~= nil and string.match(condition.type, '.*Error$') then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
        end
      end
      if status.health ~= nil then
        local health = status.health
        hs.status = health.status
        if health.message ~= nil then
          hs.message = health.message
        end
        local syncStatus = (status.sync and status.sync.status or nil)
        if hs.status == "Healthy" and syncStatus ~= "Synced" then
          hs.status = "Progressing"
        end
      end
    end
    return hs
```
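As an illustration of what the `'.*Error$'` pattern matches, a failing Application can carry conditions like the fragment below. `ComparisonError` is one of ArgoCD's condition types; the message text here is a made-up example:

```yaml
# Illustrative status.conditions fragment (message is an example).
status:
  conditions:
    - type: ComparisonError   # matches '.*Error$' -> check returns Degraded
      message: "failed to compare desired state with live state"
    - type: SharedResourceWarning  # does not match, so it is ignored
      message: "resource is managed by multiple Applications"
```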
@cconnert can you update the OP with the updated Lua script and re-add the `return hs` at the end? Thank you!
@aslafy-z It got lost during copying, thanks for pointing it out. Updated the post.
Summary
The documentation of the ArgoCD App health check needs to be updated (https://argo-cd.readthedocs.io/en/stable/operator-manual/health/#argocd-app)
Motivation
The issue we encountered is that the given Lua script only considers the health of the resource but not the sync status. Thus sync waves were sometimes triggered even though Apps were not yet synced properly.
Proposal
We extended the Lua script to also consider the sync state as follows: