cconnert commented 2 years ago

Summary

The documentation of the Argcd App health check needs to be updated (https://argo-cd.readthedocs.io/en/stable/operator-manual/health/#argocd-app)

Motivation

The issue we encountered is the given LUA script only consider the health of the resource but not the sync status. Thus we ended up that sometimes sync waves got triggered even though Apps were not yet synced properly

Proposal

We extended the LUA script to also consider the sync state as follows:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.health.argoproj.io_Application: | 
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      if obj.status.health ~= nil then
        hs.status = obj.status.health.status
        if obj.status.health.message ~= nil then
          hs.message = obj.status.health.message
        end
        local syncStatus = (obj.status.sync and obj.status.sync.status or nil)
        if hs.status == "Healthy" and syncStatus ~= "Synced" then
          hs.status = "Progressing"
        end
      end
    end
    return hs

roelvdberg commented 2 years ago

I have tried this health check but unfortunately it keeps waiting when hitting a sync window. Unfortunately we can't detect sync windows in the application manifest.

cconnert commented 2 years ago

I have tried this health check but unfortunately it keeps waiting when hitting a sync window. Unfortunately we can't detect sync windows in the application manifest.

What exactly do you mean by sync windows (https://argo-cd.readthedocs.io/en/stable/user-guide/sync_windows/)? We currently do not use sync windows. But how does the original check and other custom resource check **(https://github.com/argoproj/argo-cd/tree/master/resource_customizations) cope with sync windows?

We noticed that the app-of-app might get stuck in "Progressing", but decided it's better then having a non synced rollout in our scenario. I guess we can add a termination check for this case (retryCount and limit limit should be processable),

roelvdberg commented 2 years ago

Most resource customization don't have to check sync windows. They don't check the ArgoCD applications. Even the current check doesn't break on sync windows because he doesn't check the sync state.

This change introduces the issue that the application is processing as long as the application isn't in sync. Which is correct in almost all the scenarios I can think off except when someone is using a sync window.

The Sync Policy which can be retrieved by using the ArgoCD cli isn't stored in the status of the application. This makes it impossible to exclude those applications from the proposed sync status check. I added an extra annotation to indicate that it is an application using a sync window and can be excluded from this new check.

cconnert commented 2 years ago

Thanks for the explanation.

Ihmo. the sync of an Application can only be aborted a controller and not within the custom resource check.

Anyhow I enhanced the check to respect retry limit and count (not tested yet):

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.health.argoproj.io_Application: | 
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      if obj.status.health ~= nil then
        hs.status = obj.status.health.status
        if obj.status.health.message ~= nil then
          hs.message = obj.status.health.message
        end
        local syncStatus = (obj.status.sync and obj.status.sync.status or nil)
        if hs.status == "Healthy" and syncStatus ~= "Synced" then
          hs.status = "Progressing"
        end
        if obj.operationState ~= nil then
          local operationState = obj.operationState
          local retryLimit = (operationState.operation and operationState.operation.retry and operationState.operation.retry.limit or nil)
          local retryCount = (operationState.retryCount or nil)
          if retryLimit == retryCount and hs.status == "Progressing" then
            hs.status = "Degraded"
            hs.message = "Retry limit reached while syncing all Application"
          end
        end
      end
    end
    return hs

Where did you add the extra annotation?

roelvdberg commented 2 years ago

The annotation is on the application which contains a sync window and we added to the application which has a sync window.

I see that we have a similar addition to check the operationsState. We check if the operationState phase is failed.

if hs.status == "Healthy" and obj.status.operationState ~= nil and obj.status.operationState.phase == "Failed" then
  hs.status = "Degraded"
  hs.message = obj.status.operationState.message
end

I am not sure if it is better to check the retry limit, never tested that.

cconnert commented 2 years ago

Thanks for the suggestion, I'll try out both approaches and report my results

cconnert commented 2 years ago

Hi,

so it turned out that we actually did not need to look the retry count / operation status. Instead we added a check for error conditions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.useOpenLibs.argoproj.io_Application: "true"
  resource.customizations.health.argoproj.io_Application: | # see https://argo-cd.readthedocs.io/en/stable/operator-manual/health/#argocd-app
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      local status = obj.status
      if status.conditions ~= nil then
        for i, condition in ipairs(status.conditions) do
          if condition.type ~= nil and string.match(condition.type, '.*Error$') then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
        end
      end
      if status.health ~= nil then
        local health = status.health
        hs.status = health.status
        if health.message ~= nil then
          hs.message = health.message
        end
        local syncStatus = (status.sync and status.sync.status or nil)
        if hs.status == "Healthy" and syncStatus ~= "Synced" then
          hs.status = "Progressing"
        end
      end
    end
    return hs

aslafy-z commented 1 year ago

@cconnert can you update OP with the updated lua script and re-add the return hs at the end? Thank you!

cconnert commented 1 year ago

@aslafy-z Got lost during coping, thanks for pointing it out. Update the post

argoproj / argo-cd

Argocd App health check #10550

Summary

Motivation

Proposal