argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.84k stars · 5.45k forks

Argo fails to detect pod ready state for some operators #7259

Open sarahhenkens opened 3 years ago

sarahhenkens commented 3 years ago


Describe the bug

When using the ArangoDB operator with ArgoCD, any pod created (and attached to the custom resource) gets stuck in a permanent "Progressing" state, even though `kubectl describe pod <pod-id>` shows a Ready condition:

```
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
```

Discussion in Slack: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1631432657161800

To Reproduce

```shell
export URLPREFIX=https://github.com/arangodb/kube-arangodb/releases/download/1.2.2
helm install $URLPREFIX/kube-arangodb-crd-1.2.2.tgz
helm install $URLPREFIX/kube-arangodb-1.2.2.tgz
```

Then create an Application that loads all the examples from https://github.com/arangodb/kube-arangodb into the default namespace:

```yaml
project: default
source:
  repoURL: 'https://github.com/arangodb/kube-arangodb'
  path: examples
  targetRevision: HEAD
destination:
  server: 'https://kubernetes.default.svc'
  namespace: default
```


Expected behavior

The pod is expected to show as Healthy in the ArgoCD UI once it is running and its Ready condition is true.

Version

v2.1.2+7af9dfb
sarahhenkens commented 3 years ago

Hmm, I think this is the same issue as https://github.com/argoproj/argo-cd/issues/7182. This operator sets the restart policy to `Never`, and ArgoCD keeps those pods in a Progressing state:

Inside `getCorev1PodHealth`:

```go
case corev1.PodRunning:
    switch pod.Spec.RestartPolicy {
    case corev1.RestartPolicyAlways:
        // if pod is ready, it is automatically healthy
        if podutils.IsPodReady(pod) {
            return &HealthStatus{
                Status:  HealthStatusHealthy,
                Message: pod.Status.Message,
            }, nil
        }
        // if it's not ready, check to see if any container terminated, if so, it's degraded
        for _, ctrStatus := range pod.Status.ContainerStatuses {
            if ctrStatus.LastTerminationState.Terminated != nil {
                return &HealthStatus{
                    Status:  HealthStatusDegraded,
                    Message: pod.Status.Message,
                }, nil
            }
        }
        // otherwise we are progressing towards a ready state
        return &HealthStatus{
            Status:  HealthStatusProgressing,
            Message: pod.Status.Message,
        }, nil
    case corev1.RestartPolicyOnFailure, corev1.RestartPolicyNever:
        // pods set with a restart policy of OnFailure or Never, have a finite life.
        // These pods are typically resource hooks. Thus, we consider these as Progressing
        // instead of healthy.
        return &HealthStatus{
            Status:  HealthStatusProgressing,
            Message: pod.Status.Message,
        }, nil
    }
}
```
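The decision table above can be distilled into a minimal, self-contained sketch. The types here are simplified stand-ins for illustration, not the real `corev1`/gitops-engine API:

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types used in the snippet above.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

type Pod struct {
	Phase         string // e.g. "Running"
	Ready         bool   // the Ready condition shown by `kubectl describe pod`
	RestartPolicy RestartPolicy
}

// podHealth mirrors the branch of getCorev1PodHealth discussed in this issue:
// a Running pod is only reported Healthy when its restart policy is Always.
func podHealth(p Pod) string {
	if p.Phase != "Running" {
		return "Progressing"
	}
	switch p.RestartPolicy {
	case RestartPolicyAlways:
		if p.Ready {
			return "Healthy"
		}
		return "Progressing"
	default: // OnFailure and Never never reach Healthy here
		return "Progressing"
	}
}

func main() {
	operatorPod := Pod{Phase: "Running", Ready: true, RestartPolicy: RestartPolicyNever}
	fmt.Println(podHealth(operatorPod)) // stuck at Progressing despite Ready=true
}
```

This makes the symptom easy to see: the `Never` branch returns Progressing unconditionally, so operator-managed pods with that policy can never become Healthy no matter what their Ready condition says.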
sarahhenkens commented 3 years ago

Root cause inside the ArangoDB operator: https://github.com/arangodb/kube-arangodb/blob/13f3e2a09b4c6c08f050efffc364d498b1293dcf/pkg/util/k8sutil/pods.go#L433

Is there a way to configure ArgoCD so that these pods can still be considered healthy?
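One avenue worth exploring: ArgoCD supports custom health checks as Lua scripts under `resource.customizations` in the `argocd-cm` ConfigMap. Below is an unverified sketch of a Pod override that reports Healthy whenever the Ready condition is True, ignoring the restart policy; the exact key format and whether built-in Go health checks for core kinds can be overridden may depend on the ArgoCD version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Hypothetical override for core/v1 Pods; verify the key format for
  # your ArgoCD release before relying on this.
  resource.customizations.health.Pod: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for pod to become Ready"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = "Pod is Ready"
        end
      end
    end
    return hs
```

Note this would change health reporting for all Pods in the cluster, including resource hooks that rely on the current Progressing behavior, so it is a blunt workaround rather than a fix.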

wanghong230 commented 3 years ago

What is the rationale for ArangoDB setting the restart policy to Never? I believe that question has to be explored.

sarahhenkens commented 3 years ago

From the linked ticket:

> We do not want to allow Pod restarts; the full lifecycle is managed by the Operator (the Operator recreates pods and takes care of shards).

wanghong230 commented 3 years ago

We need to have a quick discussion about this. I will bring it up in tomorrow's maintainer meeting.

wanghong230 commented 3 years ago

The same issue: https://github.com/argoproj/argo-cd/issues/7182

sarahhenkens commented 3 years ago

@wanghong230, any updates from the maintainer meeting?

michael-barker commented 2 years ago

I have this same issue when using the Spark Operator. The driver and executor pods have a restart policy of `Never` and continue to show Progressing while the pod state is Running.

mh013370 commented 2 years ago

There are many operators that behave this way including Koperator and NiFiKop. This behavior should at least be configurable through an Application/ApplicationSet.

trasyia commented 2 years ago

I have the same issue with the Spark operator, too. Many operators set pods' restart policy to `Never`.

lintong commented 10 months ago

... and also with the Task Manager container that is managed by the Flink Operator.

https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template

mikejoh commented 6 months ago

Related/duplicate issue: https://github.com/argoproj/argo-cd/issues/7182.