argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Prometheus-operator sync stuck progressing #11261

Open strowi opened 2 years ago

strowi commented 2 years ago

Describe the bug

When deploying Prometheus as a custom resource, the resource is synced OK but stays stuck in Progressing, waiting for a healthy state of monitoring.coreos.com/Prometheus/rancher-monitoring-prometheus.

To Reproduce

Deploy rancher-monitoring-crd and rancher-monitoring via Helm.

Create a repo with:

helm fetch rancher-charts/rancher-monitoring-crd --version 100.1.3+up19.0.3
helm fetch rancher-charts/rancher-monitoring --version 100.1.3+up19.0.3
mkdir apps && for x in *.tgz; do tar xf "$x" -C apps; done

Then apply this ApplicationSet:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-addons
spec:
  generators:
    - matrix:
        generators:
          - git:
              repoURL: https://gitlab.com/aedifion.io/XYZ/cluster-addons.git
              revision: HEAD
              directories:
              - path: apps/*

          - clusters:
              selector:
                matchLabels:
                  monitoring: "true"
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: system
      source:
        repoURL: https://gitlab.com/aedifion.io/XYZ/cluster-addons.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: "{{ server }}"
        namespace: 'cattle-monitoring-system'
      syncPolicy:
        automated: {}

Expected behavior

Rancher-Monitoring Chart should be fully synced and healthy.

Version

argocd: v2.4.6+a48bca0
  BuildDate: 2022-07-12T22:56:26Z
  GitCommit: a48bca03c79b6d63be0c34d6094831bc6916b3bc
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64
WARN[0000] Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web. 
argocd-server: v2.5.2+148d8da
  BuildDate: 2022-11-07T16:42:47Z
  GitCommit: 148d8da7a996f6c9f4d102fdd8e688c2ff3fd8c7
  GitTreeState: clean
  GoVersion: go1.18.8
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.1+g9f88ccb
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0

Logs

time="2022-11-10T10:48:07Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for healthy state of monitoring.coreos.com/Prometheus/rancher-monitoring-prometheus" application=argocd/rancher-monitoring
# Can provide more if relevant, but I can't find anything useful ;)

Might be related to https://github.com/argoproj/argo-cd/pull/10508

regards, strowi

joaosilva15 commented 2 years ago

Hey @strowi 👋 I don't think this is an Argo CD bug. What version of Prometheus operator are you using? The status was only added in version 0.56 of the operator. If you use an earlier version, the status that the health check reads won't be in the CRD, and the health will always be Progressing.
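One way to tell which case applies is to check whether the Prometheus object reports an Available condition at all. A minimal Python sketch, assuming the object has been dumped to JSON (for example with kubectl get ... -o json); the function name exposes_available_condition and the sample objects are hypothetical and not part of Argo CD or the operator:

```python
import json


def exposes_available_condition(prometheus_obj: dict) -> bool:
    """Return True if the object has a status.conditions entry of type Available.

    Without such an entry, a conditions-based health check has nothing to read.
    """
    status = prometheus_obj.get("status") or {}
    conditions = status.get("conditions") or []
    return any(c.get("type") == "Available" for c in conditions)


# Illustrative objects only; real ones would come from the cluster.
without_status = json.loads(
    '{"kind": "Prometheus", "metadata": {"name": "rancher-monitoring-prometheus"}}'
)
with_status = json.loads(
    '{"kind": "Prometheus", "status": {"conditions": [{"type": "Available", "status": "True"}]}}'
)

print(exposes_available_condition(without_status))  # False
print(exposes_available_condition(with_status))     # True
```

If this returns False for an object, no amount of waiting will move it out of Progressing under a check that only reads status.conditions.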

drogbeer commented 2 years ago

After upgrading to 2.5.0 I've had the same issue. Also noticed the same behaviour on the demo site https://cd.apps.argoproj.io/applications/prometheus-operator?resource=

strowi commented 2 years ago

@joaosilva15 hey, and thank you for the info. That could indeed be the cause: Rancher is still using 0.50.x. @drogbeer I think that is also an older version of the prometheus-operator; the code suggests 0.46 to me.

mkilchhofer commented 2 years ago

I also faced the issue but after upgrading the monitoring stack from Chart version 34.9.0 to 41.7.4, everything is fine again.

ZiaUrRehman-GBI commented 1 year ago

Hey @mkilchhofer, I'm facing this Progressing health status on the kind: Prometheus object. Environment: GKE clusters, Argo CD v2.5.4, Prometheus image quay.io/prometheus/prometheus:v2.40.0.

aslafy-z commented 1 year ago

My Argo CD manages different versions of kube-prometheus-stack; some still do not expose the status, others do. I applied the following patch, which lets the health check work for the ones that expose a status and lets me opt out for the ones that don't:

    resource.customizations.health.monitoring.coreos.com_Prometheus: |
      -- Opt-out: any Prometheus object carrying this annotation is reported Healthy.
      if obj.metadata.annotations ~= nil and obj.metadata.annotations["argocd.argoproj.io/skip-health-check"] ~= nil then
        local hs = {}
        hs.status = "Healthy"
        hs.message = "Ignoring Prometheus Health Check"
        return hs
      end

      -- Default until the operator reports an Available condition.
      local hs = { status = "Progressing", message = "Waiting for initialization" }
      if obj.status ~= nil then
        if obj.status.conditions ~= nil then
          for i, condition in ipairs(obj.status.conditions) do
            -- Available but not True: still rolling out, or degraded.
            if condition.type == "Available" and condition.status ~= "True" then
              if condition.reason == "SomePodsNotReady" then
                hs.status = "Progressing"
              else
                hs.status = "Degraded"
              end
              hs.message = condition.message or condition.reason
            end
            -- Available and True: all instances are up.
            if condition.type == "Available" and condition.status == "True" then
              hs.status = "Healthy"
              hs.message = "All instances are available"
            end
          end
        end
      end
      return hs

I then added the argocd.argoproj.io/skip-health-check to my old Prometheus objects.

Related discussion: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1671558024083149
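To reason about what the Lua health script above returns for different objects without a cluster, here is a rough Python port. It is only a sketch of the same branching: the function name prometheus_health and the sample objects are hypothetical, and Argo CD itself runs the Lua version, not this code.

```python
SKIP_ANNOTATION = "argocd.argoproj.io/skip-health-check"


def prometheus_health(obj: dict) -> dict:
    """Mirror the branching of the Lua health script for a Prometheus object."""
    annotations = (obj.get("metadata") or {}).get("annotations") or {}
    # Opt-out: the annotation's presence is enough, its value is not inspected.
    if annotations.get(SKIP_ANNOTATION) is not None:
        return {"status": "Healthy", "message": "Ignoring Prometheus Health Check"}

    # Default when the object has no status or no Available condition.
    hs = {"status": "Progressing", "message": "Waiting for initialization"}
    for condition in (obj.get("status") or {}).get("conditions") or []:
        if condition.get("type") != "Available":
            continue
        if condition.get("status") == "True":
            hs["status"] = "Healthy"
            hs["message"] = "All instances are available"
        else:
            if condition.get("reason") == "SomePodsNotReady":
                hs["status"] = "Progressing"
            else:
                hs["status"] = "Degraded"
            message = condition.get("message")
            hs["message"] = message if message is not None else condition.get("reason")
    return hs


# An object from an old operator (no status) stays Progressing ...
old_obj = {"metadata": {"name": "rancher-monitoring-prometheus"}}
print(prometheus_health(old_obj)["status"])  # Progressing

# ... unless it carries the skip annotation.
old_obj_skipped = {"metadata": {"annotations": {SKIP_ANNOTATION: "true"}}}
print(prometheus_health(old_obj_skipped)["status"])  # Healthy
```

The port makes the failure mode visible: with no status.conditions, only the annotation branch can produce Healthy.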

cheskayang commented 1 year ago

same issue after upgrading to 2.6.0

murand78 commented 1 year ago

The workaround shown by aslafy-z did not work for me as-is with Argo CD 2.6.3 and the Rancher monitoring Helm chart 100.1.3+up19.0.3.

I needed to configure argocd-cm like this (note: obj.metadata.annotations):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations: |
    monitoring.coreos.com/Prometheus:
      health.lua: |
        if obj.metadata.annotations ~= nil and obj.metadata.annotations["argocd.argoproj.io/skip-health-check"] ~= nil then
          hs = {}
          hs.status = "Healthy"
          hs.message = "Ignoring Prometheus Health Check"
          return hs
        end

        hs={ status = "Progressing", message = "Waiting for initialization" }
        if obj.status ~= nil then
          if obj.status.conditions ~= nil then
            for i, condition in ipairs(obj.status.conditions) do

              if condition.type == "Available" and condition.status ~= "True" then
                if condition.reason == "SomePodsNotReady" then
                  hs.status = "Progressing"
                else
                  hs.status = "Degraded"
                end
                hs.message = condition.message or condition.reason
              end
              if condition.type == "Available" and condition.status == "True" then
                hs.status = "Healthy"
                hs.message = "All instances are available"
              end
            end
          end
        end
        return hs
  resource.customizations.useOpenLibs.monitoring.coreos.com_Prometheus: 'true'

Then pass the Prometheus annotation to the rancher-monitoring Helm chart in values.yaml:

  prometheus:
    annotations:
      argocd.argoproj.io/skip-health-check: 'true'

FalconerTC commented 1 year ago

Still happening with kube-prometheus-stack 45.27.2 (prometheus-operator v0.65.1)

streaming-pete commented 1 year ago

Argo CD 2.7.3, rancher-monitoring 102.0.0+up40.1.2

The application would deploy but never complete. Adding @murand78's workaround above 'fixed' the issue after a few syncs. Thanks @murand78. Ace work!

TheMatrix97 commented 7 months ago

Hi! Same problem here with the latest kube-prometheus-stack version, 58.1.3. I'm not sure whether this should be fixed on the Argo CD side. The proposed patch is quite an ad hoc solution, but it might be interesting to add to Argo CD: we could implement the argocd.argoproj.io/skip-health-check annotation in Argo CD itself, to allow skipping health checks for problematic resources like this one.

Edit: I think this should be considered https://github.com/argoproj/argo-cd/issues/11782