maggie44 opened this issue 3 weeks ago
I see in the logs you have "Refreshing app status (spec.source differs), level (3)". Can you share the manifest of the argocd/prometheus application? This log is surprising to me.
It is normal to see a burst of reconciles when an app is first deployed, because all the resources will be modified by their respective controllers, causing a lot of updates. However, if skipping the CRDs fixed it, my guess is that you had another app also syncing these CRDs, and both were fighting for ownership.
Have you tried the FailOnSharedResource=true option? It helps discover apps/resources that are in conflict, which is often the case with non-namespaced Kubernetes resources.
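For reference, a minimal sketch of enabling it on the prometheus Application from this thread (only the relevant fields are shown; the rest of the spec stays as it is):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      # FailOnSharedResource makes the sync fail as soon as it finds a resource
      # that is already managed by a different Application, which surfaces the
      # conflicting resource in the sync result.
      - FailOnSharedResource=true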
Here is the app as it is deployed:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.gh-tests-action-webhook: ""
    notifications.argoproj.io/subscribe.on-deployed.telegram: -x
    notifications.argoproj.io/subscribe.on-health-degraded.telegram: -x
spec:
  destination:
    server: "https://kubernetes.default.svc"
    namespace: monitoring
  syncPolicy:
    # automated sync by default retries failed attempts 5 times with following delays between attempts ( 5s, 10s, 20s, 40s, 80s ); retry controlled using `retry` field.
    automated:
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
  project: default
  source:
    chart: bitnamicharts/kube-prometheus
    repoURL: registry-1.docker.io
    targetRevision: 9.6.3
    helm:
      skipCrds: true
      valuesObject:
        kube-state-metrics:
          nodeSelector:
            meshofthings.io/type: "nes"
        node-exporter:
          serviceMonitor:
            # Changes the 'instance' label from the IP to the hostname
            relabelings:
              - action: replace
                sourceLabels: [__meta_kubernetes_pod_node_name]
                targetLabel: instance
        blackboxExporter:
          nodeSelector:
            meshofthings.io/type: "nes"
        alertmanager:
          serviceMonitor:
            relabelings:
              - action: replace
                sourceLabels: [__meta_kubernetes_pod_node_name]
                targetLabel: instance
          config:
            route:
              group_by: ["instance"]
              group_wait: 30s
              group_interval: 5m
              repeat_interval: 30m
              receiver: telegram-receiver
              routes:
                - receiver: null
            receivers:
              - name: telegram-receiver
                telegram_configs:
                  - bot_token_file: /etc/alertmanager/secrets/telegram-api/TELEGRAM_API_KEY
                    chat_id: -966481322
          nodeSelector:
            meshofthings.io/type: "nes"
          secrets:
            - telegram-api
        operator:
          nodeSelector:
            meshofthings.io/type: "nes"
        prometheus:
          enableFeatures:
            - auto-gomemlimit
          resources:
            requests:
              cpu: "10m"
              memory: 1044Mi
              ephemeral-storage: 100Mi
            limits:
              memory: 1500Mi
              ephemeral-storage: 1024Mi
          scrapeInterval: 60s
          nodeSelector:
            meshofthings.io/type: "nes"
          additionalPrometheusRules:
            - name: imported-alerts
              groups:
                - name: custom
                  rules:
                    - alert: NodeCordoned
                      expr: kube_node_status_condition{condition="Ready",status="false"} == 1
                      for: 1m
                      annotations:
                        summary: "Node {{ $labels.node }} cordoned"
                        description: "The node {{ $labels.node }} has been cordoned."
                    - alert: NodeU
...
Here is the source:
https://github.com/bitnami/charts/tree/main/bitnami/kube-prometheus
There isn't anything else installed on the cluster that I can think of that would be auto-updating CRDs (i.e. something like ArgoCD); it is a very specific task. My best guess would have been another app deployed through ArgoCD that is asking for a difference, but if that were the case I would have expected to see both apps in the logs, in a reconcile loop. If it is another app, the most likely candidate would probably be Loki:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.gh-tests-action-webhook: ""
    notifications.argoproj.io/subscribe.on-deployed.telegram: x
    notifications.argoproj.io/subscribe.on-health-degraded.telegram: -x
spec:
  project: default
  source:
    chart: loki
    repoURL: https://grafana.github.io/helm-charts
    targetRevision: 6.16.0
    helm:
      releaseName: loki
      valuesObject:
        mode: SingleBinary
        loki:
          commonConfig:
            # If increasing the replica count, add S3 storage:
            # https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/
            replication_factor: 1
          storage:
            type: "filesystem"
          schemaConfig:
            configs:
              - from: "2024-01-01"
                store: tsdb
                index:
                  prefix: loki_index_
                  period: 24h
                object_store: filesystem # we're storing on the filesystem, so there's no real persistence here.
                schema: v13
          limits_config:
            retention_period: 744h
          compactor:
            retention_enabled: true
            delete_request_store: filesystem
          auth_enabled: false
        chunksCache:
          enabled: false
        resultsCache:
          enabled: false
        gateway:
          nodeSelector:
            meshofthings.io/type: "nes"
        deploymentMode: SingleBinary
        singleBinary:
          replicas: 1
          nodeSelector:
            meshofthings.io/type: "nes"
        read:
          replicas: 0
        backend:
          replicas: 0
        write:
          replicas: 0
  destination:
    server: "https://kubernetes.default.svc"
    namespace: monitoring
  syncPolicy:
    # automated sync by default retries failed attempts 5 times with following delays between attempts ( 5s, 10s, 20s, 40s, 80s ); retry controlled using `retry` field.
    automated:
      # Specifies if resources should be pruned during auto-syncing. Only applies to resources managed by ArgoCD.
      prune: true
      # Specifies if partial app sync should be executed when resources are changed only in target Kubernetes cluster and no git change detected ( false by default ).
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      - RespectIgnoreDifferences=true
  # The helm values modify this template at deployment time, which means it will diff from the online version. This field ignores the modified
  # section to prevent an OutOfSync loop.
  ignoreDifferences:
    - group: apps
      kind: StatefulSet
      jsonPointers:
        - /spec/volumeClaimTemplates
    - group: monitoring.grafana.com
      kind: PodLogs
      jsonPointers:
        - /spec/relabelings
Applying FailOnSharedResource=true and disabling skipCrds results in the same loop issue.
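One way to check whether something else is writing to these CRDs is to look at server-side apply field ownership, for example with kubectl get crd servicemonitors.monitoring.coreos.com -o yaml --show-managed-fields. A purely illustrative sketch of what shared ownership would look like follows; the manager names are placeholders, not taken from this cluster:

metadata:
  managedFields:
    # Two different field managers claiming the same CRD spec would explain
    # self-heal repeatedly trying to reconcile the resource.
    - manager: argocd-controller      # the Application syncing the chart's CRDs
      operation: Apply
      apiVersion: apiextensions.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec: {}
    - manager: example-other-operator # placeholder for a second writer, if one exists
      operation: Update
      apiVersion: apiextensions.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec: {}

If a second manager does show up, ArgoCD's ignoreDifferences also accepts a managedFieldsManagers list, so that fields owned by that manager stop producing a diff.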
Describe the bug
Summary
I have deployed Prometheus using Helm charts. In the ArgoCD controller I was seeing repeated "Refreshing app status (spec.source differs), level (3)" messages.
You can see from the above that it is stuck in a loop, making constant requests to reconcile. The fix was to set skipCrds: true, as it was not able to reconcile the difference.
The first issue is why the CRDs will not sync. There are no other attempts to sync them. My best guess would be that it is competing with another deployment that wants different CRD versions, but if that were the case I would expect to see it in the logs (an alternative to skipping the CRDs entirely is sketched at the end of this report).
Second, the dashboard isn't displaying any issues and shows everything as reconciled.
Finally, what this loop produced was a constant stream of requests to the Kubernetes API that spiked CPU and memory usage. On my 8GB control plane, memory usage dropped from 6.5GB to 4.8GB and stabilised at a more consistent level after I enabled skipCrds: true.
The change in ArgoCD controller CPU usage after enabling skipCrds:
Version
ArgoCD: v2.12.3+6b9cd82
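A hedged aside, untested on this cluster: instead of skipping the CRDs entirely, it may be possible to keep syncing them while having ArgoCD ignore only the fields owned by whichever other field manager the ownership check reveals. The manager name below is a placeholder, and the rest of the Application spec is omitted:

spec:
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
      # RespectIgnoreDifferences makes syncs honour the ignoreDifferences list
      # below, not just the diff view.
      - RespectIgnoreDifferences=true
  ignoreDifferences:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
      # Placeholder: replace with whatever competing field manager shows up
      # on the prometheus-operator CRDs, if any.
      managedFieldsManagers:
        - example-other-controller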