emberstack / kubernetes-reflector

Custom Kubernetes controller that can be used to replicate secrets, configmaps and certificates.
MIT License

Significant CPU usage and possibly etcd usage when deploying this #447

Open drewwells opened 2 months ago

drewwells commented 2 months ago

We noticed our etcd storage usage doubled after a production release that included deploying reflector. Is there an architecture document describing how this service watches for object changes and decides which API calls it makes to the kube-apiserver?
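
For reference, one way to confirm the etcd growth, assuming etcdctl v3 and direct access to the etcd members (this is not from the original report):

# the DB SIZE column shows the on-disk database size per member;
# compare it before and after deploying reflector
ETCDCTL_API=3 etcdctl endpoint status --write-out=table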

We have one configmap that rarely changes. These are the labels and annotations on it:

metadata:
  annotations:
    checksum/configmap: 4420642124fb6c99affe13e8904ba3ede9bee1d41edc0df8a50696833fe15fca
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
    reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
  creationTimestamp: "2024-04-16T20:07:50Z"
  labels:
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"

Here's the CPU and memory usage of reflector

❯ k -n reflector top po --containers
POD                          NAME        CPU(cores)   MEMORY(bytes)
reflector-5bc45489b8-k9g7f   reflector   1423m        332Mi
drewwells commented 2 months ago

I see this happening every 3 seconds. Does this service act like a watcher, watching for changes in the cluster? Can we add labels so it only looks at specific configmaps instead of looking at all of them?

[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:10.306 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Session closed. Duration: 00:00:03.3548677. Faulted: False.
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:10.306 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Requesting V1ConfigMap resources
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:13.336 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapMirror) Auto-reflected feature-flag/ff-feature-flag where permitted. Created 0 - Updated 0 - Deleted 0 - Validated 299.
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:13.343 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Session closed. Duration: 00:00:03.0375045. Faulted: False.
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:13.343 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Requesting V1ConfigMap resources
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:15.826 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Session closed. Duration: 00:00:02.4830903. Faulted: False.
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:15.826 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Requesting V1ConfigMap resources
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:19.845 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapMirror) Auto-reflected feature-flag/ff-feature-flag where permitted. Created 0 - Updated 0 - Deleted 0 - Validated 299.
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:19.859 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Session closed. Duration: 00:00:04.0327364. Faulted: False.
[reflector-5bc45489b8-k9g7f] 2024-06-10 16:08:19.859 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Requesting V1ConfigMap resources
winromulus commented 2 months ago

@drewwells reflector opens a watcher with a default timeout (in k8s) of around 40 minutes. The fact that the connection closes every 3 seconds is extremely odd. I would need to know more about the setup. Also, are you sure you didn't set the timeout to something like 3 seconds in the configuration? Please add more details about where your k8s is hosted and whether it's standard k8s or some other variant.
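
For context, a watch session against the API server is a long-lived list request with watch=true; the server closes it after timeoutSeconds and the client re-establishes it. A hedged sketch of such a request with kubectl (not reflector's own code; the 2400-second value is only an illustration of the roughly 40-minute default mentioned above):

# stream ConfigMap events until the server closes the session after timeoutSeconds
kubectl get --raw "/api/v1/configmaps?watch=true&timeoutSeconds=2400"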

drewwells commented 2 months ago

Nothing special about the cluster, it's running v1.25.10

winromulus commented 2 months ago

@drewwells Is this standard k8s or another flavor (like k3s or something)? Also, are you self-hosting or using a cloud provider?

drewwells commented 2 months ago

It's deployed with kops and the nodes are hosted on AWS. Hmm, usage is vastly different across clusters. The only thing that is consistent is significant etcd storage usage, roughly 2x what it was before deploying the service.

# staging environment
❯ k -n reflector top po --containers                                                                                                                   
POD                         NAME        CPU(cores)   MEMORY(bytes)
reflector-c786c5fb4-jmqg9   reflector   4m           163Mi

# dev environment
❯ k -n reflector top po --containers                                                                                                                  
POD                          NAME        CPU(cores)   MEMORY(bytes)
reflector-5bc45489b8-k9g7f   reflector   1154m        322Mi
zzjin commented 2 months ago

Same issue here:

# env1
POD                          NAME        CPU(cores)   MEMORY(bytes)
reflector-5dddff7688-rp6tx   reflector   1403m        205Mi

# env2
POD                          NAME        CPU(cores)   MEMORY(bytes)
reflector-64dcc58c5f-wrh8q   reflector   2636m        370Mi

What I found is that CPU usage is high when there are many secrets/configmaps.

> kubectl get secrets -A | wc -l

# env1
24121

# env2
88112

Both clusters are doing one and only one thing: copying a given namespace's TLS secret to other namespaces. That means env1 has one base TLS secret and roughly 20k+ reflected secrets, and env2 has one base secret and roughly 80k+ reflected secrets. The base secret barely changes (it is renewed about every 90 days).

  annotations:
    cert-manager.io/alt-names: "*.example.io,example.io"
    cert-manager.io/certificate-name: wildcard-example-io
    cert-manager.io/common-name: example.io
    cert-manager.io/ip-sans: ""
    cert-manager.io/issuer-group: ""
    cert-manager.io/issuer-kind: ClusterIssuer
    cert-manager.io/issuer-name: cluster-issuer-example
    cert-manager.io/uri-sans: ""
    reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
    reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces: \w+-system,\w+-frontend,ns-[\-a-z0-9]*
    reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
    reflector.v1.k8s.emberstack.com/reflection-auto-namespaces: \w+-system,\w+-frontend,ns-[\-a-z0-9]*
  labels:
    controller.cert-manager.io/fao: "true"

IMO, the reflector controller should only need to monitor the one source namespace's secret and copy it to the others when changes happen. I wonder why CPU usage is related to the cluster's total secret count?

Kubernetes is a standard deployment on GCP VMs.

drewwells commented 2 months ago

An easy way to limit the watchers would be labels. Also, usage goes up after it creates configmaps or secrets; I don't think it needs to watch the generated resources. If people change them, let it be until the next sync wipes out those changes.
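
A sketch of that idea with kubectl, reusing the reflection-allowed label from the configmap shown earlier; whether reflector itself exposes such a label selector is exactly what is being asked for here:

# only list/watch configmaps that opted in via the reflector label,
# instead of every configmap in the cluster
kubectl get configmaps -A --watch \
  -l reflector.v1.k8s.emberstack.com/reflection-allowed=true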

arjun-beathi commented 1 month ago

In my case, when the same secret name in two different namespaces tries to sync to a common namespace, that's when I see reflector closing the connection every 4 seconds.

example: "secretA" from "nsA" and "nsB" trying to sync to "nsC". Easy reproducible issue.