crossplane-contrib / provider-argocd

Crossplane provider to provision and manage Argo CD objects
Apache License 2.0

Potential Memory Leak in v0.5.0 #138

Closed jefflantz closed 6 months ago

jefflantz commented 7 months ago

What happened?

For context, my control plane runs in an environment that cannot directly reach our Argo CD server endpoints, so I run the provider indirectly via a deployment on the cluster running Argo CD. What I've observed is that the memory allocated to this pod increases by roughly 1.3 GB/day, consistently across four environments. I've checked this using kubectl top pods and our stored metrics graphs:

k top pod argocd-provider-controller-xxxx -n argocd
NAME                                                         CPU(cores)   MEMORY(bytes)   
argocd-provider-controller-xxxx   15m          22729Mi  

[Image: memory usage graph]

This is leading to pods being evicted for putting MemoryPressure on the underlying node.

How can we reproduce it?

First, I manually create a ControllerConfig and a Provider:

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-argocd-xxxxx-xxxxx
spec:
  replicas: 0
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-argocd-yyyyy-yyyyy
spec:
  controllerConfigRef:
    name: provider-argocd-xxxxx-xxxxx
  package: xpkg.upbound.io/crossplane-contrib/provider-argocd:v0.5.0

There is a Secret named argocd-credentials containing an Argo CD token under the key authToken, referenced by this ProviderConfig:

apiVersion: argocd.crossplane.io/v1alpha1
kind: ProviderConfig
metadata:
  name: argocd
spec:
  serverAddr: <addr>
  insecure: true
  plainText: false
  credentials:
    source: Secret
    secretRef:
      namespace: default
      name: argocd-credentials
      key: authToken

In the cluster running ArgoCD, I'm using the following deployment spec for the controller:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-provider-controller-<control plane name>
  namespace: argocd
  annotations:
    uptest.upbound.io/conditions: "Available=True"
    uptest.upbound.io/pre-delete-hook: testhooks/delete.sh
  labels:
    app: provider-argocd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: provider-argocd
  template:
    metadata:
      labels:
        app: provider-argocd
    spec:
      containers:
        - name: server
          image: xpkg.upbound.io/crossplane-contrib/provider-argocd:v0.5.0
          imagePullPolicy: IfNotPresent
          args:
            - --debug
          env:
            - name: DO_NOTHING
              value: "false"
            - name: KUBECONFIG
              value: /etc/mcp-kubeconfig/kubeconfig
          volumeMounts:
            - name: mcp-kubeconfig
              mountPath: /etc/mcp-kubeconfig
      volumes:
        - name: mcp-kubeconfig
          secret:
            secretName: mcp-kubeconfig-<control plane name>

What environment did it happen in?

Crossplane version: 1.14.3-up.1 (note this is running in Upbound SaaS)
Crossplane provider-argocd version: v0.5.0

The deployments are running in EKS, and I've seen it on clusters running EKS versions v1.24 and v1.28.

haarchri commented 7 months ago

@MisterMX @janwillies any ideas?

MisterMX commented 7 months ago

We haven't found anything yet that indicates the cause of this issue. However, we also have not encountered this problem in our own systems.

v0lkc commented 7 months ago

This could be caused by the Argo CD API clients, which never get closed. Since they are kept alive forever, they may accumulate more and more data in memory. I would suggest starting the investigation there; the affected constructors are linked below, with a rough sketch of the always-close pattern after them:

https://github.com/crossplane-contrib/provider-argocd/blob/5423b0899eb4e296ba9f995cb496d72581d3b0fe/pkg/clients/cluster/client.go#L33

https://github.com/crossplane-contrib/provider-argocd/blob/5423b0899eb4e296ba9f995cb496d72581d3b0fe/pkg/clients/clusters/client.go#L33

https://github.com/crossplane-contrib/provider-argocd/blob/5423b0899eb4e296ba9f995cb496d72581d3b0fe/pkg/clients/projects/client.go#L32

https://github.com/crossplane-contrib/provider-argocd/blob/5423b0899eb4e296ba9f995cb496d72581d3b0fe/pkg/clients/applications/client.go#L38

https://github.com/crossplane-contrib/provider-argocd/blob/5423b0899eb4e296ba9f995cb496d72581d3b0fe/pkg/clients/repositories/client.go#L34

https://github.com/crossplane-contrib/provider-argocd/blob/5423b0899eb4e296ba9f995cb496d72581d3b0fe/pkg/clients/applicationsets/client.go#L28
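For illustration, the general idea would be to keep the io.Closer that Argo CD's apiclient hands back next to each service client and close it once the call is done. A minimal, self-contained sketch; the import paths and names are assumptions based on Argo CD v2's apiclient package, not the provider's actual code:

package clients

import (
	"context"
	"fmt"
	"io"

	argocdclient "github.com/argoproj/argo-cd/v2/pkg/apiclient"
	applicationpkg "github.com/argoproj/argo-cd/v2/pkg/apiclient/application"
)

// listApplications demonstrates closing the per-call service client instead of
// keeping the underlying connection alive for the lifetime of the process.
func listApplications(ctx context.Context, opts *argocdclient.ClientOptions) error {
	client, err := argocdclient.NewClient(opts)
	if err != nil {
		return err
	}

	// NewApplicationClient returns an io.Closer alongside the gRPC service client.
	closer, appClient, err := client.NewApplicationClient()
	if err != nil {
		return err
	}
	// Closing here is what the linked constructors currently never do.
	defer func(c io.Closer) { _ = c.Close() }(closer)

	apps, err := appClient.List(ctx, &applicationpkg.ApplicationQuery{})
	if err != nil {
		return err
	}
	fmt.Printf("found %d applications\n", len(apps.Items))
	return nil
}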

maximilianbraun commented 7 months ago

Adding some additional evidence to what @v0lkc says.

I added profiling to v0.5.0 and v0.6.0, and the picture is quite similar: after around 20 minutes of running the controller with a sync interval of 5s and 5 MRs, the size of the HTTP client's buffers had more than doubled.

[Image: pprof heap profile]
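For anyone who wants to reproduce the measurement: one way to collect such a heap profile from a Go controller is the standard net/http/pprof endpoint. A minimal sketch, not necessarily how the profiling above was wired in:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port; a heap profile can then be
	// pulled with: go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the controller manager would be started here instead of blocking ...
	select {}
}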

What @v0lkc also sent me on Slack: the .Close() in Argo CD's HTTP client is only called on error. https://github.com/argoproj/argo-cd/blob/0c8bc1d61e8c9501c5aaabb2aafecc20aa43e1bb/util/http/http.go#L147-L151

In the DumpResponse() example it is handled differently; the contrast looks roughly like the sketch below.
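A generic illustration of the two patterns (not the Argo CD code itself): closing the body only on error leaks the response buffers on the happy path, whereas the usual pattern closes it unconditionally via defer.

package httpexample

import (
	"io"
	"net/http"
)

// leakyGet mirrors the pattern being questioned: the body is only closed when
// an error occurs, so successful responses keep their connection buffers alive.
func leakyGet(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		resp.Body.Close() // only closed on the error path
		return nil, err
	}
	return body, nil // happy path: body is never closed
}

// safeGet closes the body unconditionally, as in the usual net/http examples.
func safeGet(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}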

Maybe we should open an issue over there and close our clients?

MisterMX commented 7 months ago

connector.Disconnect() might be useful for that, roughly along these lines:
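A rough sketch of the idea, with illustrative names rather than the provider's actual types; it assumes the external client keeps the io.Closer handed back by Argo CD's apiclient, and how it hooks into crossplane-runtime depends on the runtime version in use:

package applications

import (
	"context"
	"io"
)

// external is an illustrative stand-in for the provider's external client; the
// real type would also hold the kube client and the Argo CD service client.
type external struct {
	closer io.Closer // closer returned when the Argo CD service client was created
}

// Disconnect releases the underlying connection once reconciliation is done,
// so connections no longer pile up for the lifetime of the pod.
func (e *external) Disconnect(ctx context.Context) error {
	if e.closer == nil {
		return nil
	}
	return e.closer.Close()
}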

If somebody has time, feel free to open a PR and assign me as reviewer.

cychiang commented 7 months ago

Hi @MisterMX, I sent a PR for this fix. Please have a look and see if it really solves the issue. 😃