crossplane-contrib / provider-upjet-gcp

GCP Provider for Crossplane.
https://marketplace.upbound.io/providers/upbound/provider-family-gcp/
Apache License 2.0

[Bug]: Significant memory leakage of pubsub provider #489

Closed: momilo closed this issue 2 months ago

momilo commented 7 months ago

Is there an existing issue for this?

Affected Resource(s)

xpkg.upbound.io/upbound/provider-gcp-pubsub

Resource MRs required to reproduce the bug

k describe provider provider-gcp-pubsub

Name:         provider-gcp-pubsub
Namespace:
Labels:       tanka.dev/environment=6b159b0c45c19449c60d292b163b9adf688215e421d80049
Annotations:  argocd.argoproj.io/tracking-id: crossplane:pkg.crossplane.io/Provider:crossplane-system/provider-gcp-pubsub
API Version:  pkg.crossplane.io/v1
Kind:         Provider
Metadata:
  Creation Timestamp:  2024-01-19T05:28:36Z
  Generation:          3
  Resource Version:    1067618417
  UID:                 416280af-dc87-4f6e-823a-d221dab8a23f
Spec:
  Ignore Crossplane Constraints:  false
  Package:                        xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.1
  Package Pull Policy:            IfNotPresent
  Revision Activation Policy:     Automatic
  Revision History Limit:         1
  Runtime Config Ref:
    API Version:               pkg.crossplane.io/v1beta1
    Kind:                      DeploymentRuntimeConfig
    Name:                      gcp
  Skip Dependency Resolution:  false
Status:
  Conditions:
    Last Transition Time:  2024-03-17T13:31:03Z
    Reason:                HealthyPackageRevision
    Status:                True
    Type:                  Healthy
    Last Transition Time:  2024-01-19T05:28:37Z
    Reason:                ActivePackageRevision
    Status:                True
    Type:                  Installed
  Current Identifier:      xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.1
  Current Revision:        provider-gcp-pubsub-13c80f1bb55f
Events:
  Type     Reason                  Age                   From                                 Message
  ----     ------                  ----                  ----                                 -------
  Warning  InstallPackageRevision  3m6s (x12 over 2d2h)  packages/provider.pkg.crossplane.io  current package revision is unhealthy
  Normal   InstallPackageRevision  3m5s (x5 over 2d2h)   packages/provider.pkg.crossplane.io  Successfully installed package revision

k get pod provider-gcp-pubsub-4f8a71eab319-85688d99c-t5pwq -o=yaml -n=crossplane-system

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: b8544cbfc5e54585bf0f56b3df51d476111ddb8f5af86953bc4833e1f4d24515
    cni.projectcalico.org/podIP: <cut>
    cni.projectcalico.org/podIPs: <cut>
  creationTimestamp: "2024-03-17T13:31:01Z"
  generateName: provider-gcp-pubsub-13c80f1bb55f-69c84587fd-
  labels:
    pkg.crossplane.io/provider: provider-gcp-pubsub
    pkg.crossplane.io/revision: provider-gcp-pubsub-13c80f1bb55f
    pod-template-hash: 69c84587fd
  name: provider-gcp-pubsub-13c80f1bb55f-69c84587fd-kjgtt
  namespace: crossplane-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: provider-gcp-pubsub-13c80f1bb55f-69c84587fd
    uid: c0e8043d-c6ed-432b-b7f4-5388846bcbf5
  resourceVersion: "1067618409"
  uid: 65981173-c9bf-4262-9dfa-d5d623b294e4
spec:
  containers:
  - env:
    - name: TLS_CLIENT_CERTS_DIR
      value: /tls/client
    - name: TLS_SERVER_CERTS_DIR
      value: /tls/server
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: ESS_TLS_CERTS_DIR
      value: $(TLS_CLIENT_CERTS_DIR)
    - name: WEBHOOK_TLS_CERT_DIR
      value: $(TLS_SERVER_CERTS_DIR)
    image: xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.1
    imagePullPolicy: IfNotPresent
    name: package-runtime
    ports:
    - containerPort: 8080
      name: metrics
      protocol: TCP
    - containerPort: 9443
      name: webhook
      protocol: TCP
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      runAsGroup: 2000
      runAsNonRoot: true
      runAsUser: 2000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tls/client
      name: tls-client-certs
      readOnly: true
    - mountPath: /tls/server
      name: tls-server-certs
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-jznd7
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-main-main-fb640199-9qpe
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: gke.io/optimize-utilization-scheduler
  securityContext:
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 2000
  serviceAccount: provider-gcp
  serviceAccountName: provider-gcp
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: tls-client-certs
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      - key: ca.crt
        path: ca.crt
      secretName: provider-gcp-pubsub-tls-client
  - name: tls-server-certs
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      - key: ca.crt
        path: ca.crt
      secretName: provider-gcp-pubsub-tls-server
  - name: kube-api-access-jznd7
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-17T13:31:01Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-17T13:31:03Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-17T13:31:03Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-17T13:31:01Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://c1c1b45e943b2bbfe3c1171edaa9abc45c3297d3e90c8c274b45d240b1d60a1d
    image: xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.1
    imageID: xpkg.upbound.io/upbound/provider-gcp-pubsub@sha256:13c80f1bb55fea8ed1bf6a51e6efbb68094463bcd4ec55451e8b83b4184dc4d0
    lastState: {}
    name: package-runtime
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-03-17T13:31:02Z"
  hostIP: <cut>
  phase: Running
  podIP: <cut>
  podIPs:
  - ip: <cut>
  qosClass: BestEffort
  startTime: "2024-03-17T13:31:01Z"

Steps to Reproduce

  1. Deploy vanilla Crossplane via its Helm chart.
  2. Deploy the pubsub provider with c. 100 topics (topics.pubsub.gcp.upbound.io) and c. 100 subscriptions (subscriptions.pubsub.gcp.upbound.io), plus the related IAM (c. 100 topiciammembers.pubsub.gcp.upbound.io and c. 100 subscriptioniammembers.pubsub.gcp.upbound.io); a representative manifest pair is sketched after this list.
  3. Wait and observe the memory usage of the pubsub provider pod increase.
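
For reference, the managed resources in step 2 look roughly like the pair below, repeated c. 100 times with different names. This is an illustrative sketch only: the names, role, and member are placeholders, and the reference field assumes the provider's generated cross-resource reference convention.

apiVersion: pubsub.gcp.upbound.io/v1beta1
kind: Topic
metadata:
  name: example-topic-001            # placeholder name
spec:
  forProvider: {}
  providerConfigRef:
    name: default
---
apiVersion: pubsub.gcp.upbound.io/v1beta1
kind: TopicIAMMember
metadata:
  name: example-topic-001-publisher  # placeholder name
spec:
  forProvider:
    role: roles/pubsub.publisher     # placeholder role
    member: serviceAccount:app@example-project.iam.gserviceaccount.com  # placeholder
    topicRef:
      name: example-topic-001        # assumed generated cross-resource reference
  providerConfigRef:
    name: default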

What happened?

The pubsub provider pod's memory usage grows to c. 20 GB over the course of a day, at which point it gets OOM-killed.

This happens with both provider v1.0.1 + Crossplane 1.15.1 and provider v1.0.0 + Crossplane 1.15.

Shelling into the pod and running top confirms that all the memory is used by the provider process (note: the screenshots below were not taken at the same time, hence the different memory usage reported).

[Screenshots: top output inside the pod, taken at different times, showing the provider process consuming nearly all reported memory]
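
The growth can also be tracked from outside the pod via metrics-server (present by default on GKE); the label selector below matches the pod labels shown earlier:

# Sample the provider pod's memory usage once a minute
while true; do
  kubectl top pod -n crossplane-system -l pkg.crossplane.io/provider=provider-gcp-pubsub
  sleep 60
done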

Note that:

  1. the other deployments in crossplane-system (crossplane, crossplane-rbac-manager, and provider-gcp-cloudplatform) behave fine under this configuration.
  2. this occurs with no changes to the underlying resources (no new topics/subscriptions being created, etc.).
  3. given the above, the fact that memory usage increases in a very straight line (i.e. the extra MB/min is constant) suggests the leak is in the recurring sync-up/reconciliation flow (a mitigation sketch follows this list).
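
A side observation from the pod spec above: the container has resources: {} and the pod runs as BestEffort, so nothing bounds the growth before the kernel OOM-kills it. As a stopgap, a memory limit (and, speculatively, a longer poll interval) could be set through the DeploymentRuntimeConfig the provider already references. This is a sketch only; the --poll flag is assumed to be supported as in other upjet-based providers, and the values are illustrative.

apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: gcp
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
          - name: package-runtime   # must match the runtime container name in the pod spec above
            args:
            - --poll=10m             # assumed upjet flag: stretches the periodic re-sync
            resources:
              requests:
                memory: 512Mi
              limits:
                memory: 2Gi          # bounds the leak so the pod restarts early and cheaply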

I have not had a chance to recompile the provider with pprof enabled to investigate further.
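
If it were enabled, the flow would presumably look like the sketch below. The pprof endpoint and its port are assumptions; the stock image does not expose one, which is why a rebuild with Go's net/http/pprof handler registered would be needed first.

# Hypothetical: assumes the rebuilt provider serves pprof on localhost:6060 inside the pod
kubectl port-forward -n crossplane-system \
  provider-gcp-pubsub-13c80f1bb55f-69c84587fd-kjgtt 6060:6060

# In another terminal: fetch and inspect a heap profile
go tool pprof http://localhost:6060/debug/pprof/heap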

Relevant Error Output Snippet

Note that the system is otherwise working fine: all topics and subscriptions are shown as ready and in sync, and the Crossplane stack produces no logs indicating any issues.

Crossplane Version

1.15.1

Provider Version

1.0.1

Kubernetes Version

v1.28.5-gke.1217000

Kubernetes Distribution

GKE

Additional Info

Alas, my initial hope that plugging in debug logging (issue 471, already kindly addressed) would also resolve the memory leak did not come true :-(.

github-actions[bot] commented 3 months ago

This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

github-actions[bot] commented 2 months ago

This issue is being closed since there has been no activity for 14 days since marking it as stale. If you still need help, feel free to comment or reopen the issue!

momilo commented 2 months ago

/fresh ?

It's still a problem :-(

momilo commented 2 months ago

@mergenci, sorry for the confusion. Shortly after my follow-up comment, this fix went in and was released.

We have deployed the fix and it does indeed seem to have addressed this issue as well. We will continue monitoring, but, to the best of my current knowledge, it's no longer a problem.