crossplane-contrib / provider-upjet-gcp

GCP Provider for Crossplane.
https://marketplace.upbound.io/providers/upbound/provider-family-gcp/
Apache License 2.0
68 stars 75 forks source link

[Bug]: Pubsub provider - debug logging permanently enabled + memory leak #471

Closed momilo closed 8 months ago

momilo commented 8 months ago

Is there an existing issue for this?

Affected Resource(s)

xpkg.upbound.io/upbound/provider-gcp-pubsub

Resource MRs required to reproduce the bug

k describe provider provider-gcp-pubsub

Name:         provider-gcp-pubsub
Namespace:
Labels:       tanka.dev/environment=7566117692fec8bf2e6ad5514d3c6cda70d408be5dfe1353
Annotations:  argocd.argoproj.io/tracking-id: crossplane:pkg.crossplane.io/Provider:crossplane-system/provider-gcp-pubsub
API Version:  pkg.crossplane.io/v1
Kind:         Provider
Metadata:
  Creation Timestamp:  2024-01-15T09:04:58Z
  Generation:          2
  Resource Version:    1107021775
  UID:                 ab65a822-d2ac-4108-938f-30b713bedb1f
Spec:
  Ignore Crossplane Constraints:  false
  Package:                        xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.0
  Package Pull Policy:            IfNotPresent
  Revision Activation Policy:     Automatic
  Revision History Limit:         1
  Runtime Config Ref:
    API Version:               pkg.crossplane.io/v1beta1
    Kind:                      DeploymentRuntimeConfig
    Name:                      gcp
  Skip Dependency Resolution:  false
Status:
  Conditions:
    Last Transition Time:  2024-03-08T09:26:44Z
    Reason:                HealthyPackageRevision
    Status:                True
    Type:                  Healthy
    Last Transition Time:  2024-01-15T09:04:58Z
    Reason:                ActivePackageRevision
    Status:                True
    Type:                  Installed
  Current Identifier:      xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.0
  Current Revision:        provider-gcp-pubsub-4f8a71eab319
Events:                    <none>

k get pod provider-gcp-pubsub-4f8a71eab319-85688d99c-t5pwq -o=yaml -n=crossplane-system

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: a910b9c07bb485a0abe986db3e7e0cd4030b0795588691825cdc4af110bebed4
    cni.projectcalico.org/podIP:  <cut>
    cni.projectcalico.org/podIPs: <cut>
  creationTimestamp: "2024-03-08T08:52:46Z"
  generateName: provider-gcp-pubsub-4f8a71eab319-85688d99c-
  labels:
    pkg.crossplane.io/provider: provider-gcp-pubsub
    pkg.crossplane.io/revision: provider-gcp-pubsub-4f8a71eab319
    pod-template-hash: 85688d99c
  name: provider-gcp-pubsub-4f8a71eab319-85688d99c-t5pwq
  namespace: crossplane-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: provider-gcp-pubsub-4f8a71eab319-85688d99c
    uid: 4d3a25c7-73ac-4d75-9011-d4c6420a0302
  resourceVersion: "1055837254"
  uid: 56e735bf-1031-4294-a733-d31568f54fb3
spec:
  containers:
    - name: TLS_CLIENT_CERTS_DIR
      value: /tls/client
    - name: TLS_SERVER_CERTS_DIR
      value: /tls/server
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: ESS_TLS_CERTS_DIR
      value: $(TLS_CLIENT_CERTS_DIR)
    - name: WEBHOOK_TLS_CERT_DIR
      value: $(TLS_SERVER_CERTS_DIR)
    image: xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.0
    imagePullPolicy: IfNotPresent
    name: package-runtime
    ports:
    - containerPort: 8080
      name: metrics
      protocol: TCP
    - containerPort: 9443
      name: webhook
      protocol: TCP
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      runAsGroup: 2000
      runAsNonRoot: true
      runAsUser: 2000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tls/client
      name: tls-client-certs
      readOnly: true
    - mountPath: /tls/server
      name: tls-server-certs
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-cqtmf
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-main-mycluster-0afadfa0-fafa
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: gke.io/optimize-utilization-scheduler
  securityContext:
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 2000
  serviceAccount: provider-gcp
  serviceAccountName: provider-gcp
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: tls-client-certs
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      - key: ca.crt
        path: ca.crt
      secretName: provider-gcp-pubsub-tls-client
  - name: tls-server-certs
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      - key: ca.crt
        path: ca.crt
      secretName: provider-gcp-pubsub-tls-server
  - name: kube-api-access-cqtmf
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-08T08:52:46Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-08T08:52:48Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-08T08:52:48Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-08T08:52:46Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://9d035f08bbc2ae548573654f526ef950be4f3a8f9ae13b681c1896d241bc84c3
    image: xpkg.upbound.io/upbound/provider-gcp-pubsub:v1.0.0
    imageID: xpkg.upbound.io/upbound/provider-gcp-pubsub@sha256:4f8a71eab319a62585488daa3dfb5e771bed172d1d9007f27538c288fb7dca94
    lastState: {}
    name: package-runtime
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-03-08T08:52:47Z"
  hostIP: 10.17.0.46
  phase: Running
  podIP: <cut>
  podIPs:
  - ip: <cut>
  qosClass: BestEffort
  startTime: "2024-03-08T08:52:46Z"

Steps to Reproduce

  1. Deploy vanilla crossplane via helmchart
  2. Deploy pubsub provider with a few topics configured

What happened?

Observed:

  1. The pubsub provider's pod is generating millions (70mln+ per day) of trace/debug-level logs (sample below), despite no --debug flag or any other similar configuration
  2. The pubsub provider's pod memory usage grows up to c. 20GB throughout the day (then gets OOM-ed)
  3. The usually-expected message [controller-runtime] log.SetLogger(...) was never called; logs will not be displayed. does not appear

Example logs:

2024/03/08 16:45:08 [DEBUG] Waiting for state to become: [success]                                                                   
2024/03/08 16:45:08 [DEBUG] Retry Transport: starting RoundTrip retry loop                                                           
2024/03/08 16:45:08 [DEBUG] Retry Transport: request attempt 0                                                                       
2024/03/08 16:45:09 [DEBUG] Retry Transport: Stopping retries, last request was successful                                           
2024/03/08 16:45:09 [DEBUG] Retry Transport: Returning after 1 attempts                                                              
2024/03/08 16:45:09 [INFO] Authenticating using configured Google JSON 'credentials'...                                              
2024/03/08 16:45:09 [INFO]   -- Scopes: [https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.emai
2024/03/08 16:45:09 [INFO] Authenticating using configured Google JSON 'credentials'...                                              
2024/03/08 16:45:09 [INFO]   -- Scopes: [https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.emai
2024/03/08 16:45:09 [DEBUG] Waiting for state to become: [success]  

The logs seem to be originally coming from terraform-provider-google, which I am guessing the crossplane provider is using underneath.

There is no --debug or TF_ENV set anywhere. I do not have any ControllerConfigs. DeploymentRuntimeConfig does not include anything relevant. In fact, I attempted adding TF_ENV=ERROR or TF_ENV=OFF to turn them off, but with no success.

The bug was successfully replicated by a kind soul from Crossplane Slack's channel (see this thread)

Relevant Error Output Snippet

No response

Crossplane Version

1.15

Provider Version

1.0.0

Kubernetes Version

1.28

Kubernetes Distribution

GKE

Additional Info

No response

ulucinar commented 8 months ago

Hi @momilo, Thanks for reporting this issue. We've prepared and validated (with the Topic.pubsub managed resource) a fix for the logging issue reported here. However, I'm not sure of the memory leak issue.

Once the logging fix is available, could you please recheck if the memory leak issue still exists and maybe open a new bug with more details on the memory issue? Thank you very much.

momilo commented 7 months ago

Thank you very much, that is very helpful and it did indeed help us resolve the issue with logging. Alas, the memory leak issue persists - I've allowed myself to open a separate issue :-(.