crossplane-contrib / provider-upjet-aws

Official AWS Provider for Crossplane by Upbound.
https://marketplace.upbound.io/providers/upbound/provider-aws
Apache License 2.0

Pod terminated with OOMKilled #801

Open · wdonne opened this issue 1 year ago

wdonne commented 1 year ago

What happened?

The provider pod is often terminated with OOMKilled.

How can we reproduce it?

Run it with the memory limit set to 1Gi.

What environment did it happen in?

Remarks

By default, the resources object in the pod is empty. How much memory does the provider need?

jeanduplessis commented 1 year ago

@wdonne I assume you used the monolithic provider-aws package rather than the family provider ones? If so, then this will be helpful here: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
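
For context, the sizing guide's recommendations are applied by setting resource requests and limits in a ControllerConfig that the Provider object references (Provider.spec.controllerConfigRef, still available in the Crossplane versions used in this thread). The following is only a minimal sketch: the names and the 1Gi/2Gi values are placeholders for illustration, not recommendations from the guide.

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-aws-config        # hypothetical name
spec:
  resources:
    requests:
      memory: 1Gi                  # placeholder; size per the guide
    limits:
      memory: 2Gi                  # placeholder; size per the guide
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: upbound-provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.40.0   # example version
  controllerConfigRef:
    name: provider-aws-config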

I'll close this issue for now, but feel free to reach back out if you still run into issues.

siddharthdeshmukh commented 11 months ago

Hi, is there a sizing guide for the family providers? I am getting OOMKilled for the provider-aws-ec2 provider with the following resources:

resources:
  limits:
    memory: 4Gi
  requests:
    cpu: 500m
    memory: 1Gi
wdonne commented 10 months ago

I have switched to the AWS family providers, but my provider-aws-iam still gets OOMKilled even though I gave it 1.5GB. We can't give every member of the family that much memory; it would require a huge cluster.

gmykhailiuta commented 10 months ago

We have 10 cores and 12GB for aws-ec2 alone, and it's not enough:

observe failed: cannot run refresh: refresh failed: failed to read
        schema for aws_eip.test-nat-gateway in
        registry.terraform.io/hashicorp/aws: failed to instantiate provider
        "registry.terraform.io/hashicorp/aws" to obtain schema: timeout while
        waiting for plugin to start: 

@wdonne, would you mind re-opening the issue?

Update: controllerConfig arguments:

- --debug
- --enable-external-secret-stores
- --provider-ttl=250
- --max-reconcile-rate=10
- --sync=4h
- --poll=1h
- --terraform-native-provider-path
- ""
wdonne commented 10 months ago

@gmykhailiuta I don't have permission to reopen the issue because I didn't close it myself.

@jeanduplessis Would you reopen it? I think there clearly is a memory problem.

turkenf commented 10 months ago

@wdonne, @siddharthdeshmukh, @gmykhailiuta I cannot reproduce the issue in v0.40.0 with the information provided. Does this issue always occur, and which versions are you using? Please also provide the full ControllerConfig.

gmykhailiuta commented 10 months ago

Thank you for taking care of it, @turkenf !

This issue seems to get worse the more resources we migrate to the Upbound providers (ec2, iam, eks, route53). EC2 is currently one of the most heavily used; it manages 238 resources. Could you try with 250+ resources, please?

I've also noticed that most of the load comes in the first few minutes after the provider pod is created, as if all resources were polled at once; after that it evens out. It would be nice to introduce a random delay per resource.
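
Until something like that exists, one possible mitigation using only the flags already shown in this thread is to lower --max-reconcile-rate, so fewer Terraform processes run concurrently right after the pod starts, trading slower convergence for a lower memory peak. The fragment below is illustrative only; the value 3 is not a tuned recommendation.

spec:
  args:
  - --debug
  - --enable-external-secret-stores
  - --max-reconcile-rate=3   # illustrative: fewer concurrent reconciles, lower startup peak
  - --sync=4h
  - --poll=1h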

ControllerConfig spec used by aws-ec2 provider:

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Validate=false
    argocd.argoproj.io/sync-wave: "-1"
    eks.amazonaws.com/role-arn: arn:aws:iam::0123456789:role/crossplane-provider-aws
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"pkg.crossplane.io/v1alpha1","kind":"ControllerConfig","metadata":{"annotations":{"argocd.argoproj.io/sync-wave":"-1","eks.amazonaws.com/role-arn":"arn:aws:iam::0123456789:role/crossplane-provider-aws"},"name":"crossplane-provider-aws-xl"},"spec":{"args":["--debug","--enable-external-secret-stores","--provider-ttl=250","--max-reconcile-rate=10","--terraform-native-provider-path",""],"podSecurityContext":{"fsGroup":2000},"resources":{"limits":{"cpu":"16000m","memory":"12000Mi"},"requests":{"cpu":"14000m","memory":"12000Mi"}}}}
  creationTimestamp: "2023-08-29T17:43:48Z"
  generation: 13
  labels:
    app.kubernetes.io/component: crossplane
    app.kubernetes.io/instance: provider-aws-xl
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: crossplane
    argocd.argoproj.io/instance: infra-crossplane
    helm.sh/chart: crossplane-1.12.2
  name: crossplane-provider-aws-xl
  resourceVersion: "997346283"
  uid: 41119bd7-f1ef-41f2-bb3e-31aff4a896cd
spec:
  args:
  - --debug
  - --enable-external-secret-stores
  - --provider-ttl=250
  - --max-reconcile-rate=10
  - --sync=48h
  - --poll=48h
  - --terraform-native-provider-path
  - ""
  podSecurityContext:
    fsGroup: 2000
  resources:
    limits:
      cpu: 10000m
      memory: 20000Mi
    requests:
      cpu: 8000m
      memory: 18000Mi

Versions in use:

wdonne commented 10 months ago

Hello @turkenf ,

This is the ControllerConfig resource:

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"pkg.crossplane.io/v1alpha1","kind":"ControllerConfig","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"crossplane-providers"},"name":"iam-config"},"spec":{"resources":{"limits":{"cpu":"100m","ephemeral-storage":"512Mi","memory":"1.5Gi"}}}}
  creationTimestamp: "2023-08-30T16:26:43Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: crossplane-providers
  name: iam-config
  resourceVersion: "167225794"
  uid: 4ffe441f-9daa-493c-8f21-23c41eb17cb9
spec:
  resources:
    limits:
      cpu: 100m
      ephemeral-storage: 512Mi
      memory: 1.5Gi

and this is the live manifest of the IAM provider pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-09-08T11:10:12Z"
  generateName: provider-aws-iam-e4667f1b5f01-6c49d6d4f5-
  labels:
    pkg.crossplane.io/provider: provider-aws-iam
    pkg.crossplane.io/revision: provider-aws-iam-e4667f1b5f01
    pod-template-hash: 6c49d6d4f5
  name: provider-aws-iam-e4667f1b5f01-6c49d6d4f5-4j6fr
  namespace: crossplane-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: provider-aws-iam-e4667f1b5f01-6c49d6d4f5
    uid: 618d039d-1249-42c7-9952-2055386a1266
  resourceVersion: "182253284"
  uid: 58aa7444-ca94-4d3a-a689-89d50335ddc6
spec:
  containers:
  - env:
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: WEBHOOK_TLS_CERT_DIR
      value: /webhook/tls
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.39.0
    imagePullPolicy: IfNotPresent
    name: provider-aws-iam
    ports:
    - containerPort: 8080
      name: metrics
      protocol: TCP
    - containerPort: 9443
      name: webhook
      protocol: TCP
    resources:
      limits:
        cpu: 100m
        ephemeral-storage: 512Mi
        memory: 1536Mi
      requests:
        cpu: 100m
        ephemeral-storage: 512Mi
        memory: 1536Mi
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      runAsGroup: 2000
      runAsNonRoot: true
      runAsUser: 2000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /webhook/tls
      name: webhook-tls-secret
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rn4l6
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-192-168-151-92.eu-central-1.compute.internal
  nodeSelector:
    kubernetes.io/arch: arm64
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 2000
  serviceAccount: provider-aws-iam-e4667f1b5f01
  serviceAccountName: provider-aws-iam-e4667f1b5f01
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: webhook-tls-secret
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: webhook-tls-secret
  - name: kube-api-access-rn4l6
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-09-08T11:10:12Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T08:18:32Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T08:18:32Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-09-08T11:10:12Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://42e19daae04d12d7241d354c4976d516bdc50bf893b9ec686fbebc00ce290645
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.39.0
    imageID: xpkg.upbound.io/upbound/provider-aws-iam@sha256:e4667f1b5f015718d4e39d4005dbf002e91fa34b683439da7e2fe232ce688175
    lastState:
      terminated:
        containerID: containerd://725e32a4691d251af681ba5e1efbd5f78e3a7dfdbf42456281ec55b8f632189e
        exitCode: 137
        finishedAt: "2023-09-12T08:15:37Z"
        reason: OOMKilled
        startedAt: "2023-09-12T08:12:49Z"
    name: provider-aws-iam
    ready: true
    restartCount: 334
    started: true
    state:
      running:
        startedAt: "2023-09-12T08:18:31Z"
  hostIP: 192.168.151.92
  phase: Running
  podIP: 192.168.128.135
  podIPs:
  - ip: 192.168.128.135
  qosClass: Guaranteed
  startTime: "2023-09-08T11:10:12Z"

Note also that the provider currently manages only 19 resources.
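
(As a side note on counting: kubectl get managed, which also appears later in this thread, can give a rough count of managed resources when piped through wc; -o name prints one line per resource.)

kubectl get managed -o name | wc -l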

wdonne commented 10 months ago

I now upgraded to version 0.40.0 and the problem still occurs.

turkenf commented 10 months ago

@wdonne in your case, I think the error you are getting is due to your CPU limit. Since a low CPU limit leads to higher memory consumption, your pod is terminated with OOMKilled. My test pod, which manages more than 50 resources, works properly with the following values:

  resources:
    limits:
      cpu: 2
      ephemeral-storage: 512Mi
      memory: 1.5Gi

> k get pods -n crossplane-system provider-aws-iam-d4638f0c5651-7bbfd5cd57-qwdgm -o yaml -w
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-09-12T12:50:58Z"
  generateName: provider-aws-iam-d4638f0c5651-7bbfd5cd57-
  labels:
    pkg.crossplane.io/provider: provider-aws-iam
    pkg.crossplane.io/revision: provider-aws-iam-d4638f0c5651
    pod-template-hash: 7bbfd5cd57
  name: provider-aws-iam-d4638f0c5651-7bbfd5cd57-qwdgm
  namespace: crossplane-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: provider-aws-iam-d4638f0c5651-7bbfd5cd57
    uid: 913342c9-5771-4639-a016-4d1ffa6e3de1
  resourceVersion: "20745"
  uid: 2f2195dd-4617-4918-9770-b4d10ebbd129
spec:
  containers:
  - env:
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: WEBHOOK_TLS_CERT_DIR
      value: /webhook/tls
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.40.0
    imagePullPolicy: IfNotPresent
    name: provider-aws-iam
    ports:
    - containerPort: 8080
      name: metrics
      protocol: TCP
    - containerPort: 9443
      name: webhook
      protocol: TCP
    resources:
      limits:
        cpu: "2"
        ephemeral-storage: 512Mi
        memory: 1536Mi
      requests:
        cpu: "2"
        ephemeral-storage: 512Mi
        memory: 1536Mi
...
..
.
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:58Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:59Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:59Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:58Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b2c900a2c33ba21c578d71e25fe51100e9250683588a4061ff00807579a4f09f
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.40.0
    imageID: xpkg.upbound.io/upbound/provider-aws-iam@sha256:d4638f0c56511b0d3bbcf565da6e8e14b6304168eb252573413182de85f7f2a8
    lastState: {}
    name: provider-aws-iam
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-09-12T12:50:58Z"
  hostIP: 172.18.0.2
  phase: Running
  podIP: 10.244.0.9
  podIPs:
  - ip: 10.244.0.9
  qosClass: Guaranteed
  startTime: "2023-09-12T12:50:58Z"

Can you try again with an increased CPU limit and let us know?
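
To see whether a change to the limits actually helps, the pod's live usage and the reason for the last restart can be checked with standard kubectl commands (kubectl top requires metrics-server in the cluster; the namespace and label below match the pod manifests shown earlier):

# live CPU/memory usage of the provider pods (needs metrics-server)
kubectl top pod -n crossplane-system

# restart count and last termination reason (look for OOMKilled / exit code 137)
kubectl describe pod -n crossplane-system -l pkg.crossplane.io/provider=provider-aws-iam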

turkenf commented 10 months ago

@gmykhailiuta I couldn't test your exact situation, but I think it can be solved by trying different variations of the limits. I also recommend checking here for more detailed information about the tests performed.

Please test with the latest provider version, and if you still think there is a problem, share the results with us in detail.

wdonne commented 10 months ago

@turkenf I first gave it 1 CPU and then 2. In both cases it was OOMKilled after a few seconds, so even faster than before.

turkenf commented 10 months ago

@wdonne could you please share the provider's logs (with debug logging enabled) and the output of kubectl get managed?

wdonne commented 10 months ago

@turkenf How can I set the log level to debug?
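
(In the configs earlier in this thread, debug logging is enabled by passing --debug in the ControllerConfig args; a minimal fragment, assuming the provider picks up the change on its next restart:)

spec:
  args:
  - --debug

# then collect the logs from the provider pod
kubectl logs -n crossplane-system -l pkg.crossplane.io/provider=provider-aws-iam --tail=500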

This is the log I have so far:

[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 1 [running]:
runtime/debug.Stack()
    runtime/debug/stack.go:24 +0x64
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/log.go:59 +0x104
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithValues(0x40000ce780, {0x4001bb1000, 0x2, 0x2})
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/deleg.go:168 +0x3c
github.com/go-logr/logr.Logger.WithValues(...)
    github.com/go-logr/logr@v1.2.4/logr.go:323
sigs.k8s.io/controller-runtime/pkg/builder.(*Builder).doController(0x4001b45b00, {0x4102ae0, 0x40079bff40})
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/builder/controller.go:384 +0x314
sigs.k8s.io/controller-runtime/pkg/builder.(*Builder).Build(0x4001b45b00, {0x4102ae0?, 0x40079bff40?})
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/builder/controller.go:239 +0x40
sigs.k8s.io/controller-runtime/pkg/builder.(*Builder).Complete(...)
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/builder/controller.go:222
github.com/upbound/provider-aws/internal/controller/iam/accesskey.Setup({0x4148a30, 0x40002a6820}, {{{0x4131680, 0x400000c108}, {0x4130ff0, 0x40005ffe10}, 0x8bb2c97000, 0xa, 0x4001bb0aa0, 0x0}, ...})
    github.com/upbound/provider-aws/internal/controller/iam/accesskey/zz_controller.go:58 +0x10b4
github.com/upbound/provider-aws/internal/controller.Setup_iam({0x4148a30, 0x40002a6820}, {{{0x4131680, 0x400000c108}, {0x4130ff0, 0x40005ffe10}, 0x8bb2c97000, 0xa, 0x4001bb0aa0, 0x0}, ...})
    github.com/upbound/provider-aws/internal/controller/zz_iam_setup.go:65 +0x1f8
main.main()
    github.com/upbound/provider-aws/cmd/provider/iam/zz_main.go:147 +0x2a9c
{"level":"info","ts":"2023-09-14T11:58:27Z","logger":"provider-aws","msg":"Native Terraform provider process error","handle":"d8a857988e269bd8df1d6153cb94773d7b22a652b6cca647536414f804da4141","ttl":100,"ttlMargin":0.1,"nativeProviderPath":"/terraform/provider-mirror/registry.terraform.io/hashicorp/aws/4.67.0/linux_arm64/terraform-provider-aws_v4.67.0_x5","nativeProviderArgs":[],"error":"signal: killed"}
Stream closed EOF for crossplane-system/provider-aws-iam-d4638f0c5651-86b94bf94b-9ths5 (provider-aws-iam)

And this is the output of kubectl get managed:

NAME                                                                  READY   SYNCED   EXTERNAL-NAME                               AGE
policy.iam.aws.upbound.io/cert-manager-policy                         True    False    cert-manager-policy                         15d
policy.iam.aws.upbound.io/clb-tst-secrets-manager-policy              True    False    clb-tst-secrets-manager-policy              15d
policy.iam.aws.upbound.io/kaleido-tst-secrets-manager-policy          True    False    kaleido-tst-secrets-manager-policy          15d
policy.iam.aws.upbound.io/lemonade-traefik-oidc-delegate-ecr-policy   True    False    lemonade-traefik-oidc-delegate-ecr-policy   15d
policy.iam.aws.upbound.io/tooling-secrets-manager-policy              True    False    tooling-secrets-manager-policy              15d
policy.iam.aws.upbound.io/value-injector-ecr-policy                   True    False    value-injector-ecr-policy                   15d
policy.iam.aws.upbound.io/zorglink-tst-secrets-manager-policy         True    False    zorglink-tst-secrets-manager-policy         15d

NAME                                                                                    READY   SYNCED   EXTERNAL-NAME                                                        AGE
rolepolicyattachment.iam.aws.upbound.io/cert-manager-attachment                         True    False    cert-manager-role-20230830100923301400000004                         15d
rolepolicyattachment.iam.aws.upbound.io/clb-tst-secrets-manager-attachment              False   False                                                                         15d
rolepolicyattachment.iam.aws.upbound.io/crossplane-ec2-attachment                       True    False    crossplane-ec2-role-20230830100842783600000002                       15d
rolepolicyattachment.iam.aws.upbound.io/crossplane-rds-attachment                       True    False    crossplane-rds-role-20230830100837055200000001                       15d
rolepolicyattachment.iam.aws.upbound.io/kaleido-tst-secrets-manager-attachment          False   False                                                                         15d
rolepolicyattachment.iam.aws.upbound.io/lemonade-traefik-oidc-delegate-ecr-attachment   True    False    lemonade-traefik-oidc-delegate-ecr-role-20230830100909852300000003   15d
rolepolicyattachment.iam.aws.upbound.io/tooling-secrets-manager-attachment              True    False                                                                         15d
rolepolicyattachment.iam.aws.upbound.io/value-injector-ecr-attachment                   True    False    value-injector-ecr-role-20230830101000179800000005                   15d
rolepolicyattachment.iam.aws.upbound.io/zorglink-tst-secrets-manager-attachment         True    False    zorglink-tst-secrets-manager-role-20230913195710729200000001         15d

NAME                                                              READY   SYNCED   EXTERNAL-NAME                             AGE
role.iam.aws.upbound.io/cert-manager-role                         True    False    cert-manager-role                         15d
role.iam.aws.upbound.io/clb-tst-secrets-manager-role              True    False    clb-tst-secrets-manager-role              15d
role.iam.aws.upbound.io/crossplane-ec2-role                       True    False    crossplane-ec2-role                       15d
role.iam.aws.upbound.io/crossplane-rds-role                       True    False    crossplane-rds-role                       15d
role.iam.aws.upbound.io/kaleido-tst-secrets-manager-role          True    False    kaleido-tst-secrets-manager-role          15d
role.iam.aws.upbound.io/lemonade-traefik-oidc-delegate-ecr-role   True    False    lemonade-traefik-oidc-delegate-ecr-role   15d
role.iam.aws.upbound.io/tooling-secrets-manager-role              True    False    tooling-secrets-manager-role              15d
role.iam.aws.upbound.io/value-injector-ecr-role                   True    False    value-injector-ecr-role                   15d
role.iam.aws.upbound.io/zorglink-tst-secrets-manager-role         True    False    zorglink-tst-secrets-manager-role         15d
siddharthdeshmukh commented 10 months ago

Hi, in my case the issue was fixed by increasing the memory limits. Provider version: 0.37.0. Here is the ControllerConfig:

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: upbound-provider-aws-ec2-controller-config
spec:
  replicas: 1
  resources:
    limits:
      memory: 8Gi
    requests:
      cpu: 2000m
      memory: 4Gi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: false
    runAsNonRoot: true
  serviceAccountName: upbound-provider-aws-ec2-sa

But now I am facing another issue where the aws-ec2 provider ends up using all the available CPU on the nodes. I am not sure whether setting a CPU limit would have any negative impact on the provider.

We also ended up using more resources on the nodes when we migrated to the provider family instead of the monolithic provider.

In our case, we are currently using the following providers:

Kampe commented 9 months ago

Seeing the same with the AWS IAM family provider v0.41.0.

github-actions[bot] commented 3 months ago

This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.