wdonne opened 1 year ago
@wdonne I assume you used the monolithic provider-aws package rather than the family provider ones? If so, then this will be helpful here: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
I'll close this issue for now, but feel free to follow up if you still run into issues.
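(For readers applying the sizing guide to the family providers: sizing is set per provider through a ControllerConfig referenced from the Provider package. A minimal sketch — the config name and resource values below are placeholder assumptions, to be replaced with numbers from the guide:)

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-ec2-sizing          # placeholder name
spec:
  resources:
    limits:
      memory: 2Gi               # size this per the guide's managed-resource counts
    requests:
      cpu: "1"
      memory: 1Gi
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-ec2
spec:
  package: xpkg.upbound.io/upbound/provider-aws-ec2:v0.40.0
  controllerConfigRef:
    name: aws-ec2-sizing        # points the provider at the config above
```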
Hi,
Is there a sizing guide for the family providers? I am getting OOMKilled for the provider-aws-ec2 provider with the following resources:

```yaml
resources:
  limits:
    memory: 4Gi
  requests:
    cpu: 500m
    memory: 1Gi
```
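(To confirm the restarts are really memory-limit kills, the terminated reason can be read straight from the pod status — the pod name below is a placeholder:)

```sh
# Prints "OOMKilled" when the kernel killed the container for exceeding its memory limit.
kubectl -n crossplane-system get pod <provider-aws-ec2-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```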
I have switched to the AWS family providers, but my provider-aws-iam gets OOMKilled even though I gave it 1.5GB. We can't give every member of the family that much memory; it would result in a huge cluster.
We have 10 cores and 12GB for aws-ec2 alone and it's not enough:

```
observe failed: cannot run refresh: refresh failed: failed to read
schema for aws_eip.test-nat-gateway in
registry.terraform.io/hashicorp/aws: failed to instantiate provider
"registry.terraform.io/hashicorp/aws" to obtain schema: timeout while
waiting for plugin to start:
```
@wdonne, would you mind re-opening the issue?
Update: ControllerConfig arguments:

```yaml
- --debug
- --enable-external-secret-stores
- --provider-ttl=250
- --max-reconcile-rate=10
- --sync=4h
- --poll=1h
- --terraform-native-provider-path
- ""
```
@gmykhailiuta I don't have the permission to reopen the issue because I didn't close it myself.
@jeanduplessis Would you reopen it? I think there clearly is a memory problem.
@wdonne, @siddharthdeshmukh, @gmykhailiuta I cannot reproduce the issue in v0.40.0 with the information provided. Does this issue always occur, and which versions are you using? Please also provide the full ControllerConfig.
Thank you for taking care of it, @turkenf !
This issue seems to get worse the more resources we migrate to the Upbound providers (ec2, iam, eks, route53). Ec2 is currently one of the most heavily used: it manages 238 resources. Could you try with 250+ resources, please?
I've also noticed that most of the load arrives in the first few minutes after the provider pod's creation, as if all resources were polled at once; after that it evens out. It would be nice to introduce a random delay per resource.
ControllerConfig spec used by the aws-ec2 provider:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Validate=false
    argocd.argoproj.io/sync-wave: "-1"
    eks.amazonaws.com/role-arn: arn:aws:iam::0123456789:role/crossplane-provider-aws
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"pkg.crossplane.io/v1alpha1","kind":"ControllerConfig","metadata":{"annotations":{"argocd.argoproj.io/sync-wave":"-1","eks.amazonaws.com/role-arn":"arn:aws:iam::0123456789:role/crossplane-provider-aws"},"name":"crossplane-provider-aws-xl"},"spec":{"args":["--debug","--enable-external-secret-stores","--provider-ttl=250","--max-reconcile-rate=10","--terraform-native-provider-path",""],"podSecurityContext":{"fsGroup":2000},"resources":{"limits":{"cpu":"16000m","memory":"12000Mi"},"requests":{"cpu":"14000m","memory":"12000Mi"}}}}
  creationTimestamp: "2023-08-29T17:43:48Z"
  generation: 13
  labels:
    app.kubernetes.io/component: crossplane
    app.kubernetes.io/instance: provider-aws-xl
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: crossplane
    argocd.argoproj.io/instance: infra-crossplane
    helm.sh/chart: crossplane-1.12.2
  name: crossplane-provider-aws-xl
  resourceVersion: "997346283"
  uid: 41119bd7-f1ef-41f2-bb3e-31aff4a896cd
spec:
  args:
  - --debug
  - --enable-external-secret-stores
  - --provider-ttl=250
  - --max-reconcile-rate=10
  - --sync=48h
  - --poll=48h
  - --terraform-native-provider-path
  - ""
  podSecurityContext:
    fsGroup: 2000
  resources:
    limits:
      cpu: 10000m
      memory: 20000Mi
    requests:
      cpu: 8000m
      memory: 18000Mi
```
Versions in use:
Hello @turkenf,
This is the ControllerConfig resource:
```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"pkg.crossplane.io/v1alpha1","kind":"ControllerConfig","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"crossplane-providers"},"name":"iam-config"},"spec":{"resources":{"limits":{"cpu":"100m","ephemeral-storage":"512Mi","memory":"1.5Gi"}}}}
  creationTimestamp: "2023-08-30T16:26:43Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: crossplane-providers
  name: iam-config
  resourceVersion: "167225794"
  uid: 4ffe441f-9daa-493c-8f21-23c41eb17cb9
spec:
  resources:
    limits:
      cpu: 100m
      ephemeral-storage: 512Mi
      memory: 1.5Gi
```
and this is the live manifest of the IAM provider pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-09-08T11:10:12Z"
  generateName: provider-aws-iam-e4667f1b5f01-6c49d6d4f5-
  labels:
    pkg.crossplane.io/provider: provider-aws-iam
    pkg.crossplane.io/revision: provider-aws-iam-e4667f1b5f01
    pod-template-hash: 6c49d6d4f5
  name: provider-aws-iam-e4667f1b5f01-6c49d6d4f5-4j6fr
  namespace: crossplane-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: provider-aws-iam-e4667f1b5f01-6c49d6d4f5
    uid: 618d039d-1249-42c7-9952-2055386a1266
  resourceVersion: "182253284"
  uid: 58aa7444-ca94-4d3a-a689-89d50335ddc6
spec:
  containers:
  - env:
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: WEBHOOK_TLS_CERT_DIR
      value: /webhook/tls
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.39.0
    imagePullPolicy: IfNotPresent
    name: provider-aws-iam
    ports:
    - containerPort: 8080
      name: metrics
      protocol: TCP
    - containerPort: 9443
      name: webhook
      protocol: TCP
    resources:
      limits:
        cpu: 100m
        ephemeral-storage: 512Mi
        memory: 1536Mi
      requests:
        cpu: 100m
        ephemeral-storage: 512Mi
        memory: 1536Mi
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      runAsGroup: 2000
      runAsNonRoot: true
      runAsUser: 2000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /webhook/tls
      name: webhook-tls-secret
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rn4l6
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-192-168-151-92.eu-central-1.compute.internal
  nodeSelector:
    kubernetes.io/arch: arm64
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 2000
  serviceAccount: provider-aws-iam-e4667f1b5f01
  serviceAccountName: provider-aws-iam-e4667f1b5f01
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: webhook-tls-secret
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: webhook-tls-secret
  - name: kube-api-access-rn4l6
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-09-08T11:10:12Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T08:18:32Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T08:18:32Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-09-08T11:10:12Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://42e19daae04d12d7241d354c4976d516bdc50bf893b9ec686fbebc00ce290645
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.39.0
    imageID: xpkg.upbound.io/upbound/provider-aws-iam@sha256:e4667f1b5f015718d4e39d4005dbf002e91fa34b683439da7e2fe232ce688175
    lastState:
      terminated:
        containerID: containerd://725e32a4691d251af681ba5e1efbd5f78e3a7dfdbf42456281ec55b8f632189e
        exitCode: 137
        finishedAt: "2023-09-12T08:15:37Z"
        reason: OOMKilled
        startedAt: "2023-09-12T08:12:49Z"
    name: provider-aws-iam
    ready: true
    restartCount: 334
    started: true
    state:
      running:
        startedAt: "2023-09-12T08:18:31Z"
  hostIP: 192.168.151.92
  phase: Running
  podIP: 192.168.128.135
  podIPs:
  - ip: 192.168.128.135
  qosClass: Guaranteed
  startTime: "2023-09-08T11:10:12Z"
```
Note also that the provider currently manages only 19 resources.
I now upgraded to version 0.40.0 and the problem still occurs.
@wdonne in your case, I think the error you are getting is caused by your CPU limit. Since a CPU limit that is too low leads to higher memory consumption, your pod is terminated with OOMKilled. My test pod, which manages more than 50 resources, works properly with the following values:
```yaml
resources:
  limits:
    cpu: 2
    ephemeral-storage: 512Mi
    memory: 1.5Gi
```
```
> k get pods -n crossplane-system provider-aws-iam-d4638f0c5651-7bbfd5cd57-qwdgm -o yaml -w
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-09-12T12:50:58Z"
  generateName: provider-aws-iam-d4638f0c5651-7bbfd5cd57-
  labels:
    pkg.crossplane.io/provider: provider-aws-iam
    pkg.crossplane.io/revision: provider-aws-iam-d4638f0c5651
    pod-template-hash: 7bbfd5cd57
  name: provider-aws-iam-d4638f0c5651-7bbfd5cd57-qwdgm
  namespace: crossplane-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: provider-aws-iam-d4638f0c5651-7bbfd5cd57
    uid: 913342c9-5771-4639-a016-4d1ffa6e3de1
  resourceVersion: "20745"
  uid: 2f2195dd-4617-4918-9770-b4d10ebbd129
spec:
  containers:
  - env:
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: WEBHOOK_TLS_CERT_DIR
      value: /webhook/tls
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.40.0
    imagePullPolicy: IfNotPresent
    name: provider-aws-iam
    ports:
    - containerPort: 8080
      name: metrics
      protocol: TCP
    - containerPort: 9443
      name: webhook
      protocol: TCP
    resources:
      limits:
        cpu: "2"
        ephemeral-storage: 512Mi
        memory: 1536Mi
      requests:
        cpu: "2"
        ephemeral-storage: 512Mi
        memory: 1536Mi
...
..
.
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:58Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:59Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:59Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-09-12T12:50:58Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b2c900a2c33ba21c578d71e25fe51100e9250683588a4061ff00807579a4f09f
    image: xpkg.upbound.io/upbound/provider-aws-iam:v0.40.0
    imageID: xpkg.upbound.io/upbound/provider-aws-iam@sha256:d4638f0c56511b0d3bbcf565da6e8e14b6304168eb252573413182de85f7f2a8
    lastState: {}
    name: provider-aws-iam
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-09-12T12:50:58Z"
  hostIP: 172.18.0.2
  phase: Running
  podIP: 10.244.0.9
  podIPs:
  - ip: 10.244.0.9
  qosClass: Guaranteed
  startTime: "2023-09-12T12:50:58Z"
```
Can you try again by increasing the CPU and let us know?
@gmykhailiuta I couldn't test your situation, but I think it can be solved by trying different variations of the limits. I also recommend checking here for more detailed information about the tests performed.
Please test with the latest provider version and, if you still think there is a problem, share the results with us in detail.
@turkenf I first gave it 1 CPU and then 2. In both cases it was OOMKilled after a few seconds, so much faster.
@wdonne could you please share the provider's log (with debug logs enabled) and the output of kubectl get managed?
@turkenf How can I set the log level to debug?
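(Judging from the ControllerConfigs earlier in this thread, debug logging is enabled by passing --debug to the provider, e.g.:)

```yaml
spec:
  args:
  - --debug   # same flag used in the configs above
```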
This is the log I have so far:
```
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 1 [running]:
runtime/debug.Stack()
    runtime/debug/stack.go:24 +0x64
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/log.go:59 +0x104
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithValues(0x40000ce780, {0x4001bb1000, 0x2, 0x2})
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/deleg.go:168 +0x3c
github.com/go-logr/logr.Logger.WithValues(...)
    github.com/go-logr/logr@v1.2.4/logr.go:323
sigs.k8s.io/controller-runtime/pkg/builder.(*Builder).doController(0x4001b45b00, {0x4102ae0, 0x40079bff40})
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/builder/controller.go:384 +0x314
sigs.k8s.io/controller-runtime/pkg/builder.(*Builder).Build(0x4001b45b00, {0x4102ae0?, 0x40079bff40?})
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/builder/controller.go:239 +0x40
sigs.k8s.io/controller-runtime/pkg/builder.(*Builder).Complete(...)
    sigs.k8s.io/controller-runtime@v0.15.0/pkg/builder/controller.go:222
github.com/upbound/provider-aws/internal/controller/iam/accesskey.Setup({0x4148a30, 0x40002a6820}, {{{0x4131680, 0x400000c108}, {0x4130ff0, 0x40005ffe10}, 0x8bb2c97000, 0xa, 0x4001bb0aa0, 0x0}, ...})
    github.com/upbound/provider-aws/internal/controller/iam/accesskey/zz_controller.go:58 +0x10b4
github.com/upbound/provider-aws/internal/controller.Setup_iam({0x4148a30, 0x40002a6820}, {{{0x4131680, 0x400000c108}, {0x4130ff0, 0x40005ffe10}, 0x8bb2c97000, 0xa, 0x4001bb0aa0, 0x0}, ...})
    github.com/upbound/provider-aws/internal/controller/zz_iam_setup.go:65 +0x1f8
main.main()
    github.com/upbound/provider-aws/cmd/provider/iam/zz_main.go:147 +0x2a9c
{"level":"info","ts":"2023-09-14T11:58:27Z","logger":"provider-aws","msg":"Native Terraform provider process error","handle":"d8a857988e269bd8df1d6153cb94773d7b22a652b6cca647536414f804da4141","ttl":100,"ttlMargin":0.1,"nativeProviderPath":"/terraform/provider-mirror/registry.terraform.io/hashicorp/aws/4.67.0/linux_arm64/terraform-provider-aws_v4.67.0_x5","nativeProviderArgs":[],"error":"signal: killed"}
Stream closed EOF for crossplane-system/provider-aws-iam-d4638f0c5651-86b94bf94b-9ths5 (provider-aws-iam)
```
And this is the output of kubectl get managed:
```
NAME READY SYNCED EXTERNAL-NAME AGE
policy.iam.aws.upbound.io/cert-manager-policy True False cert-manager-policy 15d
policy.iam.aws.upbound.io/clb-tst-secrets-manager-policy True False clb-tst-secrets-manager-policy 15d
policy.iam.aws.upbound.io/kaleido-tst-secrets-manager-policy True False kaleido-tst-secrets-manager-policy 15d
policy.iam.aws.upbound.io/lemonade-traefik-oidc-delegate-ecr-policy True False lemonade-traefik-oidc-delegate-ecr-policy 15d
policy.iam.aws.upbound.io/tooling-secrets-manager-policy True False tooling-secrets-manager-policy 15d
policy.iam.aws.upbound.io/value-injector-ecr-policy True False value-injector-ecr-policy 15d
policy.iam.aws.upbound.io/zorglink-tst-secrets-manager-policy True False zorglink-tst-secrets-manager-policy 15d

NAME READY SYNCED EXTERNAL-NAME AGE
rolepolicyattachment.iam.aws.upbound.io/cert-manager-attachment True False cert-manager-role-20230830100923301400000004 15d
rolepolicyattachment.iam.aws.upbound.io/clb-tst-secrets-manager-attachment False False 15d
rolepolicyattachment.iam.aws.upbound.io/crossplane-ec2-attachment True False crossplane-ec2-role-20230830100842783600000002 15d
rolepolicyattachment.iam.aws.upbound.io/crossplane-rds-attachment True False crossplane-rds-role-20230830100837055200000001 15d
rolepolicyattachment.iam.aws.upbound.io/kaleido-tst-secrets-manager-attachment False False 15d
rolepolicyattachment.iam.aws.upbound.io/lemonade-traefik-oidc-delegate-ecr-attachment True False lemonade-traefik-oidc-delegate-ecr-role-20230830100909852300000003 15d
rolepolicyattachment.iam.aws.upbound.io/tooling-secrets-manager-attachment True False 15d
rolepolicyattachment.iam.aws.upbound.io/value-injector-ecr-attachment True False value-injector-ecr-role-20230830101000179800000005 15d
rolepolicyattachment.iam.aws.upbound.io/zorglink-tst-secrets-manager-attachment True False zorglink-tst-secrets-manager-role-20230913195710729200000001 15d

NAME READY SYNCED EXTERNAL-NAME AGE
role.iam.aws.upbound.io/cert-manager-role True False cert-manager-role 15d
role.iam.aws.upbound.io/clb-tst-secrets-manager-role True False clb-tst-secrets-manager-role 15d
role.iam.aws.upbound.io/crossplane-ec2-role True False crossplane-ec2-role 15d
role.iam.aws.upbound.io/crossplane-rds-role True False crossplane-rds-role 15d
role.iam.aws.upbound.io/kaleido-tst-secrets-manager-role True False kaleido-tst-secrets-manager-role 15d
role.iam.aws.upbound.io/lemonade-traefik-oidc-delegate-ecr-role True False lemonade-traefik-oidc-delegate-ecr-role 15d
role.iam.aws.upbound.io/tooling-secrets-manager-role True False tooling-secrets-manager-role 15d
role.iam.aws.upbound.io/value-injector-ecr-role True False value-injector-ecr-role 15d
role.iam.aws.upbound.io/zorglink-tst-secrets-manager-role True False zorglink-tst-secrets-manager-role 15d
```
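(Most of these resources show SYNCED: False; the condition message explaining the failed reconcile can be read with kubectl describe on any one of them, for example:)

```sh
# The Synced condition and recent Events carry the underlying error.
kubectl describe policy.iam.aws.upbound.io/cert-manager-policy
```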
Hi,
In my case the issue was fixed by increasing the memory limits.
Provider version: 0.37.0
Here is the ControllerConfig:
```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: upbound-provider-aws-ec2-controller-config
spec:
  replicas: 1
  resources:
    limits:
      memory: 8Gi
    requests:
      cpu: 2000m
      memory: 4Gi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    readOnlyRootFilesystem: false
    runAsNonRoot: true
  serviceAccountName: upbound-provider-aws-ec2-sa
```
But now I am facing another issue where the aws-ec2 provider ends up using all the available CPU on the node. I am not sure whether setting a CPU limit would have any negative impact on the provider.
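(For what it's worth, a CPU limit is enforced by throttling rather than killing: the container slows down when it hits the cap, so reconciles take longer, but the pod is not restarted the way exceeding a memory limit restarts it. An illustrative cap — these values are assumptions, not a recommendation:)

```yaml
resources:
  limits:
    cpu: "4"      # throttles the provider when exceeded
    memory: 8Gi   # exceeding this is what triggers OOMKilled
  requests:
    cpu: 2000m
    memory: 4Gi
```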
We also ended up using more resources on the nodes after migrating to the provider family instead of the monolithic provider. In our case we are currently using the following providers:
Seeing the same with the AWS IAM family provider, v0.41.0.
This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days, it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.
What happened?
The provider pod is often terminated with OOMKilled.
How can we reproduce it?
Run it with the memory limit set to 1Gi.
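(A minimal ControllerConfig matching this reproduction — the name is a placeholder:)

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: repro-oomkill
spec:
  resources:
    limits:
      memory: 1Gi   # the limit under which the provider gets OOMKilled
```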
What environment did it happen in?
Remarks
By default, the resources object of the provider pod is empty. How much memory does the provider need?
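(One way to answer this empirically, assuming metrics-server is installed in the cluster:)

```sh
# Live CPU/memory per provider pod; worth watching right after pod start,
# when the load spike reported above occurs.
kubectl top pod -n crossplane-system
```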