Closed grvn closed 1 year ago
@grvn: Thank you for submitting this issue!
The issue is currently awaiting triage. Please make sure you have given us as much context as possible.
If the maintainers determine this is a relevant issue, they will remove the needs-triage label and assign an appropriate priority label.
We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.
@grvn, please help schedule a Zoom call on 3rd April 2023 so that I can take a look at your environment. I work in the IST timezone.
@rensyct Apologies for the late reply; I've been ill for a couple of days and haven't been able to work. I work in the CEST timezone, which is 3.5 hours behind IST. Because of Easter I only work tomorrow afternoon (from 16:30 IST); I don't know if that's too late for you? Do you also have national holidays on April 7th and 10th?
Hi @grvn Apologies for the delay in response. I was on leave from 6th April. Let us plan to have a zoom call either today or tomorrow.
Tomorrow (11 April) works great. I'm available from 10:00 CEST and for the following 6 hours.
Thank you @grvn for your response. Please let me know if 3:00 PM IST works fine for you today.
Hi @rensyct 3:00 pm IST today works fine for me.
Hi @grvn Please send the meeting link to rensy.thomas@dell.com
Hi @rensyct, I've sent an invite via Skype for Business; I hope that works for you. I'm not allowed to use Zoom as long as the company that owns it is registered outside the EU. The one exception is when a support agent from a company we buy products from invites me to a support call.
Thank you @grvn for scheduling the call today and for sending across the requested logs and manifests. I went through the manifests and identified some changes that should be made. Please let me know if we can have a call at 3:00 PM IST tomorrow to go over them.
Hi @rensyct, I am unfortunately unavailable tomorrow at that time. Maybe 13:00 IST on Thursday (13 April)?
Hi @grvn, please let me know if any other time works for you today. Does 3:00 PM IST work for you tomorrow? 13:00 IST will not work for me.
Hi @grvn Please help by collecting the logs below after reverting the memory-related changes in the operator:

```shell
kubectl logs <operatorPod> -n <operatorNamespace> -c manager --previous
kubectl logs <operatorPod> -n <operatorNamespace> -c manager
```
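Before pulling the logs, it can also help to confirm that it really was the `manager` container that was OOM-killed. A minimal sketch; the `check_oomkilled` helper and the jsonpath query are illustrative assumptions (not from this thread), and the pod/namespace names are placeholders:

```shell
# Hypothetical helper: takes "name:reason" pairs (one per line) for a pod's
# containers and reports whether the manager container was OOM-killed.
check_oomkilled() {
  if echo "$1" | grep -q '^manager:OOMKilled$'; then
    echo "manager was OOMKilled"
  else
    echo "manager not OOMKilled"
  fi
}

# Against a live cluster, the pairs can be produced with (placeholders as in
# the log commands above):
#   kubectl get pod <operatorPod> -n <operatorNamespace> \
#     -o jsonpath='{range .status.containerStatuses[*]}{.name}:{.lastState.terminated.reason}{"\n"}{end}'

check_oomkilled 'manager:OOMKilled'
```

The final call just exercises the helper with sample input; on a real cluster you would feed it the jsonpath output instead.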
Hi @grvn, could you please provide an update?
@rensyct Please close this defect if the customer has not responded, and work with the Support process to handle any further customer queries.
@rensyct - sorry for the delay; I must have done something wrong, because GitHub didn't ping me about your questions. I've attached all the logs for 1, 2, 3 in an email to you.
Do you see any restarts of the driver controller pods?
No, the driver controller remains unaffected by the OOMKilled operator; it has been running for 20 days, even after I reset the operator config so that the operator started getting OOMKilled again.
@prablr79 - thank you for writing; GitHub did notify me when you commented.
Thank you @grvn for providing the requested info. Will go through them and get back to you
Hi @grvn As discussed over the call on June 2nd, the csm-operator v1.2.0 images will be released by the last week of June, and the same will be available in OperatorHub by the second week of July. In this release there are a few fixes for csi-powerscale in the csm-operator repo, and these should fix the issue seen in your environment. Please confirm whether we can proceed to close this ticket, since a workaround is currently available.
Hi @rensyct Yes, you can close this issue for now. We will continue to use our workaround and open a new case if this issue still exists after upgrading to the new CSM operator version.
If anybody has the same issue: our current workaround is to explicitly give the operator more memory by modifying the Subscription:
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: dell-csm-operator-subscription
spec:
  config:
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 400Mi
...
...
```
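For anyone who prefers not to edit the Subscription by hand, the same resource settings could presumably be applied with a merge patch. A sketch, not an official Dell procedure; the subscription name matches the manifest in this thread, and the namespace is a placeholder:

```shell
# The merge patch carries only the fields we want to set on the Subscription.
PATCH='
spec:
  config:
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 400Mi'

# Against a live cluster (uncomment and fill in the namespace):
# kubectl patch subscription dell-csm-operator-subscription \
#   -n <operatorNamespace> --type merge -p "$PATCH"

# Sanity check: the patch carries both memory settings and the cpu request.
echo "$PATCH" | grep -c 'memory:'
```

OLM propagates `spec.config.resources` from the Subscription down to the operator deployment, which is why patching the Subscription (rather than the deployment itself) makes the change stick across operator upgrades.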
Thank you @grvn
Bug Description
This is an issue that I found during my tests while working on https://github.com/dell/csm/issues/728.
Installing the Dell Container Storage Modules Operator, creating a ContainerStorageModule for PowerScale v2.5, and activating Observability v1.4.0 makes the "manager" container in the "dell-csm-operator-controller-manager" pod use more memory than its limit, which triggers the OOMKiller.
It seems the Dell CSM Operator creates a "dell-csm-operator-controller-manager" pod that specifies no resources for the "kube-rbac-proxy" container but specifies `resources.requests` and `resources.limits` for the "manager" container.
This happens with the default `spec.containers.resources` for the "manager" container. During my tests in the other issue I increased `resources.limits.memory` to `2Gi` and used OpenShift's internal monitoring to check how much memory the "dell-csm-operator-controller-manager" pod used. It seems to peak at around 370-400 MB during setup, which is more than the originally specified `resources.limits.memory` and which triggers the OOMKiller.

Logs
```text
2023-03-31T11:59:24.278Z DEBUG workspace/main.go:79 Operator Version {"TraceId": "main", "Version": "1.0.0", "Commit ID": "1005033e5e1ef9c8631827372fc3bde061cbbc4d", "Commit SHA": "Mon, 05 Dec 2022 19:46:52 UTC"}
2023-03-31T11:59:24.278Z DEBUG workspace/main.go:80 Go Version: go1.19.3 {"TraceId": "main"}
2023-03-31T11:59:24.278Z DEBUG workspace/main.go:81 Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0331 11:59:25.330067       1 request.go:665] Waited for 1.039049587s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/autoscaling.openshift.io/v1
2023-03-31T11:59:29.981Z INFO workspace/main.go:93 Openshift environment {"TraceId": "main"}
2023-03-31T11:59:29.983Z INFO workspace/main.go:132 Current kubernetes version is 1.24 which is a supported version {"TraceId": "main"}
2023-03-31T11:59:29.983Z INFO workspace/main.go:143 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
I0331 11:59:35.334311       1 request.go:665] Waited for 5.344706867s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
1.680263975689668e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.680263975690508e+09 INFO setup starting manager
1.6802639756915953e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.6802639756916835e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0331 11:59:35.691718       1 leaderelection.go:248] attempting to acquire leader lease dell-csm-operators/090cae6a.dell.com...
I0331 11:59:51.846488       1 leaderelection.go:258] successfully acquired lease dell-csm-operators/090cae6a.dell.com
1.680263991846571e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"c247589b-9931-4258-849b-30ed51006236","apiVersion":"v1","resourceVersion":"1088071981"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-4zwjj_2e46b8fd-f5aa-4f11-ac68-0b9c212c8377 became leader"}
1.6802639918466828e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"a39a7677-05e0-47ca-8d6b-5179f2fab1a5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1088071982"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-4zwjj_2e46b8fd-f5aa-4f11-ac68-0b9c212c8377 became leader"}
1.680263991846654e+09 INFO controller.containerstoragemodule Starting EventSource {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
1.6802639918467033e+09 INFO controller.containerstoragemodule Starting Controller {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
1.680263991947392e+09 INFO controller.containerstoragemodule Starting workers {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
2023-03-31T11:59:51.947Z INFO controllers/csm_controller.go:199 ################Starting Reconcile############## {"TraceId": "isilon-1"}
2023-03-31T11:59:51.947Z INFO controllers/csm_controller.go:202 reconcile for {"TraceId": "isilon-1", "Namespace": "dell-storage-powerscale", "Name": "isilon", "Attempt": 1}
2023-03-31T11:59:51.947Z DEBUG drivers/powerscale.go:79 preCheck {"TraceId": "isilon-1", "skipCertValid": false, "certCount": 1, "secrets": 1}
2023-03-31T11:59:56.448Z INFO controllers/csm_controller.go:1110 proceeding with modification of driver install {"TraceId": "isilon-1"}
2023-03-31T11:59:56.453Z INFO controllers/csm_controller.go:1047 Owner reference is found and matches {"TraceId": "isilon-1"}
2023-03-31T11:59:56.453Z INFO modules/observability.go:195 performed pre checks for: observability {"TraceId": "isilon-1"}
2023-03-31T11:59:56.453Z INFO utils/status.go:63 deployment status for cluster: default-source-cluster {"TraceId": "isilon-1"}
2023-03-31T11:59:56.654Z INFO utils/status.go:79 driver type: isilon {"TraceId": "isilon-1"}
```
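To see why the observed ~370-400 MB peak collides with the container's memory limit while the 1Gi workaround does not, a small sketch converting Kubernetes-style quantities to bytes; the `to_bytes` helper is a hand-rolled assumption, not part of kubectl or the operator:

```shell
# Hypothetical helper: convert a Kubernetes memory quantity (Mi/Gi) to bytes
# using shell integer arithmetic.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

peak=$(to_bytes 400Mi)            # upper end of the observed setup peak
workaround_limit=$(to_bytes 1Gi)  # limit set via the Subscription workaround

if [ "$workaround_limit" -gt "$peak" ]; then
  echo "1Gi limit leaves headroom above the observed peak"
fi
```

This prints the headroom message, since 1Gi (1073741824 bytes) comfortably exceeds a 400Mi (419430400 bytes) peak; any default limit below the peak would instead trip the OOMKiller.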
Screenshots
No response
Additional Environment Information
No response
Steps to Reproduce
karavi
dell-csm-operators
dell-storage-powerscale
Expected Behavior
CSI Driver for PowerScale initialized and running with Observability activated, without the `dell-csm-operator-controller-manager` pod's `manager` container getting OOMKilled.

CSM Driver(s)
CSI Driver for PowerScale v2.5.0
Installation Type
https://catalog.redhat.com/software/containers/dellemc/dell-csm-operator/63a2b0f2d571d0d25998cc55
Container Storage Modules Enabled
Observability v1.4.0
Container Orchestrator
OpenShift 4.11
Operating System
CoreOS for OpenShift 4.11