dell / csm

Dell Container Storage Modules (CSM)
Apache License 2.0

[BUG]: Dell CSM Operator gets OOMKilled in OpenShift when observability module is enabled #738

Closed grvn closed 1 year ago

grvn commented 1 year ago

Bug Description

This is an issue that I found during my tests while working on https://github.com/dell/csm/issues/728.

Installing the Dell Container Storage Modules Operator, creating a ContainerStorageModule for PowerScale v2.5, and enabling Observability v1.4.0 makes the "manager" container in the "dell-csm-operator-controller-manager" pod use more memory than its limit, which triggers the OOMKiller.

It seems the Dell CSM Operator creates a "dell-csm-operator-controller-manager" pod that specifies no resources for the "kube-rbac-proxy" container but sets resources.requests and resources.limits for the "manager" container.

The default spec.containers.resources for the "manager" container:

    - resources:
        limits:
          cpu: 200m
          memory: 256Mi
        requests:
          cpu: 100m
          memory: 192Mi
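
For reference, one way to confirm the per-container resource settings on the operator deployment is a command along these lines (a sketch, assuming the default deployment name and the dell-csm-operators namespace used in the steps below):

    # List each container in the operator deployment together with its resources
    oc get deployment dell-csm-operator-controller-manager -n dell-csm-operators \
      -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'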

During my tests in the other issue I increased resources.limits.memory to 2Gi and used OpenShift's internal monitoring to check how much memory the "dell-csm-operator-controller-manager" pod used. It seems to peak at around 370-400MB during setup, which is more than the default `resources.limits.memory` of 256Mi and therefore triggers the OOMKiller.
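
One way to watch the actual memory usage is sketched below (assuming metrics are available, e.g. via the OpenShift monitoring stack):

    # Per-container memory usage of the operator pod
    kubectl top pod -n dell-csm-operators --containers

    # Or, in the OpenShift console / Prometheus, a query along these lines:
    # container_memory_working_set_bytes{namespace="dell-csm-operators", container="manager"}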

Logs

    2023-03-31T11:59:24.278Z DEBUG workspace/main.go:79 Operator Version {"TraceId": "main", "Version": "1.0.0", "Commit ID": "1005033e5e1ef9c8631827372fc3bde061cbbc4d", "Commit SHA": "Mon, 05 Dec 2022 19:46:52 UTC"}
    2023-03-31T11:59:24.278Z DEBUG workspace/main.go:80 Go Version: go1.19.3 {"TraceId": "main"}
    2023-03-31T11:59:24.278Z DEBUG workspace/main.go:81 Go OS/Arch: linux/amd64 {"TraceId": "main"}
    I0331 11:59:25.330067 1 request.go:665] Waited for 1.039049587s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/autoscaling.openshift.io/v1
    2023-03-31T11:59:29.981Z INFO workspace/main.go:93 Openshift environment {"TraceId": "main"}
    2023-03-31T11:59:29.983Z INFO workspace/main.go:132 Current kubernetes version is 1.24 which is a supported version {"TraceId": "main"}
    2023-03-31T11:59:29.983Z INFO workspace/main.go:143 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
    I0331 11:59:35.334311 1 request.go:665] Waited for 5.344706867s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
    1.680263975689668e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
    1.680263975690508e+09 INFO setup starting manager
    1.6802639756915953e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
    1.6802639756916835e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
    I0331 11:59:35.691718 1 leaderelection.go:248] attempting to acquire leader lease dell-csm-operators/090cae6a.dell.com...
    I0331 11:59:51.846488 1 leaderelection.go:258] successfully acquired lease dell-csm-operators/090cae6a.dell.com
    1.680263991846571e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"c247589b-9931-4258-849b-30ed51006236","apiVersion":"v1","resourceVersion":"1088071981"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-4zwjj_2e46b8fd-f5aa-4f11-ac68-0b9c212c8377 became leader"}
    1.6802639918466828e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"a39a7677-05e0-47ca-8d6b-5179f2fab1a5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1088071982"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-4zwjj_2e46b8fd-f5aa-4f11-ac68-0b9c212c8377 became leader"}
    1.680263991846654e+09 INFO controller.containerstoragemodule Starting EventSource {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
    1.6802639918467033e+09 INFO controller.containerstoragemodule Starting Controller {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
    1.680263991947392e+09 INFO controller.containerstoragemodule Starting workers {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
    2023-03-31T11:59:51.947Z INFO controllers/csm_controller.go:199 ################Starting Reconcile############## {"TraceId": "isilon-1"}
    2023-03-31T11:59:51.947Z INFO controllers/csm_controller.go:202 reconcile for {"TraceId": "isilon-1", "Namespace": "dell-storage-powerscale", "Name": "isilon", "Attempt": 1}
    2023-03-31T11:59:51.947Z DEBUG drivers/powerscale.go:79 preCheck {"TraceId": "isilon-1", "skipCertValid": false, "certCount": 1, "secrets": 1}
    2023-03-31T11:59:56.448Z INFO controllers/csm_controller.go:1110 proceeding with modification of driver install {"TraceId": "isilon-1"}
    2023-03-31T11:59:56.453Z INFO controllers/csm_controller.go:1047 Owner reference is found and matches {"TraceId": "isilon-1"}
    2023-03-31T11:59:56.453Z INFO modules/observability.go:195 performed pre checks for: observability {"TraceId": "isilon-1"}
    2023-03-31T11:59:56.453Z INFO utils/status.go:63 deployment status for cluster: default-source-cluster {"TraceId": "isilon-1"}
    2023-03-31T11:59:56.654Z INFO utils/status.go:79 driver type: isilon {"TraceId": "isilon-1"}

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

  1. Create namespace karavi
  2. Create namespace dell-csm-operators
  3. Create namespace dell-storage-powerscale
  4. Create Subscription
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: dell-csm-operator-subscription
      namespace: dell-csm-operators
      annotations:
        argocd.argoproj.io/sync-wave: "-1"
    spec:
      channel: stable
      installPlanApproval: Manual
      name: dell-csm-operator-certified
      source: certified-operator-catalog
      sourceNamespace: openshift-marketplace
  5. Manually approve the pending install plan to complete the Operator installation (see https://docs.openshift.com/container-platform/4.11/operators/admin/olm-upgrading-operators.html#olm-approving-pending-upgrade_olm-upgrading-operators)
  6. Create a ContainerStorageModule for PowerScale with Observability enabled, using https://github.com/dell/csm-operator/blob/3c456c45581d6b70d1ea4f54976974a361123bb1/samples/storage_csm_powerscale_v250.yaml as a template (a command sketch for these steps follows the list)
    apiVersion: storage.dell.com/v1
    kind: ContainerStorageModule
    metadata:
      name: isilon
      namespace: dell-storage-powerscale
    spec:
      driver:
        common:
          image: "dellemc/csi-isilon:v2.5.0"
          imagePullPolicy: IfNotPresent
          envs:
            - name: X_CSI_VERBOSE
              value: "1"
            - name: X_CSI_ISI_PORT
              value: "8080"
            - name: X_CSI_ISI_NO_PROBE_ON_START
              value: "false"
            - name: X_CSI_ISI_AUTOPROBE
              value: "true"
            - name: X_CSI_ISI_AUTH_TYPE
              value: "0"
            - name: X_CSI_CUSTOM_TOPOLOGY_ENABLED
              value: "false"
            - name: X_CSI_MAX_PATH_LIMIT
              value: "192"
            - name: X_CSI_DEBUG
              value: "false"
            - name: "CERT_SECRET_COUNT"
              value: "1"
            - name: KUBELET_CONFIG_DIR
              value: /var/lib/kubelet
        configVersion: v2.5.0
        controller:
          envs:
            - name: X_CSI_ISI_QUOTA_ENABLED
              value: "false"
            - name: X_CSI_ISI_ACCESS_ZONE
              value: "zone"
            - name: X_CSI_HEALTH_MONITOR_ENABLED
              value: "false"
          nodeSelector:
            node-role.kubernetes.io/worker: ''
        dnsPolicy: ClusterFirst
        csiDriverSpec:
          fSGroupPolicy: ReadWriteOnceWithFSType
        csiDriverType: "isilon"
        forceRemoveDriver: true
        node:
          nodeSelector:
            node-role.kubernetes.io/worker: ''
          envs:
            - name: X_CSI_ISILON_NFS_V3
              value: "false"
            - name: X_CSI_MAX_VOLUMES_PER_NODE
              value: "0"
            - name: X_CSI_HEALTH_MONITOR_ENABLED
              value: "false"
        replicas: 2
        sideCars:
          - name: common
            args:
              - '--leader-election-lease-duration=15s'
              - '--leader-election-renew-deadline=10s'
              - '--leader-election-retry-period=5s'
          - name: provisioner
            args:
              - '--volume-name-prefix=csipscale'
          - name: external-health-monitor
            enabled: false
            args: ["--monitor-interval=60s"]
      modules:
        - name: observability
          enabled: true
          configVersion: v1.4.0
          components:
            - name: topology
              enabled: true
              image: dellemc/csm-topology:v1.4.0
              envs:
                - name: "TOPOLOGY_LOG_LEVEL"
                  value: "INFO"
            - name: otel-collector
              enabled: true
              image: otel/opentelemetry-collector:0.42.0
            - name: metrics-powerscale
              enabled: true
              image: dellemc/csm-metrics-powerscale:v1.1.0
              envs:
                - name: "POWERSCALE_MAX_CONCURRENT_QUERIES"
                  value: "10"
                - name: "POWERSCALE_CAPACITY_METRICS_ENABLED"
                  value: "true"
                - name: "POWERSCALE_PERFORMANCE_METRICS_ENABLED"
                  value: "true"
                - name: "POWERSCALE_CLUSTER_CAPACITY_POLL_FREQUENCY"
                  value: "30"
                - name: "POWERSCALE_CLUSTER_PERFORMANCE_POLL_FREQUENCY"
                  value: "20"
                - name: "POWERSCALE_QUOTA_CAPACITY_POLL_FREQUENCY"
                  value: "30"
                - name: "ISICLIENT_INSECURE"
                  value: "true"
                - name: "ISICLIENT_AUTH_TYPE"
                  value: "1"
                - name: "ISICLIENT_VERBOSE"
                  value: "0"
                - name: "POWERSCALE_LOG_LEVEL"
                  value: "INFO"
                - name: "POWERSCALE_LOG_FORMAT"
                  value: "TEXT"
                - name: "COLLECTOR_ADDRESS"
                  value: "otel-collector:55680"

Expected Behavior

CSI Driver for PowerScale initialized and running with Observability enabled, without the "manager" container in the dell-csm-operator-controller-manager pod getting OOMKilled.

CSM Driver(s)

CSI Driver for PowerScale v2.5.0

Installation Type

https://catalog.redhat.com/software/containers/dellemc/dell-csm-operator/63a2b0f2d571d0d25998cc55

Container Storage Modules Enabled

Observability v1.4.0

Container Orchestrator

OpenShift 4.11

Operating System

CoreOS for OpenShift 4.11

csmbot commented 1 year ago

@grvn: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and assign an appropriate priority label.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

rensyct commented 1 year ago

@grvn, please help schedule a Zoom call on 3rd April 2023 so that I can take a look at your environment. I work in the IST timezone.

grvn commented 1 year ago

@rensyct Apologies for the late reply; I've been ill for a couple of days and haven't been able to work. I work in the CEST timezone, 3.5 hours behind IST. Because of Easter I only work tomorrow afternoon (from 16:30 IST); I don't know if that's too late for you? Do you also have national holidays on April 7th and 10th?

rensyct commented 1 year ago

Hi @grvn Apologies for the delay in response. I was on leave from 6th April. Let us plan to have a zoom call either today or tomorrow.

grvn commented 1 year ago

Tomorrow (11 April) works great. I'm available from 10:00 CEST and for the following 6 hours.

rensyct commented 1 year ago

Thank you @grvn for your response. Please let me know if 3:00 PM IST works fine for you today.

grvn commented 1 year ago

Hi @rensyct 3:00 pm IST today works fine for me.

rensyct commented 1 year ago

Hi @grvn Please send the meeting link to rensy.thomas@dell.com

grvn commented 1 year ago

Hi @rensyct, I've sent an invite via Skype for Business; I hope that works for you. I'm not allowed to use Zoom as long as the company that owns it is registered outside of the EU. The one exception is if a support agent from a company we buy products from invites me to a support call.

rensyct commented 1 year ago

Thank you @grvn for scheduling the call today and for sending across the requested logs and manifests. I went through the manifests and identified some changes that should be made. Please let me know if we can have a call at 3:00 PM IST tomorrow to review them.

grvn commented 1 year ago

Hi @rensyct, I am unfortunately unavailable tomorrow at that time. Maybe 13:00 IST on Thursday (13 April)?

rensyct commented 1 year ago

Hi @grvn, please let me know if any other time works for you today. For tomorrow, does 3:00 PM IST work for you? 13:00 IST will not work for me.

rensyct commented 1 year ago

Hi @grvn, please provide the logs below after reverting the memory-related changes in the operator:

  1. Previous logs of the operator (i.e. logs from before the operator crash): `kubectl logs <operatorPod> -n <operatorNamespace> -c manager --previous`
  2. Current logs of the operator: `kubectl logs <operatorPod> -n <operatorNamespace> -c manager`
  3. Previous logs of the driver controller pods.
  4. Current logs of the driver controller pods if they exist.
  5. Do you see any restarts of the driver controller pods?

rensyct commented 1 year ago

Hi @grvn, please help provide an update.

prablr79 commented 1 year ago

@rensyct please close this defect if the customer has not responded. Please work through the support process to handle any further customer queries.

grvn commented 1 year ago

@rensyct - sorry for the delay, I must have done something wrong because GitHub didn't ping me about your questions. I've attached all the logs for items 1, 2, and 3 in an email to you.

Do you see any restarts of the driver controller pods?

No, the driver controller remains unaffected by the OOMKilled operator; it has been running for 20 days, even after I reverted the operator config so that the operator started getting OOMKilled again.

@prablr79 - thank you for writing; GitHub did ping me that someone had written when you posted.

rensyct commented 1 year ago

Thank you @grvn for providing the requested info. I will go through it and get back to you.

rensyct commented 1 year ago

Hi @grvn As discussed on the call on June 2nd, the csm-operator v1.2.0 images will be released by the last week of June and will be available in OperatorHub by the second week of July. This release includes a few fixes for csi-powerscale in the csm-operator repo that should resolve the issue seen in your environment. Please confirm whether we can proceed to close this ticket, since a workaround is currently available.

grvn commented 1 year ago

Hi @rensyct Yes, you can close this issue for now. We will continue to use our workaround and open a new case if this issue still exists after upgrading to the new CSM operator version.

If anybody has the same issue, our current workaround is to explicitly give the operator more memory by modifying the Subscription:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: dell-csm-operator-subscription
spec:
  config:
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 400Mi
...
...
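
One way to apply this change to an existing Subscription is sketched below (assuming the Subscription name and namespace from the reproduction steps; OLM should then roll out the operator deployment with the new resource settings):

    oc patch subscription.operators.coreos.com dell-csm-operator-subscription -n dell-csm-operators \
      --type merge \
      -p '{"spec":{"config":{"resources":{"limits":{"memory":"1Gi"},"requests":{"cpu":"10m","memory":"400Mi"}}}}}'
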
rensyct commented 1 year ago

Thank you @grvn