[BUG]: Update resources limits for controller-manager to fix OOMKilled error

cassanellicarlo commented 1 year ago

How can the Team help you today?

Details: ?

I'm using dell-csm-operator-certified.v1.2.0 operator on OpenShift 4.12. I installed it successfully, but the controller-manager is getting OOM-killed because it's consuming more memory than the limit set.

The default limit for the container is set to 256Mi. How can one increase it in the ContainerStorageModule resource?

jooseppi-luna commented 1 year ago

Hi Carlo, thanks for the question. Are you getting OOM-killed before you do anything with the operator or are you getting killed while trying to do a bunch of stuff with it? Do you have any relevant logs?

jooseppi-luna commented 1 year ago

Looks like I am able to add additional memory by editing line 921 of the deploy/operator.yaml file. After editing that line to 512Mi and reinstalling, I get the following when I describe the controller-manager pod (snipped for readability):

[root@master-1-095zyzFtPRfV5 csm-operator]# k describe pod -n dell-csm-operator   dell-csm-operator-controller-manager-6bd6569b56-bqbs5
...
Containers:
  manager:
    Container ID:  containerd://17f0b8031735e468fdb066ae31e119b174a5ab567a4d7d69aa386714b4701f62
    Image:         docker.io/dellemc/dell-csm-operator:v1.2.0
    Image ID:      docker.io/dellemc/dell-csm-operator@sha256:814895bdff2f49c0f9a7789490e6316688f85e3cab2c0a6215fa0f68034c5f32
    Port:          <none>
    Host Port:     <none>
    Command:
      /manager
    Args:
      --leader-elect
    State:          Running
      Started:      Wed, 13 Sep 2023 09:40:25 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  512Mi
    Requests:
      cpu:        100m
      memory:     192Mi
...

jooseppi-luna commented 1 year ago

If you could provide us with details of anything else that you might have installed on the system, as well as what all the operator has done leading up to the OOM kill, that would be super helpful! Thanks.

cassanellicarlo commented 1 year ago

Starting logs of controller manager:

2023-09-13T12:35:32.398Z DEBUG workspace/main.go:79 Operator Version {"TraceId": "main", "Version": "1.2.0", "Commit ID": "081702a4c6969af8038a31eaf072b13554323f51", "Commit SHA": "Fri, 23 Jun 2023 07:46:51 UTC"}
2023-09-13T12:35:32.398Z DEBUG workspace/main.go:80 Go Version: go1.20.5 {"TraceId": "main"}
2023-09-13T12:35:32.398Z DEBUG workspace/main.go:81 Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0913 12:35:33.500640 1 request.go:665] Waited for 1.01097461s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/logging.openshift.io/v1
2023-09-13T12:35:39.751Z INFO workspace/main.go:93 Openshift environment {"TraceId": "main"}
2023-09-13T12:35:39.753Z INFO workspace/main.go:132 Current kubernetes version is 1.25 which is a supported version {"TraceId": "main"}
2023-09-13T12:35:39.754Z INFO workspace/main.go:143 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
I0913 12:35:43.505285 1 request.go:665] Waited for 3.743544237s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps.gitlab.com/v1beta2?timeout=32s
1.6946085471103349e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.6946085471115081e+09 INFO setup starting manager
1.694608547111709e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.694608547111713e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}

Previous last logs from restarted container

[BRUH] toleration t: {Key:node-role.kubernetes.io/infra Operator:Exists Value: Effect:NoSchedule TolerationSeconds:<nil>}
2023-09-13T12:35:05.760Z DEBUG drivers/commonconfig.go:40 GetController {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/controller.yaml"}
2023-09-13T12:35:05.760Z ERROR zap@v1.21.0/sugar.go:173 Ignored key without a value. {"TraceId": "<omitted>-unity-1", "ignored": {"driver":{"csiDriverType":"unity","csiDriverSpec":{"fSGroupPolicy":"ReadWriteOnceWithFSType"},"configVersion":"v2.7.0","replicas":2,"dnsPolicy":"ClusterFirstWithHostNet","common":{"image":"dellemc/csi-unity:v2.7.0","imagePullPolicy":"IfNotPresent","envs":[{"name":"X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS","value":"false"},{"name":"X_CSI_EPHEMERAL_STAGING_PATH","value":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"},{"name":"X_CSI_ISCSI_CHROOT","value":"/noderoot"},{"name":"X_CSI_UNITY_SYNC_NODEINFO_INTERVAL","value":"15"},{"name":"KUBELET_CONFIG_DIR","value":"/var/lib/kubelet"},{"name":"CSI_LOG_LEVEL","value":"info"},{"name":"CERT_SECRET_COUNT","value":"1"},{"name":"X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION","value":"true"}]},"controller":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSc...
2023-09-13T12:35:05.760Z DEBUG drivers/commonconfig.go:51 DriverSpec {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.764Z DEBUG drivers/commonconfig.go:72 Adding toleration {"TraceId": "<omitted>-unity-1", "t": {"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}}
2023-09-13T12:35:05.764Z INFO drivers/commonconfig.go:111 Container to be removed {"TraceId": "<omitted>-unity-1", "name": "external-health-monitor"}
2023-09-13T12:35:05.764Z INFO controllers/csm_controller.go:530 Checking if standalone modules need clean up {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.775Z INFO controllers/csm_controller.go:723 Starting SYNC for default-source-cluster cluster {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.976Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-node"}
2023-09-13T12:35:05.976Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-controller"}
2023-09-13T12:35:06.077Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.111Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.242Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.275Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.407Z INFO csidriver/csidriver.go:41 CSIDriver Object exist {"TraceId": "<omitted>-unity-1", "Name:": "csi-unity.dellemc.com"}

The only error I'm seeing is "ERROR zap@v1.21.0/sugar.go:173 Ignored key without a value.", but I don't know whether that is related.

I'm installing the Operator via OLM Subscription. I'm not using the operator.yaml

Metrics of the controller manager

cassanellicarlo commented 1 year ago

I manually changed the limits in the operator yaml from the OpenShift console

              - resources:
                  limits:
                    cpu: 200m
                    memory: 500Mi
                  requests:
                    cpu: 100m
                    memory: 200Mi

and now the controller manager seems to work fine without restarting. But that's not a good way to set it.
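
For OLM-based installs like the one described here, one option that may be worth trying is the Subscription's `spec.config.resources` field, which OLM provides for overriding the resource requirements of the operator's containers. The fragment below is a minimal sketch, not a confirmed fix from the maintainers: the Subscription name, namespace, channel, and catalog source are assumptions and should be replaced with the values from the existing Subscription, and the resource values simply mirror the manual edit above.

```yaml
# Sketch only: metadata, channel, and source values are assumed, not taken from this thread.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: dell-csm-operator-certified   # assumed Subscription name
  namespace: dell-csm                 # assumed; use the namespace the operator is installed in
spec:
  name: dell-csm-operator-certified   # package name, inferred from the CSV name reported above
  channel: stable                     # assumed channel
  source: certified-operators         # assumed catalog source
  sourceNamespace: openshift-marketplace
  config:
    # OLM applies these values to the containers of the operator Deployment it manages.
    resources:
      limits:
        cpu: 200m
        memory: 500Mi
      requests:
        cpu: 100m
        memory: 200Mi
```

Because the override lives in the Subscription rather than in the generated Deployment, it should survive operator upgrades. It applies to every container in the operator's Deployment, not only to the manager container.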

jooseppi-luna commented 1 year ago

Ok that's good, I'm glad it's at least not getting killed right now. I agree that that's not a good long-term solution, we will work on a better fix and keep this issue updated.

bharathsreekanth commented 1 year ago

@jooseppi-luna can you confirm if this is same as https://github.com/dell/csm/issues/184?

jooseppi-luna commented 1 year ago

@bharathsreekanth it's related but not the same: https://github.com/dell/csm/issues/184 is for adding resource limits to the helm charts. These resource limits already exist in the operator and are what we are adjusting here to make the deployment work. See here for where we set them in the operator.

jooseppi-luna commented 1 year ago

@cassanellicarlo I spoke with @rensyct and it would help us to have these three things from you to figure this out:

1) Details on everything you installed/attempted to install with operator before it got killed.
2) Attach the sample files you used to install any drivers/modules you are installing (e.g., I can see you are installing csi-unity, can you attach your edited sample file for us to review and test on our end).
3) Attach complete operator logs from your controller-manager, you can get them like this: kubectl logs dell-csm-operator-controller-manager-xxxxxxxxxx-xxxxx -n dell-csm-operator > operator-logs.txt (fill in your pod name and namespace).

cassanellicarlo commented 1 year ago

Operator: dell-csm-operator-certified.v1.2.0

ContainerStorageModule

`apiVersion: storage.dell.com/v1
kind: ContainerStorageModule
metadata:
  name: 
  namespace: {{ .Values.namespace }}
spec:
  driver:
    csiDriverType: "unity"
    csiDriverSpec:
      # fsGroupPolicy: Defines if the underlying volume supports changing ownership and permission of the volume before being mounted.
      # Allowed values: ReadWriteOnceWithFSType, File , None
      # Default value: ReadWriteOnceWithFSType
      fSGroupPolicy: "ReadWriteOnceWithFSType"
    # Config version for CSI Unity v2.7.0 driver
    configVersion: {{ .Values.driver.release }}
    # Controller count
    replicas: 2
    dnsPolicy: ClusterFirstWithHostNet
    forceUpdate: false
    forceRemoveDriver: true
    common:
      # Image for CSI Unity driver v2.7.0
      image: "dellemc/csi-unity:{{ .Values.driver.release }}"
      imagePullPolicy: IfNotPresent
      envs:
          # X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS - Flag to enable sharing of volumes across multiple pods within the same node in RWO access mode.
          # Allowed values: boolean
          # Default value: "false"
          # Examples : "true" , "false"
        - name: X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS
          value: "false"
        - name: X_CSI_EPHEMERAL_STAGING_PATH
          value: "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"
        # X_CSI_ISCSI_CHROOT is the path to which the driver will chroot before
        # running any iscsi commands. This value should only be set when instructed
        # by technical support
        - name: X_CSI_ISCSI_CHROOT
          value: "/noderoot"
        # X_CSI_UNITY_SYNC_NODEINFO_INTERVAL - Time interval to add node info to array. Default 15 minutes. Minimum value should be 1.
        # Allowed values: integer
        # Default value: 15
        # Examples : 0 , 2
        - name: X_CSI_UNITY_SYNC_NODEINFO_INTERVAL
          value: "15"
        # Specify kubelet config dir path.
        # Ensure that the config.yaml file is present at this path.
        # Default value: None
        - name: KUBELET_CONFIG_DIR
          value: /var/lib/kubelet
        # CSI_LOG_LEVEL is used to set the logging level of the driver.
        # Allowed values: "error", "warn"/"warning", "info", "debug"
        # Default value: "info"
        - name: CSI_LOG_LEVEL
          value: {{ .Values.logLevel }}
        # TENANT_NAME - Tenant name that need to added while adding host entry to the array.
        # Allowed values: string
        # Default value: ""
        # Examples : "tenant2" , "tenant3"
        - name: TENANT_NAME
          value: ""
        # CERT_SECRET_COUNT: Represents number of certificate secrets, which user is going to create for
        # ssl authentication. (unity-cert-0..unity-cert-n)
        # This field is only verified if X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION is set to false
        # Allowed values: n, where n > 0
        # Default value: None          
        - name: CERT_SECRET_COUNT
          value: "1"
        # X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION: Specifies if the driver is going to validate unisphere certs while connecting to the Unisphere REST API interface.
        # If it is set to false, then a secret unity-certs has to be created with an X.509 certificate of CA which signed the Unisphere certificate
        # Allowed values:
        #   true: skip Unisphere API server's certificate verification
        #   false: verify Unisphere API server's certificates 
        # Default value: true   
        - name: X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION
          value: "true"

    sideCars:
      # health monitor is disabled by default, refer to driver documentation before enabling it
      - name: external-health-monitor
        enabled: false
        args: ["--monitor-interval=60s"]
    controller:
      envs:
        # X_CSI_HEALTH_MONITOR_ENABLED: Enable/Disable health monitor of CSI volumes from Controller plugin - volume condition.
        # Install the 'external-health-monitor' sidecar accordingly.
        # Allowed values:
        #   true: enable checking of health condition of CSI volumes
        #   false: disable checking of health condition of CSI volumes
        # Default value: false
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"
      #nodeSelector:
      # Uncomment if nodes you wish to use have the node-role.kubernetes.io/control-plane taint
      #  node-role.kubernetes.io/control-plane: ""

      # tolerations: Define tolerations for the controllers, if required.
      # Leave as blank to install controller on worker nodes
      # Default value: None
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
    node:
      envs:
        # X_CSI_HEALTH_MONITOR_ENABLED: Enable/Disable health monitor of CSI volumes from node plugin - volume usage
        # Allowed values:
        #   true: enable checking of health condition of CSI volumes
        #   false: disable checking of health condition of CSI volumes
        # Default value: false
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"

      # nodeSelector: Define node selection constraints for node pods.
      # For the pod to be eligible to run on a node, the node must have each
      # of the indicated key-value pairs as labels.
      # Leave as blank to consider all nodes
      # Allowed values: map of key-value pairs
      # Default value: None
      #nodeSelector:
      # Uncomment if nodes you wish to use have the node-role.kubernetes.io/control-plane taint
      #  node-role.kubernetes.io/control-plane: ""

      # tolerations: Define tolerations for the controllers, if required.
      # Leave as blank to install controller on worker nodes
      # Default value: None
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists`

Logs

2023-09-13T12:24:00.532Z    DEBUG   workspace/main.go:79    Operator Version    {"TraceId": "main", "Version": "1.2.0", "Commit ID": "081702a4c6969af8038a31eaf072b13554323f51", "Commit SHA": "Fri, 23 Jun 2023 07:46:51 UTC"}
2023-09-13T12:24:00.532Z    DEBUG   workspace/main.go:80    Go Version: go1.20.5    {"TraceId": "main"}
2023-09-13T12:24:00.532Z    DEBUG   workspace/main.go:81    Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0913 12:24:01.656484       1 request.go:665] Waited for 1.042073415s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/ingress.operator.openshift.io/v1
2023-09-13T12:24:07.908Z    INFO    workspace/main.go:93    Openshift environment   {"TraceId": "main"}
2023-09-13T12:24:07.911Z    INFO    workspace/main.go:132   Current kubernetes version is 1.25 which is a supported version     {"TraceId": "main"}
2023-09-13T12:24:07.911Z    INFO    workspace/main.go:143   Use ConfigDirectory /etc/config/dell-csm-operator   {"TraceId": "main"}
I0913 12:24:11.663191       1 request.go:665] Waited for 3.742098704s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v1alpha1?timeout=32s
1.6946078552691364e+09  INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
1.6946078552714586e+09  INFO    setup   starting manager
1.6946078552724323e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0913 12:24:15.272451       1 leaderelection.go:248] attempting to acquire leader lease dell-csm/090cae6a.dell.com...
1.6946078552724354e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0913 12:24:34.016487       1 leaderelection.go:258] successfully acquired lease dell-csm/090cae6a.dell.com
1.6946078740166562e+09  INFO    controller.containerstoragemodule   Starting EventSource    {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
1.694607874016701e+09   INFO    controller.containerstoragemodule   Starting Controller {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
1.6946078740167027e+09  DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"dell-csm","name":"090cae6a.dell.com","uid":"80857b55-a5bd-405f-91f6-9e50580ecc85","apiVersion":"v1","resourceVersion":"1282335978"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-7b8dc694fd-9vh5n_de44f76c-dbdc-4a1a-8624-e1420fff6861 became leader"}
1.6946078740168204e+09  DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"dell-csm","name":"090cae6a.dell.com","uid":"3ae38669-e585-44cf-9ae1-7a7cc849e250","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1282335979"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-7b8dc694fd-9vh5n_de44f76c-dbdc-4a1a-8624-e1420fff6861 became leader"}
1.6946078741176162e+09  INFO    controller.containerstoragemodule   Starting workers    {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
2023-09-13T12:24:34.117Z    INFO    controllers/csm_controller.go:203   ################Starting Reconcile##############    {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:34.117Z    INFO    controllers/csm_controller.go:206   reconcile for   {"TraceId": "<omitted>-unity-1", "Namespace": "dell-csm", "Name": "<omitted>-unity", "Attempt": 1}
2023-09-13T12:24:34.117Z    DEBUG   drivers/unity.go:88 preCheck    {"TraceId": "<omitted>-unity-1", "secrets": 1, "certCount": 1, "Namespace": "dell-csm"}
2023-09-13T12:24:35.918Z    INFO    controllers/csm_controller.go:1202  proceeding with modification of driver install  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:35.923Z    INFO    controllers/csm_controller.go:1130  Owner reference is found and matches    {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:35.923Z    INFO    utils/status.go:156 
daemonset status for cluster: default-source-cluster    {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.226Z    INFO    utils/status.go:181 daemonset pod <omitted>-unity-node-dzj26 : Running  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.226Z    INFO    utils/status.go:181 daemonset pod <omitted>-unity-node-7lr7z : Running  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:181 daemonset pod <omitted>-unity-node-rz2xj : Running  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:181 daemonset pod <omitted>-unity-node-p2gr5 : Running  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:181 daemonset pod <omitted>-unity-node-wxk4t : Running  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:181 daemonset pod <omitted>-unity-node-gzs7r : Running  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:181 daemonset pod <omitted>-unity-node-srh72 : Running  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:204 daemonset status available pods 7   {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:205 daemonset status failedCount pods 0 {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:206 daemonset status desired pods 7 {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:239 deployment controllerReplicas [2]   {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:240 deployment controllerStatus.Available [2]   {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:242 daemonset expected [7]  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:243 daemonset nodeStatus.Available [7]  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:249 calculate overall state [Succeeded] {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:277 Driver State    {"TraceId": "<omitted>-unity-1", "Controller": {"available":"2","desired":"2","failed":"0"}, "Node": {"available":"7","desired":"7","failed":"0"}}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:365 HandleSuccess Driver state  {"TraceId": "<omitted>-unity-1", "newStatus.State": "Running"}
2023-09-13T12:24:36.227Z    INFO    utils/status.go:369 HandleSuccess Driver state didn't change from Running   {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    INFO    controllers/csm_controller.go:887   Getting unity CSI Driver for Dell Technologies  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z    DEBUG   drivers/commonconfig.go:333 GetConfigMap    {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/driver-config-params.yaml"}
2023-09-13T12:24:36.227Z    DEBUG   drivers/commonconfig.go:368 GetCSIDriver    {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/csidriver.yaml"}
2023-09-13T12:24:36.228Z    DEBUG   drivers/commonconfig.go:390 GetCSIDriver    {"TraceId": "<omitted>-unity-1", "fsGroupPolicy": "ReadWriteOnceWithFSType"}
2023-09-13T12:24:36.228Z    DEBUG   drivers/commonconfig.go:176 GetNode {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/node.yaml"}
[BRUH] toleration t: {Key:node-role.kubernetes.io/infra Operator:Exists Value: Effect:NoSchedule TolerationSeconds:<nil>}
2023-09-13T12:24:36.232Z    DEBUG   drivers/commonconfig.go:40  GetController   {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/controller.yaml"}
2023-09-13T12:24:36.302Z    ERROR   zap@v1.21.0/sugar.go:173    Ignored key without a value.    {"TraceId": "<omitted>-unity-1", "ignored": {"driver":{"csiDriverType":"unity","csiDriverSpec":{"fSGroupPolicy":"ReadWriteOnceWithFSType"},"configVersion":"v2.7.0","replicas":2,"dnsPolicy":"ClusterFirstWithHostNet","common":{"image":"dellemc/csi-unity:v2.7.0","imagePullPolicy":"IfNotPresent","envs":[{"name":"X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS","value":"false"},{"name":"X_CSI_EPHEMERAL_STAGING_PATH","value":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"},{"name":"X_CSI_ISCSI_CHROOT","value":"/noderoot"},{"name":"X_CSI_UNITY_SYNC_NODEINFO_INTERVAL","value":"15"},{"name":"KUBELET_CONFIG_DIR","value":"/var/lib/kubelet"},{"name":"CSI_LOG_LEVEL","value":"info"},{"name":"TENANT_NAME"},{"name":"CERT_SECRET_COUNT","value":"1"},{"name":"X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION","value":"true"}]},"controller":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]},"node":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]},"sideCars":[{"name":"external-health-monitor","enabled":false,"args":["--monitor-interval=60s"]}],"forceRemoveDriver":true}}}
2023-09-13T12:24:36.302Z    DEBUG   drivers/commonconfig.go:51  DriverSpec  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.306Z    DEBUG   drivers/commonconfig.go:72  Adding toleration   {"TraceId": "<omitted>-unity-1", "t": {"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}}
2023-09-13T12:24:36.306Z    INFO    drivers/commonconfig.go:111 Container to be removed {"TraceId": "<omitted>-unity-1", "name": "external-health-monitor"}
2023-09-13T12:24:36.306Z    INFO    controllers/csm_controller.go:530   Checking if standalone modules need clean up    {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.316Z    INFO    controllers/csm_controller.go:723   Starting SYNC for default-source-cluster cluster    {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.518Z    INFO    serviceaccount/serviceaccount.go:45 ServiceAccount already exists   {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-node"}
2023-09-13T12:24:36.518Z    INFO    serviceaccount/serviceaccount.go:45 ServiceAccount already exists   {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-controller"}
2023-09-13T12:24:36.619Z    INFO    rbac/clusterrole.go:45  Updating ClusterRoleName:<omitted>-unity-node   {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.658Z    INFO    rbac/clusterrole.go:45  Updating ClusterRoleName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.794Z    INFO    rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-node    {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.835Z    INFO    rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-controller  {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.973Z    INFO    csidriver/csidriver.go:41   CSIDriver Object exist  {"TraceId": "<omitted>-unity-1", "Name:": "csi-unity.dellemc.com"}

jooseppi-luna commented 1 year ago

Thanks for the logs! We will investigate to see if we can replicate the issue and decide if we should bump up the limits in an upcoming release. One thing I noticed is that the health monitor sidecar is disabled, but the health monitor env var is enabled for controller and node -- is that intentional/what use case is that?
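
For reference, the mismatch noted above concerns two settings in the ContainerStorageModule posted earlier. If volume health monitoring is actually wanted on the controller side, a consistent configuration would presumably enable both the sidecar and the env var, as in this fragment. The field names are taken from the posted resource; whether the feature should be enabled at all is an assumption about the intended use case.

```yaml
# Fragment of a ContainerStorageModule spec; only the health-monitor settings are shown.
spec:
  driver:
    sideCars:
      - name: external-health-monitor
        enabled: true                 # the posted resource has this set to false
        args: ["--monitor-interval=60s"]
    controller:
      envs:
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"
```

If health monitoring is not needed, the converse is to leave the sidecar disabled and set X_CSI_HEALTH_MONITOR_ENABLED to "false" for both controller and node.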

bharathsreekanth commented 11 months ago

@chimanjain @jooseppi-luna Do we have any internal ticket to track this? If so, then we need to move this query from a question to an appropriate bucket in GH.

cassanellicarlo commented 10 months ago

@jooseppi-luna any news on this?

jooseppi-luna commented 9 months ago

@cassanellicarlo sorry for the late follow up! We have increased the limits in the upcoming CSM 1.9 release (csm-operator v1.4.0). If you have any further questions or issues, please file them here and we will get to it asap.

dell / csm

[BUG]: Update resources limits for controller-manager to fix OOMKilled error #982
