Closed cassanellicarlo closed 9 months ago
Hi Carlo, thanks for the question. Are you getting OOM-killed before you do anything with the operator or are you getting killed while trying to do a bunch of stuff with it? Do you have any relevant logs?
Looks like I am able to add additional memory by editing line 921 of the deploy/operator.yaml file. After editing that line to 512Mi and reinstalling, I get the following when I describe the controller-manager pod (snipped for readability):
[root@master-1-095zyzFtPRfV5 csm-operator]# k describe pod -n dell-csm-operator dell-csm-operator-controller-manager-6bd6569b56-bqbs5
...
Containers:
  manager:
    Container ID: containerd://17f0b8031735e468fdb066ae31e119b174a5ab567a4d7d69aa386714b4701f62
    Image: docker.io/dellemc/dell-csm-operator:v1.2.0
    Image ID: docker.io/dellemc/dell-csm-operator@sha256:814895bdff2f49c0f9a7789490e6316688f85e3cab2c0a6215fa0f68034c5f32
    Port: <none>
    Host Port: <none>
    Command:
      /manager
    Args:
      --leader-elect
    State: Running
      Started: Wed, 13 Sep 2023 09:40:25 -0400
    Ready: True
    Restart Count: 0
    Limits:
      cpu: 200m
      memory: 512Mi
    Requests:
      cpu: 100m
      memory: 192Mi
...
If you could provide us with details of anything else that you might have installed on the system, as well as what all the operator has done leading up to the OOM kill, that would be super helpful! Thanks.
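To check which limits are actually in effect without reading the whole `describe` output, a small helper along these lines could be used. This is only a sketch: the function name `show_container_resources` is made up for illustration, and the Deployment name `dell-csm-operator-controller-manager` is inferred from the pod name above, so it may differ in your cluster.

```shell
# Print the resources block of one container in a Deployment.
# Arguments: $1 = namespace, $2 = Deployment name, $3 = container name.
# All three values are assumptions taken from the describe output in this
# thread; adjust them to your installation.
show_container_resources() {
  kubectl get deployment "$2" -n "$1" \
    -o jsonpath="{.spec.template.spec.containers[?(@.name==\"$3\")].resources}"
}
```

Usage would look like `show_container_resources dell-csm-operator dell-csm-operator-controller-manager manager`.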
Starting logs of controller manager:
2023-09-13T12:35:32.398Z DEBUG workspace/main.go:79 Operator Version {"TraceId": "main", "Version": "1.2.0", "Commit ID": "081702a4c6969af8038a31eaf072b13554323f51", "Commit SHA": "Fri, 23 Jun 2023 07:46:51 UTC"}
2023-09-13T12:35:32.398Z DEBUG workspace/main.go:80 Go Version: go1.20.5 {"TraceId": "main"}
2023-09-13T12:35:32.398Z DEBUG workspace/main.go:81 Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0913 12:35:33.500640 1 request.go:665] Waited for 1.01097461s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/logging.openshift.io/v1
2023-09-13T12:35:39.751Z INFO workspace/main.go:93 Openshift environment {"TraceId": "main"}
2023-09-13T12:35:39.753Z INFO workspace/main.go:132 Current kubernetes version is 1.25 which is a supported version {"TraceId": "main"}
2023-09-13T12:35:39.754Z INFO workspace/main.go:143 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
I0913 12:35:43.505285 1 request.go:665] Waited for 3.743544237s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps.gitlab.com/v1beta2?timeout=32s
1.6946085471103349e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.6946085471115081e+09 INFO setup starting manager
1.694608547111709e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.694608547111713e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
Previous last logs from restarted container
[BRUH] toleration t: {Key:node-role.kubernetes.io/infra Operator:Exists Value: Effect:NoSchedule TolerationSeconds:<nil>}
2023-09-13T12:35:05.760Z DEBUG drivers/commonconfig.go:40 GetController {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/controller.yaml"}
2023-09-13T12:35:05.760Z ERROR zap@v1.21.0/sugar.go:173 Ignored key without a value. {"TraceId": "<omitted>-unity-1", "ignored": {"driver":{"csiDriverType":"unity","csiDriverSpec":{"fSGroupPolicy":"ReadWriteOnceWithFSType"},"configVersion":"v2.7.0","replicas":2,"dnsPolicy":"ClusterFirstWithHostNet","common":{"image":"dellemc/csi-unity:v2.7.0","imagePullPolicy":"IfNotPresent","envs":[{"name":"X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS","value":"false"},{"name":"X_CSI_EPHEMERAL_STAGING_PATH","value":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"},{"name":"X_CSI_ISCSI_CHROOT","value":"/noderoot"},{"name":"X_CSI_UNITY_SYNC_NODEINFO_INTERVAL","value":"15"},{"name":"KUBELET_CONFIG_DIR","value":"/var/lib/kubelet"},{"name":"CSI_LOG_LEVEL","value":"info"},{"name":"CERT_SECRET_COUNT","value":"1"},{"name":"X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION","value":"true"}]},"controller":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSc...
2023-09-13T12:35:05.760Z DEBUG drivers/commonconfig.go:51 DriverSpec {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.764Z DEBUG drivers/commonconfig.go:72 Adding toleration {"TraceId": "<omitted>-unity-1", "t": {"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}}
2023-09-13T12:35:05.764Z INFO drivers/commonconfig.go:111 Container to be removed {"TraceId": "<omitted>-unity-1", "name": "external-health-monitor"}
2023-09-13T12:35:05.764Z INFO controllers/csm_controller.go:530 Checking if standalone modules need clean up {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.775Z INFO controllers/csm_controller.go:723 Starting SYNC for default-source-cluster cluster {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.976Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-node"}
2023-09-13T12:35:05.976Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-controller"}
2023-09-13T12:35:06.077Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.111Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.242Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.275Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.407Z INFO csidriver/csidriver.go:41 CSIDriver Object exist {"TraceId": "<omitted>-unity-1", "Name:": "csi-unity.dellemc.com"}
The only error I'm seeing is "ERROR zap@v1.21.0/sugar.go:173 Ignored key without a value.", but I don't know if that is related.
I'm installing the Operator via an OLM Subscription; I'm not using the operator.yaml file.
Metrics of the controller manager
I manually changed the limits in the operator YAML from the OpenShift console:
- resources:
    limits:
      cpu: 200m
      memory: 500Mi
    requests:
      cpu: 100m
      memory: 200Mi
and now the controller manager seems to work fine without restarting. But that's not a good way to set it.
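For OLM-based installs, one option that may avoid hand-editing the generated objects is the Subscription's `spec.config.resources` field, which OLM applies to the containers of the operator Deployment it manages. The sketch below builds a JSON merge patch for that field. The function names, the Subscription name and the namespace are placeholders, not values from this thread. My understanding is that these resources replace the defaults rather than merge with them, which is why the CPU values are passed as well, but please verify that against your OLM version.

```shell
# Build a JSON merge patch that sets spec.config.resources on an OLM Subscription.
# Arguments: $1 = cpu limit, $2 = memory limit, $3 = cpu request, $4 = memory request.
subscription_resources_patch() {
  printf '{"spec":{"config":{"resources":{"limits":{"cpu":"%s","memory":"%s"},"requests":{"cpu":"%s","memory":"%s"}}}}}' \
    "$1" "$2" "$3" "$4"
}

# Apply the patch to a Subscription.
# Arguments: $1 = namespace, $2 = Subscription name (both hypothetical here),
# followed by the four resource values accepted by subscription_resources_patch.
apply_subscription_resources() {
  ns="$1"; sub="$2"; shift 2
  kubectl patch subscriptions.operators.coreos.com "$sub" -n "$ns" \
    --type merge -p "$(subscription_resources_patch "$@")"
}
```

For example, `apply_subscription_resources <namespace> <subscription-name> 200m 500Mi 100m 200Mi` would mirror the values that were set by hand above. OLM should then roll out the controller-manager with the new limits.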
Ok that's good, I'm glad it's at least not getting killed right now. I agree that that's not a good long-term solution, we will work on a better fix and keep this issue updated.
@jooseppi-luna can you confirm if this is same as https://github.com/dell/csm/issues/184?
@bharathsreekanth it's related but not the same, https://github.com/dell/csm/issues/184 is for adding resource limits to helm charts. These resource limits already exist in operator and are what we are adjusting here to make the deployment work. See here for where we set them in operator.
@cassanellicarlo I spoke with @rensyct and it would help us to have these three things from you to figure this out:
1) Details on everything you installed/attempted to install with operator before it got killed.
2) Attach the sample files you used to install any drivers/modules you are installing (e.g., I can see you are installing csi-unity, can you attach your edited sample file for us to review and test on our end).
3) Attach complete operator logs from your controller-manager. You can get them like this (fill in your pod name and namespace): `kubectl logs dell-csm-operator-controller-manager-xxxxxxxxxx-xxxxx -n dell-csm-operator > operator-logs.txt`
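Since the container restarts after being killed, the logs of the previous instance and its recorded termination reason are usually the most useful evidence. A minimal sketch of how both could be collected is below. The function names and the second output file name are made up for illustration. A reason of `OOMKilled` would confirm the kill was memory-related.

```shell
# Print, per container, why its previous instance terminated (e.g. OOMKilled).
# Arguments: $1 = namespace, $2 = pod name.
last_termination_reason() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Save logs of the running container and of the previously terminated instance.
# Arguments: $1 = namespace, $2 = pod name.
collect_operator_logs() {
  kubectl logs "$2" -n "$1" > operator-logs.txt
  kubectl logs "$2" -n "$1" --previous > operator-logs-previous.txt
}
```

Both take the namespace first and the pod name second, for example `last_termination_reason dell-csm-operator dell-csm-operator-controller-manager-xxxxxxxxxx-xxxxx`.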
Operator: dell-csm-operator-certified.v1.2.0
ContainerStorageModule
`apiVersion: storage.dell.com/v1
kind: ContainerStorageModule
metadata:
  name:
  namespace: {{ .Values.namespace }}
spec:
  driver:
    csiDriverType: "unity"
    csiDriverSpec:
      # fsGroupPolicy: Defines if the underlying volume supports changing ownership and permission of the volume before being mounted.
      # Allowed values: ReadWriteOnceWithFSType, File , None
      # Default value: ReadWriteOnceWithFSType
      fSGroupPolicy: "ReadWriteOnceWithFSType"
    # Config version for CSI Unity v2.7.0 driver
    configVersion: {{ .Values.driver.release }}
    # Controller count
    replicas: 2
    dnsPolicy: ClusterFirstWithHostNet
    forceUpdate: false
    forceRemoveDriver: true
    common:
      # Image for CSI Unity driver v2.7.0
      image: "dellemc/csi-unity:{{ .Values.driver.release }}"
      imagePullPolicy: IfNotPresent
      envs:
        # X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS - Flag to enable sharing of volumes across multiple pods within the same node in RWO access mode.
        # Allowed values: boolean
        # Default value: "false"
        # Examples : "true" , "false"
        - name: X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS
          value: "false"
        - name: X_CSI_EPHEMERAL_STAGING_PATH
          value: "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"
        # X_CSI_ISCSI_CHROOT is the path to which the driver will chroot before
        # running any iscsi commands. This value should only be set when instructed
        # by technical support
        - name: X_CSI_ISCSI_CHROOT
          value: "/noderoot"
        # X_CSI_UNITY_SYNC_NODEINFO_INTERVAL - Time interval to add node info to array. Default 15 minutes. Minimum value should be 1.
        # Allowed values: integer
        # Default value: 15
        # Examples : 0 , 2
        - name: X_CSI_UNITY_SYNC_NODEINFO_INTERVAL
          value: "15"
        # Specify kubelet config dir path.
        # Ensure that the config.yaml file is present at this path.
        # Default value: None
        - name: KUBELET_CONFIG_DIR
          value: /var/lib/kubelet
        # CSI_LOG_LEVEL is used to set the logging level of the driver.
        # Allowed values: "error", "warn"/"warning", "info", "debug"
        # Default value: "info"
        - name: CSI_LOG_LEVEL
          value: {{ .Values.logLevel }}
        # TENANT_NAME - Tenant name that need to added while adding host entry to the array.
        # Allowed values: string
        # Default value: ""
        # Examples : "tenant2" , "tenant3"
        - name: TENANT_NAME
          value: ""
        # CERT_SECRET_COUNT: Represents number of certificate secrets, which user is going to create for
        # ssl authentication. (unity-cert-0..unity-cert-n)
        # This field is only verified if X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION is set to false
        # Allowed values: n, where n > 0
        # Default value: None
        - name: CERT_SECRET_COUNT
          value: "1"
        # X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION: Specifies if the driver is going to validate unisphere certs while connecting to the Unisphere REST API interface.
        # If it is set to false, then a secret unity-certs has to be created with an X.509 certificate of CA which signed the Unisphere certificate
        # Allowed values:
        # true: skip Unisphere API server's certificate verification
        # false: verify Unisphere API server's certificates
        # Default value: true
        - name: X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION
          value: "true"
    sideCars:
      # health monitor is disabled by default, refer to driver documentation before enabling it
      - name: external-health-monitor
        enabled: false
        args: ["--monitor-interval=60s"]
    controller:
      envs:
        # X_CSI_HEALTH_MONITOR_ENABLED: Enable/Disable health monitor of CSI volumes from Controller plugin - volume condition.
        # Install the 'external-health-monitor' sidecar accordingly.
        # Allowed values:
        # true: enable checking of health condition of CSI volumes
        # false: disable checking of health condition of CSI volumes
        # Default value: false
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"
      #nodeSelector:
      # Uncomment if nodes you wish to use have the node-role.kubernetes.io/control-plane taint
      # node-role.kubernetes.io/control-plane: ""
      # tolerations: Define tolerations for the controllers, if required.
      # Leave as blank to install controller on worker nodes
      # Default value: None
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
    node:
      envs:
        # X_CSI_HEALTH_MONITOR_ENABLED: Enable/Disable health monitor of CSI volumes from node plugin - volume usage
        # Allowed values:
        # true: enable checking of health condition of CSI volumes
        # false: disable checking of health condition of CSI volumes
        # Default value: false
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"
      # nodeSelector: Define node selection constraints for node pods.
      # For the pod to be eligible to run on a node, the node must have each
      # of the indicated key-value pairs as labels.
      # Leave as blank to consider all nodes
      # Allowed values: map of key-value pairs
      # Default value: None
      #nodeSelector:
      # Uncomment if nodes you wish to use have the node-role.kubernetes.io/control-plane taint
      # node-role.kubernetes.io/control-plane: ""
      # tolerations: Define tolerations for the controllers, if required.
      # Leave as blank to install controller on worker nodes
      # Default value: None
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists`
Logs
2023-09-13T12:24:00.532Z DEBUG workspace/main.go:79 Operator Version {"TraceId": "main", "Version": "1.2.0", "Commit ID": "081702a4c6969af8038a31eaf072b13554323f51", "Commit SHA": "Fri, 23 Jun 2023 07:46:51 UTC"}
2023-09-13T12:24:00.532Z DEBUG workspace/main.go:80 Go Version: go1.20.5 {"TraceId": "main"}
2023-09-13T12:24:00.532Z DEBUG workspace/main.go:81 Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0913 12:24:01.656484 1 request.go:665] Waited for 1.042073415s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/ingress.operator.openshift.io/v1
2023-09-13T12:24:07.908Z INFO workspace/main.go:93 Openshift environment {"TraceId": "main"}
2023-09-13T12:24:07.911Z INFO workspace/main.go:132 Current kubernetes version is 1.25 which is a supported version {"TraceId": "main"}
2023-09-13T12:24:07.911Z INFO workspace/main.go:143 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
I0913 12:24:11.663191 1 request.go:665] Waited for 3.742098704s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v1alpha1?timeout=32s
1.6946078552691364e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.6946078552714586e+09 INFO setup starting manager
1.6946078552724323e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0913 12:24:15.272451 1 leaderelection.go:248] attempting to acquire leader lease dell-csm/090cae6a.dell.com...
1.6946078552724354e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0913 12:24:34.016487 1 leaderelection.go:258] successfully acquired lease dell-csm/090cae6a.dell.com
1.6946078740166562e+09 INFO controller.containerstoragemodule Starting EventSource {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
1.694607874016701e+09 INFO controller.containerstoragemodule Starting Controller {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
1.6946078740167027e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"dell-csm","name":"090cae6a.dell.com","uid":"80857b55-a5bd-405f-91f6-9e50580ecc85","apiVersion":"v1","resourceVersion":"1282335978"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-7b8dc694fd-9vh5n_de44f76c-dbdc-4a1a-8624-e1420fff6861 became leader"}
1.6946078740168204e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"dell-csm","name":"090cae6a.dell.com","uid":"3ae38669-e585-44cf-9ae1-7a7cc849e250","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1282335979"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-7b8dc694fd-9vh5n_de44f76c-dbdc-4a1a-8624-e1420fff6861 became leader"}
1.6946078741176162e+09 INFO controller.containerstoragemodule Starting workers {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
2023-09-13T12:24:34.117Z INFO controllers/csm_controller.go:203 ################Starting Reconcile############## {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:34.117Z INFO controllers/csm_controller.go:206 reconcile for {"TraceId": "<omitted>-unity-1", "Namespace": "dell-csm", "Name": "<omitted>-unity", "Attempt": 1}
2023-09-13T12:24:34.117Z DEBUG drivers/unity.go:88 preCheck {"TraceId": "<omitted>-unity-1", "secrets": 1, "certCount": 1, "Namespace": "dell-csm"}
2023-09-13T12:24:35.918Z INFO controllers/csm_controller.go:1202 proceeding with modification of driver install {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:35.923Z INFO controllers/csm_controller.go:1130 Owner reference is found and matches {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:35.923Z INFO utils/status.go:156
daemonset status for cluster: default-source-cluster {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.226Z INFO utils/status.go:181 daemonset pod <omitted>-unity-node-dzj26 : Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.226Z INFO utils/status.go:181 daemonset pod <omitted>-unity-node-7lr7z : Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:181 daemonset pod <omitted>-unity-node-rz2xj : Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:181 daemonset pod <omitted>-unity-node-p2gr5 : Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:181 daemonset pod <omitted>-unity-node-wxk4t : Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:181 daemonset pod <omitted>-unity-node-gzs7r : Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:181 daemonset pod <omitted>-unity-node-srh72 : Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:204 daemonset status available pods 7 {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:205 daemonset status failedCount pods 0 {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:206 daemonset status desired pods 7 {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:239 deployment controllerReplicas [2] {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:240 deployment controllerStatus.Available [2] {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:242 daemonset expected [7] {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:243 daemonset nodeStatus.Available [7] {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:249 calculate overall state [Succeeded] {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO utils/status.go:277 Driver State {"TraceId": "<omitted>-unity-1", "Controller": {"available":"2","desired":"2","failed":"0"}, "Node": {"available":"7","desired":"7","failed":"0"}}
2023-09-13T12:24:36.227Z INFO utils/status.go:365 HandleSuccess Driver state {"TraceId": "<omitted>-unity-1", "newStatus.State": "Running"}
2023-09-13T12:24:36.227Z INFO utils/status.go:369 HandleSuccess Driver state didn't change from Running {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z INFO controllers/csm_controller.go:887 Getting unity CSI Driver for Dell Technologies {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z DEBUG drivers/commonconfig.go:333 GetConfigMap {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/driver-config-params.yaml"}
2023-09-13T12:24:36.227Z DEBUG drivers/commonconfig.go:368 GetCSIDriver {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/csidriver.yaml"}
2023-09-13T12:24:36.228Z DEBUG drivers/commonconfig.go:390 GetCSIDriver {"TraceId": "<omitted>-unity-1", "fsGroupPolicy": "ReadWriteOnceWithFSType"}
2023-09-13T12:24:36.228Z DEBUG drivers/commonconfig.go:176 GetNode {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/node.yaml"}
[BRUH] toleration t: {Key:node-role.kubernetes.io/infra Operator:Exists Value: Effect:NoSchedule TolerationSeconds:<nil>}
2023-09-13T12:24:36.232Z DEBUG drivers/commonconfig.go:40 GetController {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/controller.yaml"}
2023-09-13T12:24:36.302Z ERROR zap@v1.21.0/sugar.go:173 Ignored key without a value. {"TraceId": "<omitted>-unity-1", "ignored": {"driver":{"csiDriverType":"unity","csiDriverSpec":{"fSGroupPolicy":"ReadWriteOnceWithFSType"},"configVersion":"v2.7.0","replicas":2,"dnsPolicy":"ClusterFirstWithHostNet","common":{"image":"dellemc/csi-unity:v2.7.0","imagePullPolicy":"IfNotPresent","envs":[{"name":"X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS","value":"false"},{"name":"X_CSI_EPHEMERAL_STAGING_PATH","value":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"},{"name":"X_CSI_ISCSI_CHROOT","value":"/noderoot"},{"name":"X_CSI_UNITY_SYNC_NODEINFO_INTERVAL","value":"15"},{"name":"KUBELET_CONFIG_DIR","value":"/var/lib/kubelet"},{"name":"CSI_LOG_LEVEL","value":"info"},{"name":"TENANT_NAME"},{"name":"CERT_SECRET_COUNT","value":"1"},{"name":"X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION","value":"true"}]},"controller":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]},"node":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]},"sideCars":[{"name":"external-health-monitor","enabled":false,"args":["--monitor-interval=60s"]}],"forceRemoveDriver":true}}}
2023-09-13T12:24:36.302Z DEBUG drivers/commonconfig.go:51 DriverSpec {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.306Z DEBUG drivers/commonconfig.go:72 Adding toleration {"TraceId": "<omitted>-unity-1", "t": {"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}}
2023-09-13T12:24:36.306Z INFO drivers/commonconfig.go:111 Container to be removed {"TraceId": "<omitted>-unity-1", "name": "external-health-monitor"}
2023-09-13T12:24:36.306Z INFO controllers/csm_controller.go:530 Checking if standalone modules need clean up {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.316Z INFO controllers/csm_controller.go:723 Starting SYNC for default-source-cluster cluster {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.518Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-node"}
2023-09-13T12:24:36.518Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-controller"}
2023-09-13T12:24:36.619Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.658Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.794Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.835Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.973Z INFO csidriver/csidriver.go:41 CSIDriver Object exist {"TraceId": "<omitted>-unity-1", "Name:": "csi-unity.dellemc.com"}
Thanks for the logs! We will investigate to see if we can replicate the issue and decide if we should bump up the limits in an upcoming release. One thing I noticed is that the health monitor sidecar is disabled, but the health monitor env var is enabled for controller and node. Is that intentional, and if so, what is the use case?
@chimanjain @jooseppi-luna Do we have any internal ticket to track this? If so, then we need to move this query from a question to an appropriate bucket in GH.
@jooseppi-luna any news on this?
@cassanellicarlo sorry for the late follow up! We have increased the limits in the upcoming CSM 1.9 release (csm-operator v1.4.0). If you have any further questions or issues, please file them here and we will get to it asap.
How can the Team help you today?
Details: ?
I'm using dell-csm-operator-certified.v1.2.0 operator on OpenShift 4.12. I installed it successfully, but the controller-manager is getting OOM-killed because it's consuming more memory than the limit set.
The default limit for the container is set to 256Mi. How can one increase it in the ContainerStorageModule resource?