dell / csm

Dell Container Storage Modules (CSM)

[BUG]: Dell Container Storage Modules Operator doesn't create namespace for karavi when Observability for PowerScale is active which OOMKills the dell-csm-operator-controller-manager pod, manager container #728

Closed. grvn closed this issue 1 year ago.

grvn commented 1 year ago

Bug Description

Installing the Dell Container Storage Modules Operator, creating a ContainerStorageModule for PowerScale v2.5, and activating Observability v1.4.0 makes the "manager" container in the "dell-csm-operator-controller-manager" pod use more memory than its limit, which triggers the OOM killer. Even after increasing the limit to 1 GB of memory, the container still gets OOMKilled.

The problem seems to be that the operator fails to create the namespace "karavi".

Workaround: Manually create the namespace "karavi"
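
For example, a minimal manifest for that workaround (nothing more than an empty namespace) would be:

apiVersion: v1
kind: Namespace
metadata:
  name: karavi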

Logs

journal from the node

worker1 hyperkube[2810]: I0324 15:03:44.053555    2810 kubelet.go:2095] "SyncLoop ADD" source="api" pods=[dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx]
worker1 hyperkube[2810]: I0324 15:03:44.053595    2810 topology_manager.go:200] "Topology Admit Handler"
worker1 systemd[1]: Created slice libcontainer container kubepods-burstable-podb51948d8_9fdb_4dee_b012_822fdcfd1fcc.slice.
worker1 hyperkube[2810]: I0324 15:03:44.098253    2810 reconciler.go:342] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-hhc2s\" (UniqueName: \"kubernetes.io/projected/b51948d8-9fdb-4dee-b012-822fdcfd1fcc-kube-api-access-hhc2s\") pod \"dell-csm-operator-controller-manager-5f8444d888-9ktqx\" (UID: \"b51948d8-9fdb-4dee-b012-822fdcfd1fcc\") " pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"
worker1 hyperkube[2810]: I0324 15:03:44.173133    2810 kubelet.go:2102] "SyncLoop UPDATE" source="api" pods=[dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx]
worker1 hyperkube[2810]: I0324 15:03:44.199499    2810 reconciler.go:254] "operationExecutor.MountVolume started for volume \"kube-api-access-hhc2s\" (UniqueName: \"kubernetes.io/projected/b51948d8-9fdb-4dee-b012-822fdcfd1fcc-kube-api-access-hhc2s\") pod \"dell-csm-operator-controller-manager-5f8444d888-9ktqx\" (UID: \"b51948d8-9fdb-4dee-b012-822fdcfd1fcc\") " pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"
worker1 hyperkube[2810]: I0324 15:03:44.227470    2810 operation_generator.go:703] "MountVolume.SetUp succeeded for volume \"kube-api-access-hhc2s\" (UniqueName: \"kubernetes.io/projected/b51948d8-9fdb-4dee-b012-822fdcfd1fcc-kube-api-access-hhc2s\") pod \"dell-csm-operator-controller-manager-5f8444d888-9ktqx\" (UID: \"b51948d8-9fdb-4dee-b012-822fdcfd1fcc\") " pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"
worker1 hyperkube[2810]: I0324 15:03:44.364759    2810 kuberuntime_manager.go:469] "No sandbox for pod can be found. Need to start a new one" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"
worker1 crio[2768]: time="2023-03-24 15:03:44.365762201Z" level=info msg="Running pod sandbox: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/POD" id=1349bb08-c703-475b-85d0-01fae9fb6985 name=/runtime.v1.RuntimeService/RunPodSandbox
worker1 crio[2768]: time="2023-03-24 15:03:44.365803881Z" level=warning msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
worker1 crio[2768]: time="2023-03-24 15:03:44.380900337Z" level=info msg="Got pod network &{Name:dell-csm-operator-controller-manager-5f8444d888-9ktqx Namespace:dell-csm-operators ID:5bf793be3c87d07693546f288796dbec38d0cc48648bdfa4355af40bb597a45a UID:b51948d8-9fdb-4dee-b012-822fdcfd1fcc NetNS:/var/run/netns/c330428a-1a45-4770-9973-b94438fe90df Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
worker1 crio[2768]: time="2023-03-24 15:03:44.380925269Z" level=info msg="Adding pod dell-csm-operators_dell-csm-operator-controller-manager-5f8444d888-9ktqx to CNI network \"multus-cni-network\" (type=multus)"
worker1 kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
worker1 kernel: IPv6: ADDRCONF(NETDEV_UP): 5bf793be3c87d07: link is not ready
worker1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): 5bf793be3c87d07: link becomes ready
worker1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
worker1 systemd-udevd[158178]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
worker1 systemd-udevd[158178]: Could not generate persistent MAC address for 5bf793be3c87d07: No such file or directory
worker1 NetworkManager[1398]: <info>  [1679670224.5154] device (5bf793be3c87d07): carrier: link connected
worker1 NetworkManager[1398]: <info>  [1679670224.5157] manager: (5bf793be3c87d07): new Veth device (/org/freedesktop/NetworkManager/Devices/80)
worker1 ovs-vswitchd[1352]: ovs|00198|bridge|INFO|bridge br-int: added interface 5bf793be3c87d07 on port 42
worker1 kernel: device 5bf793be3c87d07 entered promiscuous mode
worker1 NetworkManager[1398]: <info>  [1679670224.5326] manager: (5bf793be3c87d07): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/81)
worker1 crio[2768]: I0324 15:03:44.504288  158163 ovs.go:91] Maximum command line arguments set to: 191102
worker1 crio[2768]: 2023-03-24T15:03:46Z [verbose] Add: dell-csm-operators:dell-csm-operator-controller-manager-5f8444d888-9ktqx:b51948d8-9fdb-4dee-b012-822fdcfd1fcc:ovn-kubernetes(ovn-kubernetes):eth0 {"cniVersion":"0.4.0","interfaces":[{"name":"5bf793be3c87d07","mac":"fe:93:0b:1a:c3:10"},{"name":"eth0","mac":"0a:58:0a:a1:0a:17","sandbox":"/var/run/netns/c330428a-1a45-4770-9973-b94438fe90df"}],"ips":[{"version":"4","interface":1,"address":"10.161.10.23/23","gateway":"10.161.10.1"}],"dns":{}}
worker1 crio[2768]: I0324 15:03:46.596987  158152 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"dell-csm-operators", Name:"dell-csm-operator-controller-manager-5f8444d888-9ktqx", UID:"b51948d8-9fdb-4dee-b012-822fdcfd1fcc", APIVersion:"v1", ResourceVersion:"1071538509", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.161.10.23/23] from ovn-kubernetes
worker1 crio[2768]: time="2023-03-24 15:03:46.633779469Z" level=info msg="Got pod network &{Name:dell-csm-operator-controller-manager-5f8444d888-9ktqx Namespace:dell-csm-operators ID:5bf793be3c87d07693546f288796dbec38d0cc48648bdfa4355af40bb597a45a UID:b51948d8-9fdb-4dee-b012-822fdcfd1fcc NetNS:/var/run/netns/c330428a-1a45-4770-9973-b94438fe90df Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
worker1 hyperkube[2810]: I0324 15:03:46.633856    2810 kubelet.go:2102] "SyncLoop UPDATE" source="api" pods=[dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx]
worker1 crio[2768]: time="2023-03-24 15:03:46.633892466Z" level=info msg="Checking pod dell-csm-operators_dell-csm-operator-controller-manager-5f8444d888-9ktqx for CNI network multus-cni-network (type=multus)"
worker1 crio[2768]: time="2023-03-24 15:03:46.637113856Z" level=info msg="Ran pod sandbox 5bf793be3c87d07693546f288796dbec38d0cc48648bdfa4355af40bb597a45a with infra container: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/POD" id=1349bb08-c703-475b-85d0-01fae9fb6985 name=/runtime.v1.RuntimeService/RunPodSandbox
worker1 crio[2768]: time="2023-03-24 15:03:46.638314556Z" level=info msg="Checking image status: gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c" id=7df3412a-8456-432b-a4af-f0ddee060f65 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.638460462Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:ad393d6a4d1b1ccbce65c2a4db6064635a8aac883d9dd66a38e14ce925e93b7f,RepoTags:[],RepoDigests:[gcr.io/kubebuilder/kube-rbac-proxy@sha256:34e8724e0f47e31eb2ec3279ac398b657db5f60f167426ee73138e2e84af6486 gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c],Size_:50203713,Uid:&Int64Value{Value:65532,},Username:,Spec:nil,},Info:map[string]string{},}" id=7df3412a-8456-432b-a4af-f0ddee060f65 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.638890044Z" level=info msg="Checking image status: gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c" id=77eb150f-24bb-49f0-b920-abea75f28ffc name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.639031684Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:ad393d6a4d1b1ccbce65c2a4db6064635a8aac883d9dd66a38e14ce925e93b7f,RepoTags:[],RepoDigests:[gcr.io/kubebuilder/kube-rbac-proxy@sha256:34e8724e0f47e31eb2ec3279ac398b657db5f60f167426ee73138e2e84af6486 gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c],Size_:50203713,Uid:&Int64Value{Value:65532,},Username:,Spec:nil,},Info:map[string]string{},}" id=77eb150f-24bb-49f0-b920-abea75f28ffc name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.639554437Z" level=info msg="Creating container: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/kube-rbac-proxy" id=ce58b448-d3dd-49f5-9102-98ca779fac85 name=/runtime.v1.RuntimeService/CreateContainer
worker1 crio[2768]: time="2023-03-24 15:03:46.639621133Z" level=warning msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
worker1 systemd[1]: Started crio-conmon-c2f64e4fc0687b5f10ed5a853523aafa19f42b81a24a98e53025453331bb7839.scope.
worker1 systemd[1]: Started libcontainer container c2f64e4fc0687b5f10ed5a853523aafa19f42b81a24a98e53025453331bb7839.
worker1 systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
worker1 crio[2768]: time="2023-03-24 15:03:46.850275574Z" level=info msg="Created container c2f64e4fc0687b5f10ed5a853523aafa19f42b81a24a98e53025453331bb7839: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/kube-rbac-proxy" id=ce58b448-d3dd-49f5-9102-98ca779fac85 name=/runtime.v1.RuntimeService/CreateContainer
worker1 crio[2768]: time="2023-03-24 15:03:46.850645948Z" level=info msg="Starting container: c2f64e4fc0687b5f10ed5a853523aafa19f42b81a24a98e53025453331bb7839" id=92934745-0b00-4d3c-b997-6ef54a4ded92 name=/runtime.v1.RuntimeService/StartContainer
worker1 crio[2768]: time="2023-03-24 15:03:46.864968085Z" level=info msg="Started container" PID=158231 containerID=c2f64e4fc0687b5f10ed5a853523aafa19f42b81a24a98e53025453331bb7839 description=dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/kube-rbac-proxy id=92934745-0b00-4d3c-b997-6ef54a4ded92 name=/runtime.v1.RuntimeService/StartContainer sandboxID=5bf793be3c87d07693546f288796dbec38d0cc48648bdfa4355af40bb597a45a
worker1 crio[2768]: time="2023-03-24 15:03:46.872478966Z" level=info msg="Checking image status: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=68a10203-49e4-4b1a-8785-fb6d564bc317 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.872633077Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:ee9fe522f5a5c69d66dd5b0cde5adbc45d9f2f6fced9c1b306b9bad5f156df9e,RepoTags:[],RepoDigests:[docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6],Size_:280736887,Uid:&Int64Value{Value:1001,},Username:,Spec:nil,},Info:map[string]string{},}" id=68a10203-49e4-4b1a-8785-fb6d564bc317 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.873133872Z" level=info msg="Pulling image: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=fb98ebd3-3da2-4344-ae56-c5ecae46789e name=/runtime.v1.ImageService/PullImage
worker1 crio[2768]: time="2023-03-24 15:03:46.922549455Z" level=info msg="Trying to access \"docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6\""
worker1 crio[2768]: time="2023-03-24 15:03:46.973190050Z" level=info msg="Pulled image: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=fb98ebd3-3da2-4344-ae56-c5ecae46789e name=/runtime.v1.ImageService/PullImage
worker1 crio[2768]: time="2023-03-24 15:03:46.974547165Z" level=info msg="Checking image status: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=ade0eecc-73b1-4514-a8d4-f251444451ac name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.974679899Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:ee9fe522f5a5c69d66dd5b0cde5adbc45d9f2f6fced9c1b306b9bad5f156df9e,RepoTags:[],RepoDigests:[docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6],Size_:280736887,Uid:&Int64Value{Value:1001,},Username:,Spec:nil,},Info:map[string]string{},}" id=ade0eecc-73b1-4514-a8d4-f251444451ac name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:03:46.975356372Z" level=info msg="Creating container: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/manager" id=ef7ce817-b6b4-480c-9435-4aa8b6da25d9 name=/runtime.v1.RuntimeService/CreateContainer
worker1 crio[2768]: time="2023-03-24 15:03:46.975455687Z" level=warning msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
worker1 systemd[1]: Started crio-conmon-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope.
worker1 systemd[1]: Started libcontainer container 46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.
worker1 systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
worker1 hyperkube[2810]: I0324 15:03:47.068033    2810 kubelet.go:2133] "SyncLoop (PLEG): event for pod" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx" event=&{ID:b51948d8-9fdb-4dee-b012-822fdcfd1fcc Type:ContainerStarted Data:c2f64e4fc0687b5f10ed5a853523aafa19f42b81a24a98e53025453331bb7839}
worker1 hyperkube[2810]: I0324 15:03:47.068065    2810 kubelet.go:2133] "SyncLoop (PLEG): event for pod" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx" event=&{ID:b51948d8-9fdb-4dee-b012-822fdcfd1fcc Type:ContainerStarted Data:5bf793be3c87d07693546f288796dbec38d0cc48648bdfa4355af40bb597a45a}
worker1 crio[2768]: time="2023-03-24 15:03:47.162626350Z" level=info msg="Created container 46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/manager" id=ef7ce817-b6b4-480c-9435-4aa8b6da25d9 name=/runtime.v1.RuntimeService/CreateContainer
worker1 crio[2768]: time="2023-03-24 15:03:47.162837553Z" level=info msg="Starting container: 46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa" id=e9f373be-9864-4296-9cb0-b6751a69b3a7 name=/runtime.v1.RuntimeService/StartContainer
worker1 crio[2768]: time="2023-03-24 15:03:47.176244294Z" level=info msg="Started container" PID=158298 containerID=46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa description=dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/manager id=e9f373be-9864-4296-9cb0-b6751a69b3a7 name=/runtime.v1.RuntimeService/StartContainer sandboxID=5bf793be3c87d07693546f288796dbec38d0cc48648bdfa4355af40bb597a45a
worker1 hyperkube[2810]: I0324 15:03:48.070570    2810 kubelet.go:2133] "SyncLoop (PLEG): event for pod" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx" event=&{ID:b51948d8-9fdb-4dee-b012-822fdcfd1fcc Type:ContainerStarted Data:46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa}
worker1 hyperkube[2810]: I0324 15:03:49.072630    2810 kubelet.go:2205] "SyncLoop (probe)" probe="readiness" status="" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"
worker1 ovs-vswitchd[1352]: ovs|00199|connmgr|INFO|br-ex<->unix#472: 2 flow_mods in the last 0 s (2 adds)
worker1 hyperkube[2810]: I0324 15:03:54.366530    2810 prober.go:121] "Probe failed" probeType="Readiness" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx" podUID=b51948d8-9fdb-4dee-b012-822fdcfd1fcc containerName="manager" probeResult=failure output="Get \"http://10.161.10.23:8081/readyz\": dial tcp 10.161.10.23:8081: connect: connection refused"
worker1 systemd[1]: run-runc-788bf42fbf68fd2c7141ce47be24f36002665dcfe7dd8555dfdd8dc8282a003c-runc.3WUmSs.mount: Succeeded.
worker1 hyperkube[2810]: I0324 15:04:04.365981    2810 kubelet.go:2205] "SyncLoop (probe)" probe="readiness" status="ready" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"
worker1 ovs-vswitchd[1352]: ovs|00200|connmgr|INFO|br-ex<->unix#476: 2 flow_mods in the last 0 s (2 adds)
worker1 ovs-vswitchd[1352]: ovs|00202|connmgr|INFO|br-int<->unix#2: 69 flow_mods in the 7 s starting 55 s ago (34 adds, 35 deletes)
worker1 conmon[158285]: conmon 46e2a098359ac60961b5 <ninfo>: OOM event received
worker1 conmon[158285]: conmon 46e2a098359ac60961b5 <ninfo>: OOM received
worker1 kernel: manager invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=976
worker1 kernel: CPU: 7 PID: 158544 Comm: manager Tainted: P           OE    --------- -  - 4.18.0-372.43.1.el8_6.x86_64 #1
worker1 kernel: Hardware name: <redacted>
worker1 kernel: Call Trace:
worker1 kernel:  dump_stack+0x41/0x60
worker1 kernel:  dump_header+0x4a/0x1df
worker1 kernel:  oom_kill_process.cold.32+0xb/0x10
worker1 kernel:  out_of_memory+0x1bd/0x4e0
worker1 kernel:  mem_cgroup_out_of_memory+0xec/0x100
worker1 kernel:  try_charge+0x64f/0x690
worker1 kernel:  __mem_cgroup_charge+0x39/0xa0
worker1 kernel:  mem_cgroup_charge+0x2f/0x80
worker1 kernel:  __add_to_page_cache_locked+0x36c/0x3d0
worker1 kernel:  ? scan_shadow_nodes+0x30/0x30
worker1 kernel:  add_to_page_cache_lru+0x4a/0xc0
worker1 kernel:  iomap_readpages_actor+0x103/0x230
worker1 kernel:  iomap_apply+0xfb/0x330
worker1 kernel:  ? iomap_ioend_try_merge+0xf0/0xf0
worker1 kernel:  ? iomap_ioend_try_merge+0xf0/0xf0
worker1 kernel:  iomap_readpages+0xa8/0x1f0
worker1 kernel:  ? iomap_ioend_try_merge+0xf0/0xf0
worker1 kernel:  read_pages+0x6b/0x1a0
worker1 kernel:  __do_page_cache_readahead+0x1c5/0x1e0
worker1 kernel:  filemap_fault+0x770/0xa10
worker1 kernel:  ? __mod_lruvec_page_state+0x5e/0x80
worker1 kernel:  ? page_add_file_rmap+0x99/0x130
worker1 kernel:  ? alloc_set_pte+0xb8/0x3f0
worker1 kernel:  ? xas_load+0x8/0x80
worker1 kernel:  ? _cond_resched+0x15/0x30
worker1 kernel:  __xfs_filemap_fault+0x6d/0x200 [xfs]
worker1 kernel:  __do_fault+0x38/0xc0
worker1 kernel:  do_fault+0x1a4/0x3c0
worker1 kernel:  ? recalc_sigpending+0x17/0x60
worker1 kernel:  __handle_mm_fault+0x4a7/0x7f0
worker1 kernel:  handle_mm_fault+0xc1/0x1e0
worker1 kernel:  do_user_addr_fault+0x1b9/0x450
worker1 kernel:  do_page_fault+0x37/0x130
worker1 kernel:  ? page_fault+0x8/0x30
worker1 kernel:  page_fault+0x1e/0x30
worker1 kernel: RIP: 0033:0x43a51a
worker1 kernel: Code: Unable to access opcode bytes at RIP 0x43a4f0.
worker1 kernel: RSP: 002b:000000c000a07158 EFLAGS: 00010206
worker1 kernel: RAX: 0000000001cef2d7 RBX: 0000000000000029 RCX: 0000000002494190
worker1 kernel: RDX: 2e656d69746e7572 RSI: 0000000001cef2d7 RDI: 0000000000000020
worker1 kernel: RBP: 000000c000a071c0 R08: 0000000000000000 R09: 000000000270e580
worker1 kernel: R10: 00000000019ee760 R11: 0000000001ceffe0 R12: 000000c000a070c8
worker1 kernel: R13: 0000000000000000 R14: 000000c00053a680 R15: 0000000000000040
worker1 kernel: memory: usage 1048576kB, limit 1048576kB, failcnt 0
worker1 kernel: memory+swap: usage 1048576kB, limit 1048576kB, failcnt 255160
worker1 kernel: kmem: usage 2792kB, limit 9007199254740988kB, failcnt 0
worker1 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb51948d8_9fdb_4dee_b012_822fdcfd1fcc.slice/crio-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope:
worker1 kernel: anon 1059188736
                                      file 11694080
                                      kernel_stack 212992
                                      pagetables 2228224
                                      percpu 288
                                      sock 0
                                      shmem 0
                                      file_mapped 0
                                      file_dirty 0
                                      file_writeback 0
                                      swapcached 0
                                      anon_thp 926941184
                                      file_thp 0
                                      shmem_thp 0
                                      inactive_anon 1059184640
                                      active_anon 4096
                                      inactive_file 11694080
                                      active_file 0
                                      unevictable 0
                                      slab_reclaimable 133760
                                      slab_unreclaimable 222776
                                      slab 356536
                                      workingset_refault_anon 0
                                      workingset_refault_file 632737
                                      workingset_activate_anon 0
                                      workingset_activate_file 41447
                                      workingset_restore_anon 0
                                      workingset_restore_file 12599
                                      workingset_nodereclaim 0
                                      pgfault 27504
                                      pgmajfault 1001
                                      pgrefill 52485
                                      pgscan 3414128
                                      pgsteal 639471
                                      pgactivate 8025
                                      pgdeactivate 49473
                                      pglazyfree 0
                                      pglazyfreed 0
                                      thp_fault_alloc 476
                                      thp_collapse_alloc 0
worker1 kernel: Tasks state (memory values in pages):
worker1 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
worker1 kernel: [ 158298] 1001770000 158298   476080   258508  2240512        0           976 manager
worker1 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb51948d8_9fdb_4dee_b012_822fdcfd1fcc.slice/crio-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb51948d8_9fdb_4dee_b012_822fdcfd1fcc.slice/crio-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope,task=manager,pid=158298,uid=1001770000
worker1 kernel: Memory cgroup out of memory: Killed process 158298 (manager) total-vm:1904320kB, anon-rss:1034032kB, file-rss:0kB, shmem-rss:0kB, UID:1001770000 pgtables:2188kB oom_score_adj:976
worker1 kernel: oom_reaper: reaped process 158298 (manager), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
worker1 hyperkube[2810]: I0324 15:04:35.254112    2810 prober.go:121] "Probe failed" probeType="Readiness" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx" podUID=b51948d8-9fdb-4dee-b012-822fdcfd1fcc containerName="manager" probeResult=failure output="Get \"http://10.161.10.23:8081/readyz\": read tcp 10.161.10.2:54048->10.161.10.23:8081: read: connection reset by peer"
worker1 conmon[158285]: conmon 46e2a098359ac60961b5 <ninfo>: container 158298 exited with status 137
worker1 systemd[1]: crio-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope: Succeeded.
worker1 systemd[1]: crio-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope: Consumed 4.433s CPU time
worker1 systemd[1]: crio-conmon-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope: Succeeded.
worker1 systemd[1]: crio-conmon-46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa.scope: Consumed 20ms CPU time
worker1 hyperkube[2810]: I0324 15:04:36.153595    2810 generic.go:296] "Generic (PLEG): container finished" podID=b51948d8-9fdb-4dee-b012-822fdcfd1fcc containerID="46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa" exitCode=137
worker1 hyperkube[2810]: I0324 15:04:36.153638    2810 kubelet.go:2133] "SyncLoop (PLEG): event for pod" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx" event=&{ID:b51948d8-9fdb-4dee-b012-822fdcfd1fcc Type:ContainerDied Data:46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa}
worker1 hyperkube[2810]: I0324 15:04:36.154012    2810 scope.go:110] "RemoveContainer" containerID="46e2a098359ac60961b5721544ce72864b78e4398c850d781f154ad2b2142eaa"
worker1 crio[2768]: time="2023-03-24 15:04:36.154396485Z" level=info msg="Checking image status: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=75fad715-1daf-44c0-9438-5e03d9185268 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:04:36.154553989Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:ee9fe522f5a5c69d66dd5b0cde5adbc45d9f2f6fced9c1b306b9bad5f156df9e,RepoTags:[],RepoDigests:[docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6],Size_:280736887,Uid:&Int64Value{Value:1001,},Username:,Spec:nil,},Info:map[string]string{},}" id=75fad715-1daf-44c0-9438-5e03d9185268 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:04:36.155230998Z" level=info msg="Pulling image: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=ca820e12-2487-4e56-9df2-514c8bbdde31 name=/runtime.v1.ImageService/PullImage
worker1 crio[2768]: time="2023-03-24 15:04:36.156705936Z" level=info msg="Trying to access \"docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6\""
worker1 hyperkube[2810]: I0324 15:04:36.157881    2810 scope.go:110] "RemoveContainer" containerID="e3f8d9f6d776802e10b1a95932444e08ff222b59f69a53ca8e8f64fa62f8ff6d"
worker1 hyperkube[2810]: E0324 15:04:36.158112    2810 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"openshift-driver-toolkit-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=openshift-driver-toolkit-ctr pod=nvidia-driver-daemonset-411.86.202302141454-0-hll6f_nvidia-gpu-operator(ce2b634f-be05-4433-a462-41b37c048c5e)\"" pod="nvidia-gpu-operator/nvidia-driver-daemonset-411.86.202302141454-0-hll6f" podUID=ce2b634f-be05-4433-a462-41b37c048c5e
worker1 crio[2768]: time="2023-03-24 15:04:36.203887685Z" level=info msg="Pulled image: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=ca820e12-2487-4e56-9df2-514c8bbdde31 name=/runtime.v1.ImageService/PullImage
worker1 crio[2768]: time="2023-03-24 15:04:36.204436264Z" level=info msg="Checking image status: docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6" id=aa1ffa81-519b-4f58-9fb0-9559353d04f1 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:04:36.204572412Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:ee9fe522f5a5c69d66dd5b0cde5adbc45d9f2f6fced9c1b306b9bad5f156df9e,RepoTags:[],RepoDigests:[docker.io/dellemc/dell-csm-operator@sha256:8a7c76d650fedac90b3abec82e35697bd622f6f97781dbf0590966c905c775e6],Size_:280736887,Uid:&Int64Value{Value:1001,},Username:,Spec:nil,},Info:map[string]string{},}" id=aa1ffa81-519b-4f58-9fb0-9559353d04f1 name=/runtime.v1.ImageService/ImageStatus
worker1 crio[2768]: time="2023-03-24 15:04:36.205125552Z" level=info msg="Creating container: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/manager" id=ef25e2cc-39cd-49b8-a061-0d8ccb126c28 name=/runtime.v1.RuntimeService/CreateContainer
worker1 crio[2768]: time="2023-03-24 15:04:36.205189998Z" level=warning msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
worker1 systemd[1]: Started crio-conmon-887a285c8a31812baa5edf6a1d0092aac1b1994617835eee0d8f89123165fe1c.scope.
worker1 systemd[1]: Started libcontainer container 887a285c8a31812baa5edf6a1d0092aac1b1994617835eee0d8f89123165fe1c.
worker1 systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
worker1 crio[2768]: time="2023-03-24 15:04:36.356233736Z" level=info msg="Created container 887a285c8a31812baa5edf6a1d0092aac1b1994617835eee0d8f89123165fe1c: dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/manager" id=ef25e2cc-39cd-49b8-a061-0d8ccb126c28 name=/runtime.v1.RuntimeService/CreateContainer
worker1 crio[2768]: time="2023-03-24 15:04:36.356568452Z" level=info msg="Starting container: 887a285c8a31812baa5edf6a1d0092aac1b1994617835eee0d8f89123165fe1c" id=9be52b3e-0b09-42a4-bc7b-b27c270244f1 name=/runtime.v1.RuntimeService/StartContainer
worker1 crio[2768]: time="2023-03-24 15:04:36.366899540Z" level=info msg="Started container" PID=159273 containerID=887a285c8a31812baa5edf6a1d0092aac1b1994617835eee0d8f89123165fe1c description=dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx/manager id=9be52b3e-0b09-42a4-bc7b-b27c270244f1 name=/runtime.v1.RuntimeService/StartContainer sandboxID=5bf793be3c87d07693546f288796dbec38d0cc48648bdfa4355af40bb597a45a
worker1 hyperkube[2810]: I0324 15:04:37.161318    2810 kubelet.go:2133] "SyncLoop (PLEG): event for pod" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx" event=&{ID:b51948d8-9fdb-4dee-b012-822fdcfd1fcc Type:ContainerStarted Data:887a285c8a31812baa5edf6a1d0092aac1b1994617835eee0d8f89123165fe1c}
worker1 hyperkube[2810]: I0324 15:04:37.161470    2810 kubelet.go:2205] "SyncLoop (probe)" probe="readiness" status="" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"
worker1 ovs-vswitchd[1352]: ovs|00203|connmgr|INFO|br-ex<->unix#489: 2 flow_mods in the last 0 s (2 adds)
worker1 ovs-vswitchd[1352]: ovs|00204|connmgr|INFO|br-ex<->unix#498: 2 flow_mods in the last 0 s (2 adds)
worker1 hyperkube[2810]: I0324 15:04:54.366191    2810 kubelet.go:2205] "SyncLoop (probe)" probe="readiness" status="ready" pod="dell-csm-operators/dell-csm-operator-controller-manager-5f8444d888-9ktqx"

Log from the manager

2023-03-24T16:00:37.873Z    DEBUG   workspace/main.go:79    Operator Version    {"TraceId": "main", "Version": "1.0.0", "Commit ID": "1005033e5e1ef9c8631827372fc3bde061cbbc4d", "Commit SHA": "Mon, 05 Dec 2022 19:46:52 UTC"}
2023-03-24T16:00:37.873Z    DEBUG   workspace/main.go:80    Go Version: go1.19.3    {"TraceId": "main"}
2023-03-24T16:00:37.873Z    DEBUG   workspace/main.go:81    Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0324 16:00:38.925370       1 request.go:665] Waited for 1.038475718s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/org.eclipse.che/v1
2023-03-24T16:00:43.576Z    INFO    workspace/main.go:93    Openshift environment   {"TraceId": "main"}
2023-03-24T16:00:43.578Z    INFO    workspace/main.go:132   Current kubernetes version is 1.24 which is a supported version     {"TraceId": "main"}
2023-03-24T16:00:43.578Z    INFO    workspace/main.go:143   Use ConfigDirectory /etc/config/dell-csm-operator   {"TraceId": "main"}
I0324 16:00:48.930166       1 request.go:665] Waited for 5.345922279s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/oadp.openshift.io/v1alpha1?timeout=32s
1.679673649284318e+09   INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
1.6796736492850852e+09  INFO    setup   starting manager
1.679673649285447e+09   INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.6796736492855089e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0324 16:00:49.285551       1 leaderelection.go:248] attempting to acquire leader lease dell-csm-operators/090cae6a.dell.com...
I0324 16:01:04.916619       1 leaderelection.go:258] successfully acquired lease dell-csm-operators/090cae6a.dell.com
1.6796736649167643e+09  DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"c247589b-9931-4258-849b-30ed51006236","apiVersion":"v1","resourceVersion":"1071672528"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5f8444d888-4c2j9_5e45092b-8ae5-426c-9ef9-b0198fb277cc became leader"}
1.679673664917458e+09   DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"a39a7677-05e0-47ca-8d6b-5179f2fab1a5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1071672530"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5f8444d888-4c2j9_5e45092b-8ae5-426c-9ef9-b0198fb277cc became leader"}
1.6796736649168289e+09  INFO    controller.containerstoragemodule   Starting EventSource    {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
1.6796736649174883e+09  INFO    controller.containerstoragemodule   Starting Controller {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
1.6796736650186627e+09  INFO    controller.containerstoragemodule   Starting workers    {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
2023-03-24T16:01:05.018Z    INFO    controllers/csm_controller.go:199   ################Starting Reconcile##############    {"TraceId": "isilon-1"}
2023-03-24T16:01:05.018Z    INFO    controllers/csm_controller.go:202   reconcile for   {"TraceId": "isilon-1", "Namespace": "dell-storage-powerscale", "Name": "isilon", "Attempt": 1}
2023-03-24T16:01:05.018Z    DEBUG   drivers/powerscale.go:79    preCheck    {"TraceId": "isilon-1", "skipCertValid": false, "certCount": 1, "secrets": 1}
I0324 16:01:20.585350       1 trace.go:205] Trace[443187817]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167 (24-Mar-2023 16:01:05.019) (total time: 15566ms):
Trace[443187817]: ---"Objects listed" error:<nil> 15451ms (16:01:20.470)
Trace[443187817]: [15.566040537s] [15.566040537s] END

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

  1. Create namespace
    apiVersion: v1
    kind: Namespace
    metadata:
      annotations:
        argocd.argoproj.io/sync-wave: "-3"
        openshift.io/description: |
          The Dell CSM Operator enhances the Dell CSI Operator
        openshift.io/display-name: Dell CSM Operators
      name: dell-csm-operators
      labels:
        openshift.io/cluster-monitoring: "true"
  2. Create operatorgroup
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: dell-csm-operator-certified
      namespace: dell-csm-operators
      annotations:
        argocd.argoproj.io/sync-wave: "-2"
    spec:
      targetNamespaces:
        - dell-csm-operators
  3. Create Subscription
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: dell-csm-operator-subscription
      namespace: dell-csm-operators
      annotations:
        argocd.argoproj.io/sync-wave: "-1"
    spec:
      channel: stable
      installPlanApproval: Manual
      name: dell-csm-operator-certified
      source: certified-operator-catalog
      sourceNamespace: openshift-marketplace
  4. Manually approve the pending install plan to complete the Operator installation https://docs.openshift.com/container-platform/4.11/operators/admin/olm-upgrading-operators.html#olm-approving-pending-upgrade_olm-upgrading-operators
  5. Create a ContainerStorageModule for PowerScale with Observability active. Used https://github.com/dell/csm-operator/blob/3c456c45581d6b70d1ea4f54976974a361123bb1/samples/storage_csm_powerscale_v250.yaml as a template:

    apiVersion: storage.dell.com/v1
    kind: ContainerStorageModule
    metadata:
      name: isilon
      namespace: dell-storage-powerscale
    spec:
      driver:
        common:
          image: "dellemc/csi-isilon:v2.5.0"
          imagePullPolicy: IfNotPresent
          envs:
            - name: X_CSI_VERBOSE
              value: "1"
            - name: X_CSI_ISI_PORT
              value: "8080"
            - name: X_CSI_ISI_NO_PROBE_ON_START
              value: "false"
            - name: X_CSI_ISI_AUTOPROBE
              value: "true"
            - name: X_CSI_ISI_AUTH_TYPE
              value: "0"
            - name: X_CSI_CUSTOM_TOPOLOGY_ENABLED
              value: "false"
            - name: X_CSI_MAX_PATH_LIMIT
              value: "192"
            - name: X_CSI_DEBUG
              value: "false"
            - name: "CERT_SECRET_COUNT"
              value: "1"
            - name: KUBELET_CONFIG_DIR
              value: /var/lib/kubelet
        configVersion: v2.5.0
        controller:
          envs:
            - name: X_CSI_ISI_QUOTA_ENABLED
              value: "false"
            - name: X_CSI_ISI_ACCESS_ZONE
              value: "zone"
            - name: X_CSI_HEALTH_MONITOR_ENABLED
              value: "false"
          nodeSelector:
            node-role.kubernetes.io/worker: ''
        dnsPolicy: ClusterFirst
        csiDriverSpec:
          fSGroupPolicy: ReadWriteOnceWithFSType
        csiDriverType: "isilon"
        forceRemoveDriver: true
        node:
          nodeSelector:
            lm.se/zone: internal
          envs:
            - name: X_CSI_ISILON_NFS_V3
              value: "false"
            - name: X_CSI_MAX_VOLUMES_PER_NODE
              value: "0"
            - name: X_CSI_HEALTH_MONITOR_ENABLED
              value: "false"
        replicas: 2
        sideCars:
          - name: common
            args:
              - '--leader-election-lease-duration=15s'
              - '--leader-election-renew-deadline=10s'
              - '--leader-election-retry-period=5s'
          - name: provisioner
            args:
              - '--volume-name-prefix=csipscale'
          - name: external-health-monitor
            enabled: false
            args: ["--monitor-interval=60s"]
      modules:
        - name: observability
          enabled: true
          configVersion: v1.4.0
          components:
            - name: topology
              enabled: true
              image: dellemc/csm-topology:v1.4.0
              envs:
                - name: "TOPOLOGY_LOG_LEVEL"
                  value: "INFO"
            - name: otel-collector
              enabled: true
              image: otel/opentelemetry-collector:0.42.0
            - name: metrics-powerscale
              enabled: true
              image: dellemc/csm-metrics-powerscale:v1.1.0
              envs:
                - name: "POWERSCALE_MAX_CONCURRENT_QUERIES"
                  value: "10"
                - name: "POWERSCALE_CAPACITY_METRICS_ENABLED"
                  value: "true"
                - name: "POWERSCALE_PERFORMANCE_METRICS_ENABLED"
                  value: "true"
                - name: "POWERSCALE_CLUSTER_CAPACITY_POLL_FREQUENCY"
                  value: "30"
                - name: "POWERSCALE_CLUSTER_PERFORMANCE_POLL_FREQUENCY"
                  value: "20"
                - name: "POWERSCALE_QUOTA_CAPACITY_POLL_FREQUENCY"
                  value: "30"
                - name: "ISICLIENT_INSECURE"
                  value: "true"
                - name: "ISICLIENT_AUTH_TYPE"
                  value: "1"
                - name: "ISICLIENT_VERBOSE"
                  value: "0"
                - name: "POWERSCALE_LOG_LEVEL"
                  value: "INFO"
                - name: "POWERSCALE_LOG_FORMAT"
                  value: "TEXT"
                - name: "COLLECTOR_ADDRESS"
                  value: "otel-collector:55680"

Expected Behavior

CSI Driver for PowerScale initialized and running with Observability activated

CSM Driver(s)

CSI Driver for PowerScale v2.5.0

Installation Type

https://catalog.redhat.com/software/containers/dellemc/dell-csm-operator/63a2b0f2d571d0d25998cc55

Container Storage Modules Enabled

Observability v1.4.0

Container Orchestrator

OpenShift 4.11

Operating System

CoreOS for OpenShift 4.11

csmbot commented 1 year ago

@grvn: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and assign an appropriate priority label.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

grvn commented 1 year ago

After modifying the Subscription so the operator gets roughly 500% more resources than the default, the observability stack gets created in the karavi namespace without the operator being OOMKilled:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: dell-csm-operator-subscription
  namespace: dell-csm-operators
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  channel: stable
  installPlanApproval: Manual
  name: dell-csm-operator-certified
  source: certified-operator-catalog
  sourceNamespace: openshift-marketplace
  config:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 400Mi
grvn commented 1 year ago

Have I misunderstood this operator?

I've been digging through the Dell git repos and found information about installing karavi modules, but nothing about installing them with the operator. After giving the operator a lot of resources, I find myself with deployments in the karavi namespace that are failing because of missing secrets and things like that. I understand that "Project Karavi" = "Dell Container Storage Module", but it seems that not everything has made it into the operator.

Am I supposed to run the Helm chart https://github.com/dell/helm-charts/tree/main/charts/karavi-observability first in order to have observability installed, and then install the ContainerStorageModule through the operator?

And the same question goes for the rest of the modules: am I supposed to run the Helm chart ...... to install each module and then install the ContainerStorageModule through the operator?

rensyct commented 1 year ago

Hi @grvn. Currently the Container Storage Modules operator supports 3 modules (Authorization, Observability, and Replication) when the PowerScale driver is installed via the Container Storage Modules operator. The steps mentioned in https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/observability/ can be followed to install the Observability module for the PowerScale driver via the Container Storage Modules operator.

As per the documentation, creating the namespace karavi is the first prerequisite for installing the Observability module for the PowerScale driver.

grvn commented 1 year ago

Hi @rensyct, I missed that documentation page. That explains why the operator insists on that namespace.

Regarding the other step, "Install cert-manager": is it possible to use the existing internal cert-manager in OpenShift, or does the operator have a dependency on a specific CRD that jetstack/cert-manager provides?

rensyct commented 1 year ago

Hi @grvn. We can install the Observability module via Helm charts or via the csm-operator.

grvn commented 1 year ago

Hi @rensyct

I realize that I should follow those docs. I'm wondering if it is possible to replace the jetstack cert-manager with the OpenShift certificate manager: https://docs.openshift.com/container-platform/4.12/security/certificates/service-serving-certificate.html

Something like

  1. annotate service service.beta.openshift.io/serving-cert-secret-name=otel-collector-tls
  2. annotate service service.beta.openshift.io/serving-cert-secret-name=karavi-topology-tls
  3. create configmap
  4. annotate configmap service.beta.openshift.io/inject-cabundle=true

Steps 1 and 2 create 2 different secrets with certificates for *.<service.name>.<service.namespace>.svc

apiVersion: v1
kind: Secret
type: kubernetes.io/tls
metadata:
  name: ....
data:
  tls.crt: <certificate>
  tls.key: <privateKey>

and steps 3 and 4 create a ConfigMap with the CA, which can be mounted into whatever pod wants to talk to the services.
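
For illustration, steps 3 and 4 could look something like this sketch (the ConfigMap name is just an example, not something the module requires; the OpenShift service CA operator injects the service-ca.crt key once the annotation is present):

apiVersion: v1
kind: ConfigMap
metadata:
  name: openshift-service-ca-bundle   # example name only
  namespace: karavi
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"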

rensyct commented 1 year ago

Hi @grvn, we don't have a dependency in code on the CRD from jetstack/cert-manager. What we check in the code is whether the secret has a specific name for each component. We have not tested the Observability module with the OpenShift certificate manager, so we are not sure if the Observability module will work as expected by following the steps that you listed above.

grvn commented 1 year ago

Hi @rensyct, then I might play around with it and see if I can get observability to work with the OpenShift certificate manager. Maybe a pull request to the samples if I get it working.

Back to the problem with the OOMKill. I've set enabled: false on all observability module parts in the ContainerStorageModule:

spec:
  modules:
    - components:
        - envs:
            - name: PROXY_HOST
              value: ''
            - name: SKIP_CERTIFICATE_VALIDATION
              value: 'true'
          image: >-
            dellemc/csm-authorization-sidecar:v1.5.0
          name: karavi-authorization-proxy
      configVersion: v1.5.0
      enabled: false
      name: authorization
    - components:
        - envs:
            - name: X_CSI_REPLICATION_PREFIX
              value: replication.storage.dell.com
            - name: X_CSI_REPLICATION_CONTEXT_PREFIX
              value: powerscale
          image: 'dellemc/dell-csi-replicator:v1.3.0'
          name: dell-csi-replicator
        - envs:
            - name: TARGET_CLUSTERS_IDS
              value: cluster-151
            - name: REPLICATION_CTRL_LOG_LEVEL
              value: debug
            - name: REPLICATION_CTRL_REPLICAS
              value: '1'
            - name: RETRY_INTERVAL_MIN
              value: 1s
            - name: RETRY_INTERVAL_MAX
              value: 5m
          image: >-
            dellemc/dell-replication-controller:v1.3.1
          name: dell-replication-controller-manager
      configVersion: v1.3.0
      enabled: false
      name: replication
    - components:
        - enabled: false
          envs:
            - name: TOPOLOGY_LOG_LEVEL
              value: INFO
          image: 'dellemc/csm-topology:v1.4.0'
          name: topology
        - enabled: false
          envs:
            - name: NGINX_PROXY_IMAGE
              value: >-
                nginxinc/nginx-unprivileged:1.20
          image: >-
            otel/opentelemetry-collector:0.42.0
          name: otel-collector
        - enabled: false
          envs:
            - name: POWERSCALE_MAX_CONCURRENT_QUERIES
              value: '10'
            - name: POWERSCALE_CAPACITY_METRICS_ENABLED
              value: 'true'
            - name: POWERSCALE_PERFORMANCE_METRICS_ENABLED
              value: 'true'
            - name: POWERSCALE_CLUSTER_CAPACITY_POLL_FREQUENCY
              value: '30'
            - name: POWERSCALE_CLUSTER_PERFORMANCE_POLL_FREQUENCY
              value: '20'
            - name: POWERSCALE_QUOTA_CAPACITY_POLL_FREQUENCY
              value: '30'
            - name: ISICLIENT_INSECURE
              value: 'true'
            - name: ISICLIENT_AUTH_TYPE
              value: '1'
            - name: ISICLIENT_VERBOSE
              value: '0'
            - name: POWERSCALE_LOG_LEVEL
              value: INFO
            - name: POWERSCALE_LOG_FORMAT
              value: TEXT
            - name: COLLECTOR_ADDRESS
              value: 'otel-collector:55680'
          image: >-
            dellemc/csm-metrics-powerscale:v1.1.0
          name: metrics-powerscale
      configVersion: v1.4.0
      enabled: false
      name: observability

But the Dell CSM Operator still gets OOMKilled... it seems like it tries to delete the observability resources that it couldn't create and can't find the isilon-controller. Log from the manager container when it gets OOMKilled:

2023-03-28T13:33:30.076Z    DEBUG   workspace/main.go:79    Operator Version    {"TraceId": "main", "Version": "1.0.0", "Commit ID": "1005033e5e1ef9c8631827372fc3bde061cbbc4d", "Commit SHA": "Mon, 05 Dec 2022 19:46:52 UTC"}
2023-03-28T13:33:30.076Z    DEBUG   workspace/main.go:80    Go Version: go1.19.3    {"TraceId": "main"}
2023-03-28T13:33:30.076Z    DEBUG   workspace/main.go:81    Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0328 13:33:31.128292       1 request.go:665] Waited for 1.036833615s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/rbac.authorization.k8s.io/v1
2023-03-28T13:33:35.779Z    INFO    workspace/main.go:93    Openshift environment   {"TraceId": "main"}
2023-03-28T13:33:35.781Z    INFO    workspace/main.go:132   Current kubernetes version is 1.24 which is a supported version     {"TraceId": "main"}
2023-03-28T13:33:35.781Z    INFO    workspace/main.go:143   Use ConfigDirectory /etc/config/dell-csm-operator   {"TraceId": "main"}
I0328 13:33:41.133268       1 request.go:665] Waited for 5.346299682s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/operator.openshift.io/v1?timeout=32s
1.6800104215010788e+09  INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
1.680010421501916e+09   INFO    setup   starting manager
1.6800104215023746e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.6800104215024383e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0328 13:33:41.502480       1 leaderelection.go:248] attempting to acquire leader lease dell-csm-operators/090cae6a.dell.com...
I0328 13:33:58.661152       1 leaderelection.go:258] successfully acquired lease dell-csm-operators/090cae6a.dell.com
1.680010438661202e+09   DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"c247589b-9931-4258-849b-30ed51006236","apiVersion":"v1","resourceVersion":"1078814203"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-7rflk_17299c00-70d3-4cbc-a8a0-2d8f913a533b became leader"}
1.6800104386613238e+09  DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"a39a7677-05e0-47ca-8d6b-5179f2fab1a5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1078814204"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-7rflk_17299c00-70d3-4cbc-a8a0-2d8f913a533b became leader"}
1.680010438661414e+09   INFO    controller.containerstoragemodule   Starting EventSource    {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
1.6800104386614475e+09  INFO    controller.containerstoragemodule   Starting Controller {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
1.6800104387621608e+09  INFO    controller.containerstoragemodule   Starting workers    {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
2023-03-28T13:33:58.762Z    INFO    controllers/csm_controller.go:199   ################Starting Reconcile##############    {"TraceId": "isilon-1"}
2023-03-28T13:33:58.762Z    INFO    controllers/csm_controller.go:202   reconcile for   {"TraceId": "isilon-1", "Namespace": "dell-storage-powerscale", "Name": "isilon", "Attempt": 1}
2023-03-28T13:33:58.762Z    DEBUG   drivers/powerscale.go:79    preCheck    {"TraceId": "isilon-1", "skipCertValid": false, "certCount": 1, "secrets": 1}
2023-03-28T13:34:01.763Z    INFO    controllers/csm_controller.go:1110  proceeding with modification of driver install  {"TraceId": "isilon-1"}
2023-03-28T13:34:01.767Z    INFO    controllers/csm_controller.go:1047  Owner reference is found and matches    {"TraceId": "isilon-1"}
2023-03-28T13:34:01.767Z    INFO    utils/status.go:63  deployment status for cluster: default-source-cluster   {"TraceId": "isilon-1"}
2023-03-28T13:34:01.969Z    INFO    utils/status.go:79  driver type: isilon {"TraceId": "isilon-1"}
2023-03-28T13:34:03.370Z    INFO    utils/status.go:152 
daemonset status for cluster: default-source-cluster    {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:200 daemonset status available pods 7   {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:201 daemonset status failedCount pods 0 {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:202 daemonset status desired pods 7 {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:229 deployment controllerReplicas [2]   {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:230 deployment controllerStatus.Available [0]   {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:232 daemonset expected [7]  {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:233 daemonset nodeStatus.Available [7]  {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:239 calculate overall state [Failed]    {"TraceId": "isilon-1"}
################End Reconcile##############
2023-03-28T13:34:03.471Z    INFO    utils/status.go:260 Driver State    {"TraceId": "isilon-1", "Controller": {"available":"0","desired":"2","failed":"0"}, "Node": {"available":"7","desired":"7","failed":"0"}}
2023-03-28T13:34:03.471Z    INFO    utils/status.go:330 HandleSuccess Driver state  {"TraceId": "isilon-1", "newStatus.State": "Failed"}
2023-03-28T13:34:03.471Z    INFO    controllers/csm_controller.go:836   Getting isilon CSI Driver for Dell Technologies {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z    DEBUG   drivers/commonconfig.go:285 GetConfigMap    {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/driver-config-params.yaml"}
2023-03-28T13:34:03.472Z    DEBUG   drivers/commonconfig.go:317 GetCSIDriver    {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/csidriver.yaml"}
2023-03-28T13:34:03.473Z    DEBUG   drivers/commonconfig.go:339 GetCSIDriver    {"TraceId": "isilon-1", "fsGroupPolicy": "ReadWriteOnceWithFSType"}
2023-03-28T13:34:03.473Z    DEBUG   drivers/commonconfig.go:153 GetNode {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/node.yaml"}
2023-03-28T13:34:03.478Z    INFO    drivers/commonconfig.go:208 Container to be enabled {"TraceId": "isilon-1", "name": "registrar"}
2023-03-28T13:34:03.478Z    DEBUG   drivers/commonconfig.go:40  GetController   {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/controller.yaml"}
2023-03-28T13:34:03.482Z    INFO    drivers/commonconfig.go:93  Container to be enabled {"TraceId": "isilon-1", "name": "resizer"}
2023-03-28T13:34:03.482Z    INFO    drivers/commonconfig.go:93  Container to be enabled {"TraceId": "isilon-1", "name": "attacher"}
2023-03-28T13:34:03.482Z    INFO    drivers/commonconfig.go:97  Container to be removed {"TraceId": "isilon-1", "name": "external-health-monitor"}
2023-03-28T13:34:03.482Z    INFO    drivers/commonconfig.go:93  Container to be enabled {"TraceId": "isilon-1", "name": "provisioner"}
2023-03-28T13:34:03.482Z    INFO    drivers/commonconfig.go:93  Container to be enabled {"TraceId": "isilon-1", "name": "snapshotter"}
2023-03-28T13:34:03.482Z    INFO    controllers/csm_controller.go:519   Checking if standalone modules need clean up    {"TraceId": "isilon-1"}
2023-03-28T13:34:03.482Z    INFO    controllers/csm_controller.go:580   Deleting observability  {"TraceId": "isilon-1"}
2023-03-28T13:34:03.482Z    INFO    controllers/csm_controller.go:765   reconcile topology  {"TraceId": "isilon-1"}

but running kubectl get pods or oc get pods shows that the isilon-controller pods are running:

oc get pods -n dell-storage-powerscale
NAME                                 READY   STATUS 
isilon-controller-7969c65c8d-7bxmb   5/5     Running
isilon-controller-7969c65c8d-dz2gx   5/5     Running
isilon-node-6kldf                    2/2     Running
isilon-node-7hrxz                    2/2     Running
isilon-node-bggvz                    2/2     Running
isilon-node-bm66m                    2/2     Running
isilon-node-fkqlx                    2/2     Running
isilon-node-jddzp                    2/2     Running
isilon-node-jnsf5                    2/2     Running
isilon-node-kh48b                    2/2     Running
isilon-node-mkb4j                    2/2     Running

Have I killed the operator so badly that it can't recover because of the missing karavi namespace, or am I missing something? Can I remove the Operator, the namespace, and all the other resources, then reinstall it without breaking the PVs already using the StorageClass with provisioner: csi-isilon.dellemc.com?

rensyct commented 1 year ago

Hi @grvn, please help schedule a Zoom call so that I can take a look at your environment with respect to the Operator. I work in the IST timezone.

grvn commented 1 year ago

I got observability working by using the OpenShift certificate manager. It was quite simple: I created these 2 services in the karavi namespace, which triggered the OpenShift certificate manager to create the secrets otel-collector-tls and karavi-topology-tls:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: otel-collector-tls
    argocd.argoproj.io/sync-wave: "-1"
  name: otel-collector
  namespace: karavi
spec:
  type: ClusterIP
  ports:
    - port: 55680
      targetPort: 55680
      name: receiver
    - port: 8443
      targetPort: 8443
      name: exporter-https
  selector:
    app.kubernetes.io/name: otel-collector
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: karavi-topology-tls
    argocd.argoproj.io/sync-wave: "-1"
  name: karavi-topology
  namespace: karavi
spec:
  type: ClusterIP
  ports:
  - name: karavi-topology
    port: 8443
    targetPort: 8443
  selector:
    app.kubernetes.io/name: karavi-topology

Then I reactivated the observability module. With the namespace present and the secrets created by the OpenShift certificate manager, the Dell CSM Operator started behaving a bit better. All the already present PVs seem to work without any issues. I still need to give the Dell CSM Operator more resources than what OpenShift gives an operator by default (OpenShift gives an operator a 256 MB RAM limit and the Dell CSM Operator peaks at 370-400 MB of RAM).

Still a bit interesting that it peaks above 2 GB of RAM when the karavi namespace is missing and the observability module is activated, but that was my usage error.

Now the observability seems to be doing stuff...

I don't know if the certificates are all working, since the different parts of the module seem to have other errors, so a pull request to the samples will have to wait.

I don't know if I should keep this issue open, since this has drifted a bit from its original topic; maybe it is better if I create a new issue about the OOMKill part?

rensyct commented 1 year ago

Thank you @grvn for the update. The actual cause of the issue reported here was that the karavi namespace was not created prior to installing the module. Currently everything works as expected once the namespace is created. Please confirm if my understanding is correct.

grvn commented 1 year ago

@rensyct - that sounds correct. This issue was created because the Dell CSM Operator didn't create all the resources that the optional modules needed, and instead of creating them it started eating a lot of memory and got itself OOMKilled.

The "new" issue is that the Dell CSM Operator uses more resources (memory) than it asks for when the observability module is activated which triggers the OOMKiller to kill the "manager" container in the "dell-csm-operator-controller-manager" pod.

rensyct commented 1 year ago

Thank you @grvn for confirming this. If so, please close this ticket and open another ticket for the Dell CSM Operator using more resources when Observability is enabled.

grvn commented 1 year ago

Closing in favor of https://github.com/dell/csm/issues/738