Closed grvn closed 1 year ago
@grvn: Thank you for submitting this issue!
The issue is currently awaiting triage. Please make sure you have given us as much context as possible.
If the maintainers determine this is a relevant issue, they will remove the needs-triage label and assign an appropriate priority label.
We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.
After modifying the Subscription so the operator gets roughly 500% more resources than the default, the observability stack seems to be created in the karavi namespace without the operator getting OOMKilled.
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: dell-csm-operator-subscription
  namespace: dell-csm-operators
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  channel: stable
  installPlanApproval: Manual
  name: dell-csm-operator-certified
  source: certified-operator-catalog
  sourceNamespace: openshift-marketplace
  config:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 400Mi
```
Have I misunderstood this operator?
I've been digging through the Dell git repos and found information about installing karavi modules, but nothing about installing them with the operator. After giving the operator a lot of resources, I find myself with deployments in the karavi namespace that are failing because of missing secrets and the like. I understand that "Project Karavi" = "Dell Container Storage Module", but it seems that not everything has made it into the operator.
Am I supposed to run the Helm chart https://github.com/dell/helm-charts/tree/main/charts/karavi-observability first in order to have observability installed, and then install the ContainerStorageModule through the operator?
And the same question goes for the rest of the modules:
Am I supposed to run the Helm chart ...
Hi @grvn Currently the Container Storage Modules operator supports 3 modules (Authorization, Observability, and Replication) when the PowerScale driver is installed via the Container Storage Modules operator. The steps in https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/observability/ can be followed to install the Observability module for the PowerScale driver via the Container Storage Modules operator.
As per the documentation, creating the karavi namespace is the first prerequisite for installing the Observability module for the PowerScale driver.
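For reference, that prerequisite can be satisfied with a plain manifest like the sketch below (nothing here is assumed beyond the namespace name karavi from the docs above; any labels or annotations your cluster policy requires are omitted):

```yaml
# Prerequisite for the Observability module: the operator expects
# this namespace to exist before the module is enabled.
apiVersion: v1
kind: Namespace
metadata:
  name: karavi
```

Applying it with `kubectl apply -f namespace.yaml` (or `oc apply`) before enabling the module should satisfy the prerequisite.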
Hi @rensyct I missed that documentation page. That explains why the operator insists on that namespace.
Regarding the other step, "Install cert-manager": is it possible to use the existing internal cert-manager in OpenShift, or does the operator have a dependency on a specific CRD that jetstack/cert-manager provides?
Hi @grvn We can install the Observability module via Helm charts or via csm-operator.
The steps to install via Helm charts are listed in https://dell.github.io/csm-docs/docs/observability/deployment/
The steps to install via csm-operator are listed in https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/observability/
Hi @rensyct
I realize that I should follow those docs. I'm wondering if it is possible to replace the jetstack cert-manager with OpenShift's built-in certificate manager (service serving certificates): https://docs.openshift.com/container-platform/4.12/security/certificates/service-serving-certificate.html
Something like this:
steps 1 and 2 create two different secrets with certificates for *.<service.name>.<service.namespace>.svc
```yaml
apiVersion: v1
kind: Secret
type: kubernetes.io/tls
metadata:
  name: ....
data:
  tls.crt: <certificate>
  tls.key: <privateKey>
```
and steps 3 and 4 create a ConfigMap with the CA, which can be mounted into whatever pod wants to talk to the services.
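A minimal sketch of what such a ConfigMap could look like (the name karavi-service-ca is made up for illustration; the inject-cabundle annotation is OpenShift's documented service CA mechanism, which fills in a service-ca.crt key on an otherwise empty ConfigMap):

```yaml
# Empty ConfigMap; the OpenShift service CA operator injects the
# CA bundle under the key "service-ca.crt" when this annotation is set.
apiVersion: v1
kind: ConfigMap
metadata:
  name: karavi-service-ca  # hypothetical name, for illustration
  namespace: karavi
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
```

Pods that need to verify the serving certificates can then mount this ConfigMap and point their CA path at service-ca.crt.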
Hi @grvn, we don't have a dependency in code on a CRD from jetstack/cert-manager. What we check in the code is whether a secret with a specific name exists for each component. We have not tested the Observability module with the OpenShift certificate manager, so we are not sure if the Observability module will work as expected by following the steps that you listed above.
Hi @rensyct, then I might play around with it and see if I can get Observability to work with the OpenShift certificate manager. Maybe I'll open a pull request to samples if I get it working.
Back to the problem with the OOMKilled operator.
I've set enabled: false on all observability module parts in the ContainerStorageModule:
```yaml
spec:
  modules:
    - components:
        - envs:
            - name: PROXY_HOST
              value: ''
            - name: SKIP_CERTIFICATE_VALIDATION
              value: 'true'
          image: dellemc/csm-authorization-sidecar:v1.5.0
          name: karavi-authorization-proxy
      configVersion: v1.5.0
      enabled: false
      name: authorization
    - components:
        - envs:
            - name: X_CSI_REPLICATION_PREFIX
              value: replication.storage.dell.com
            - name: X_CSI_REPLICATION_CONTEXT_PREFIX
              value: powerscale
          image: dellemc/dell-csi-replicator:v1.3.0
          name: dell-csi-replicator
        - envs:
            - name: TARGET_CLUSTERS_IDS
              value: cluster-151
            - name: REPLICATION_CTRL_LOG_LEVEL
              value: debug
            - name: REPLICATION_CTRL_REPLICAS
              value: '1'
            - name: RETRY_INTERVAL_MIN
              value: 1s
            - name: RETRY_INTERVAL_MAX
              value: 5m
          image: dellemc/dell-replication-controller:v1.3.1
          name: dell-replication-controller-manager
      configVersion: v1.3.0
      enabled: false
      name: replication
    - components:
        - enabled: false
          envs:
            - name: TOPOLOGY_LOG_LEVEL
              value: INFO
          image: dellemc/csm-topology:v1.4.0
          name: topology
        - enabled: false
          envs:
            - name: NGINX_PROXY_IMAGE
              value: nginxinc/nginx-unprivileged:1.20
          image: otel/opentelemetry-collector:0.42.0
          name: otel-collector
        - enabled: false
          envs:
            - name: POWERSCALE_MAX_CONCURRENT_QUERIES
              value: '10'
            - name: POWERSCALE_CAPACITY_METRICS_ENABLED
              value: 'true'
            - name: POWERSCALE_PERFORMANCE_METRICS_ENABLED
              value: 'true'
            - name: POWERSCALE_CLUSTER_CAPACITY_POLL_FREQUENCY
              value: '30'
            - name: POWERSCALE_CLUSTER_PERFORMANCE_POLL_FREQUENCY
              value: '20'
            - name: POWERSCALE_QUOTA_CAPACITY_POLL_FREQUENCY
              value: '30'
            - name: ISICLIENT_INSECURE
              value: 'true'
            - name: ISICLIENT_AUTH_TYPE
              value: '1'
            - name: ISICLIENT_VERBOSE
              value: '0'
            - name: POWERSCALE_LOG_LEVEL
              value: INFO
            - name: POWERSCALE_LOG_FORMAT
              value: TEXT
            - name: COLLECTOR_ADDRESS
              value: 'otel-collector:55680'
          image: dellemc/csm-metrics-powerscale:v1.1.0
          name: metrics-powerscale
      configVersion: v1.4.0
      enabled: false
      name: observability
```
But the Dell CSM Operator still gets OOMKilled... it seems like it tries to delete the observability resources that it couldn't create, and it can't find the isilon-controller.
Log from the manager container when it gets OOMKilled:
2023-03-28T13:33:30.076Z DEBUG workspace/main.go:79 Operator Version {"TraceId": "main", "Version": "1.0.0", "Commit ID": "1005033e5e1ef9c8631827372fc3bde061cbbc4d", "Commit SHA": "Mon, 05 Dec 2022 19:46:52 UTC"}
2023-03-28T13:33:30.076Z DEBUG workspace/main.go:80 Go Version: go1.19.3 {"TraceId": "main"}
2023-03-28T13:33:30.076Z DEBUG workspace/main.go:81 Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0328 13:33:31.128292 1 request.go:665] Waited for 1.036833615s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/rbac.authorization.k8s.io/v1
2023-03-28T13:33:35.779Z INFO workspace/main.go:93 Openshift environment {"TraceId": "main"}
2023-03-28T13:33:35.781Z INFO workspace/main.go:132 Current kubernetes version is 1.24 which is a supported version {"TraceId": "main"}
2023-03-28T13:33:35.781Z INFO workspace/main.go:143 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
I0328 13:33:41.133268 1 request.go:665] Waited for 5.346299682s due to client-side throttling, not priority and fairness, request: GET:https://10.168.0.1:443/apis/operator.openshift.io/v1?timeout=32s
1.6800104215010788e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.680010421501916e+09 INFO setup starting manager
1.6800104215023746e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.6800104215024383e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0328 13:33:41.502480 1 leaderelection.go:248] attempting to acquire leader lease dell-csm-operators/090cae6a.dell.com...
I0328 13:33:58.661152 1 leaderelection.go:258] successfully acquired lease dell-csm-operators/090cae6a.dell.com
1.680010438661202e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"c247589b-9931-4258-849b-30ed51006236","apiVersion":"v1","resourceVersion":"1078814203"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-7rflk_17299c00-70d3-4cbc-a8a0-2d8f913a533b became leader"}
1.6800104386613238e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"dell-csm-operators","name":"090cae6a.dell.com","uid":"a39a7677-05e0-47ca-8d6b-5179f2fab1a5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1078814204"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-5ddbc7d9dc-7rflk_17299c00-70d3-4cbc-a8a0-2d8f913a533b became leader"}
1.680010438661414e+09 INFO controller.containerstoragemodule Starting EventSource {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
1.6800104386614475e+09 INFO controller.containerstoragemodule Starting Controller {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
1.6800104387621608e+09 INFO controller.containerstoragemodule Starting workers {"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
2023-03-28T13:33:58.762Z INFO controllers/csm_controller.go:199 ################Starting Reconcile############## {"TraceId": "isilon-1"}
2023-03-28T13:33:58.762Z INFO controllers/csm_controller.go:202 reconcile for {"TraceId": "isilon-1", "Namespace": "dell-storage-powerscale", "Name": "isilon", "Attempt": 1}
2023-03-28T13:33:58.762Z DEBUG drivers/powerscale.go:79 preCheck {"TraceId": "isilon-1", "skipCertValid": false, "certCount": 1, "secrets": 1}
2023-03-28T13:34:01.763Z INFO controllers/csm_controller.go:1110 proceeding with modification of driver install {"TraceId": "isilon-1"}
2023-03-28T13:34:01.767Z INFO controllers/csm_controller.go:1047 Owner reference is found and matches {"TraceId": "isilon-1"}
2023-03-28T13:34:01.767Z INFO utils/status.go:63 deployment status for cluster: default-source-cluster {"TraceId": "isilon-1"}
2023-03-28T13:34:01.969Z INFO utils/status.go:79 driver type: isilon {"TraceId": "isilon-1"}
2023-03-28T13:34:03.370Z INFO utils/status.go:152 daemonset status for cluster: default-source-cluster {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:200 daemonset status available pods 7 {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:201 daemonset status failedCount pods 0 {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:202 daemonset status desired pods 7 {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:229 deployment controllerReplicas [2] {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:230 deployment controllerStatus.Available [0] {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:232 daemonset expected [7] {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:233 daemonset nodeStatus.Available [7] {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z INFO utils/status.go:239 calculate overall state [Failed] {"TraceId": "isilon-1"}
################End Reconcile##############
2023-03-28T13:34:03.471Z INFO utils/status.go:260 Driver State {"TraceId": "isilon-1", "Controller": {"available":"0","desired":"2","failed":"0"}, "Node": {"available":"7","desired":"7","failed":"0"}}
2023-03-28T13:34:03.471Z INFO utils/status.go:330 HandleSuccess Driver state {"TraceId": "isilon-1", "newStatus.State": "Failed"}
2023-03-28T13:34:03.471Z INFO controllers/csm_controller.go:836 Getting isilon CSI Driver for Dell Technologies {"TraceId": "isilon-1"}
2023-03-28T13:34:03.471Z DEBUG drivers/commonconfig.go:285 GetConfigMap {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/driver-config-params.yaml"}
2023-03-28T13:34:03.472Z DEBUG drivers/commonconfig.go:317 GetCSIDriver {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/csidriver.yaml"}
2023-03-28T13:34:03.473Z DEBUG drivers/commonconfig.go:339 GetCSIDriver {"TraceId": "isilon-1", "fsGroupPolicy": "ReadWriteOnceWithFSType"}
2023-03-28T13:34:03.473Z DEBUG drivers/commonconfig.go:153 GetNode {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/node.yaml"}
2023-03-28T13:34:03.478Z INFO drivers/commonconfig.go:208 Container to be enabled {"TraceId": "isilon-1", "name": "registrar"}
2023-03-28T13:34:03.478Z DEBUG drivers/commonconfig.go:40 GetController {"TraceId": "isilon-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/powerscale/v2.5.0/controller.yaml"}
2023-03-28T13:34:03.482Z INFO drivers/commonconfig.go:93 Container to be enabled {"TraceId": "isilon-1", "name": "resizer"}
2023-03-28T13:34:03.482Z INFO drivers/commonconfig.go:93 Container to be enabled {"TraceId": "isilon-1", "name": "attacher"}
2023-03-28T13:34:03.482Z INFO drivers/commonconfig.go:97 Container to be removed {"TraceId": "isilon-1", "name": "external-health-monitor"}
2023-03-28T13:34:03.482Z INFO drivers/commonconfig.go:93 Container to be enabled {"TraceId": "isilon-1", "name": "provisioner"}
2023-03-28T13:34:03.482Z INFO drivers/commonconfig.go:93 Container to be enabled {"TraceId": "isilon-1", "name": "snapshotter"}
2023-03-28T13:34:03.482Z INFO controllers/csm_controller.go:519 Checking if standalone modules need clean up {"TraceId": "isilon-1"}
2023-03-28T13:34:03.482Z INFO controllers/csm_controller.go:580 Deleting observability {"TraceId": "isilon-1"}
2023-03-28T13:34:03.482Z INFO controllers/csm_controller.go:765 reconcile topology {"TraceId": "isilon-1"}
But running `kubectl get pods` or `oc get pods` shows that the isilon-controller pods are running:
```
oc get pods -n dell-storage-powerscale
NAME                                 READY   STATUS
isilon-controller-7969c65c8d-7bxmb   5/5     Running
isilon-controller-7969c65c8d-dz2gx   5/5     Running
isilon-node-6kldf                    2/2     Running
isilon-node-7hrxz                    2/2     Running
isilon-node-bggvz                    2/2     Running
isilon-node-bm66m                    2/2     Running
isilon-node-fkqlx                    2/2     Running
isilon-node-jddzp                    2/2     Running
isilon-node-jnsf5                    2/2     Running
isilon-node-kh48b                    2/2     Running
isilon-node-mkb4j                    2/2     Running
```
Have I broken the operator so badly that it can't recover because of the missing karavi namespace, or am I missing something?
Can I remove the operator, the namespace, and all the other resources, and then reinstall without breaking the PVs that already use the storageClass with provisioner: csi-isilon.dellemc.com?
Hi @grvn Please help schedule a Zoom call so that I can take a look at your environment with respect to the operator. I work in the IST timezone.
I got Observability working by using the OpenShift certificate manager.
It was quite simple: I created these two Services in the karavi namespace, which triggered OpenShift to create the secrets otel-collector-tls and karavi-topology-tls:
```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: otel-collector-tls
    argocd.argoproj.io/sync-wave: "-1"
  name: otel-collector
  namespace: karavi
spec:
  type: ClusterIP
  ports:
    - port: 55680
      targetPort: 55680
      name: receiver
    - port: 8443
      targetPort: 8443
      name: exporter-https
  selector:
    app.kubernetes.io/name: otel-collector
```
```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: karavi-topology-tls
    argocd.argoproj.io/sync-wave: "-1"
  name: karavi-topology
  namespace: karavi
spec:
  type: ClusterIP
  ports:
    - name: karavi-topology
      port: 8443
      targetPort: 8443
  selector:
    app.kubernetes.io/name: karavi-topology
```
Then I reactivated the observability module. With the namespace present and the secrets created by the OpenShift certificate manager, the Dell CSM Operator started behaving a bit better. All the already-present PVs seem to work without any issues. I still need to give the Dell CSM Operator more resources than what OpenShift gives an operator by default (OpenShift gives an operator a limit of 256 MB of RAM, and the Dell CSM Operator peaks at 370-400 MB of RAM).
It is still a bit interesting that it peaks above 2 GB of RAM when the karavi namespace is missing and the observability module is activated, but that was my usage error.
Now the observability seems to be doing stuff...
I don't know if the certificates are all working, since the different parts of the module seem to have other errors, so a pull request to samples will have to wait.
I don't know if I should keep this issue open since this is a bit off topic from the original issue; maybe it's better if I create a new one about the OOMKill part?
Thank you @grvn for the update. The actual cause of the issue reported here was that the karavi namespace was not created prior to installing the module. Currently everything works as expected once the namespace is created. Please confirm if my understanding is correct.
@rensyct - that sounds correct. This issue was created because the Dell CSM Operator didn't create all the resources that the optional modules needed; instead of creating them, it started eating a lot of memory and got itself OOMKilled.
The "new" issue is that the Dell CSM Operator uses more resources (memory) than it asks for when the observability module is activated which triggers the OOMKiller to kill the "manager" container in the "dell-csm-operator-controller-manager" pod.
Thank you @grvn for confirming this. If so, please close this ticket and open another ticket for the Dell CSM Operator using more resources when Observability is enabled.
Closing in favor of https://github.com/dell/csm/issues/738
Bug Description
Installing the Dell Container Storage Modules Operator, creating a ContainerStorageModule for PowerScale v2.5, and activating Observability v1.4.0 makes the "manager" container in the "dell-csm-operator-controller-manager" pod use more memory than its limit, which triggers the OOMKiller. Even if I increase the limit to 1 GB of memory, it still gets OOMKilled.
The problem seems to be that the operator fails to create the namespace "karavi".
Workaround: Manually create the namespace "karavi"
Logs
journal from the node
Log from the manager
Screenshots
No response
Additional Environment Information
No response
Steps to Reproduce
Create a ContainerStorageModule for PowerScale with Observability active. Used https://github.com/dell/csm-operator/blob/3c456c45581d6b70d1ea4f54976974a361123bb1/samples/storage_csm_powerscale_v250.yaml as a template.
Expected Behavior
CSI Driver for PowerScale initialized and running with Observability activated
CSM Driver(s)
CSI Driver for PowerScale v2.5.0
Installation Type
https://catalog.redhat.com/software/containers/dellemc/dell-csm-operator/63a2b0f2d571d0d25998cc55
Container Storage Modules Enabled
Observability v1.4.0
Container Orchestrator
OpenShift 4.11
Operating System
CoreOS for OpenShift 4.11