ComplianceAsCode / compliance-operator

Operator providing Kubernetes cluster compliance checks
Apache License 2.0

Operator Pod is OOMKilled on Openshift clusters #40

Closed bukovjanmic closed 2 years ago

bukovjanmic commented 2 years ago

We have compliance-operator 0.1.49 installed on OpenShift clusters (4.9.29).

On some clusters, the operator is OOMKilled with varying frequency, as it exceeds the 200Mi memory limit.

It seems this may be related to the number of worker nodes or namespaces. On larger clusters (20 worker nodes, hundreds of namespaces) the frequency is higher (currently 225 OOMKill restarts in 24 hours); on smaller clusters (5 worker nodes) it is lower (5 restarts).

Either there is a memory leak under some circumstances or the memory request/limit needs to be increased.

mrogers950 commented 2 years ago

Is the operator running with WATCH_NAMESPACE=""? (The OLM AllNamespaces installMode also sets this.) There's a comment in this area of the code that mentions a possible performance issue with watching all namespaces: https://github.com/ComplianceAsCode/compliance-operator/blob/38b928ec28a2a865d55a389c6974ee9d03545436/cmd/manager/operator.go#L177
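
For context, here is a minimal sketch of how an operator-sdk style main() typically scopes its cache through WATCH_NAMESPACE. This is not the operator's actual code; it assumes a controller-runtime version where manager.Options still has a Namespace field, which matches the era of these 0.1.x releases:

```go
// Minimal sketch (not the actual compliance-operator code) of how an
// operator-sdk style main() scopes its cache through WATCH_NAMESPACE,
// assuming a controller-runtime version where manager.Options still has
// a Namespace field.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// WATCH_NAMESPACE="" (set by the OLM AllNamespaces installMode) makes
	// the manager cache objects from every namespace, so memory use grows
	// with cluster size; a concrete value restricts the cache to one namespace.
	watchNamespace := os.Getenv("WATCH_NAMESPACE")

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		Namespace: watchNamespace, // "" -> all namespaces, otherwise a single namespace
	})
	if err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```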

Note that there's really no need to watch all namespaces since the operator API is cluster-wide and not intended for multi-tenant use (although there might be some case that I'm not aware of). OwnNamespace ensures WATCH_NAMESPACE="openshift-compliance", which should not have this problem.

We should probably force WATCH_NAMESPACE to the operator's own namespace when it is set to all namespaces, similar to what we did in file-integrity-operator (https://github.com/openshift/file-integrity-operator/pull/234). A sketch of such a guard is below.
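
A rough sketch of that kind of guard (hypothetical, not the actual file-integrity-operator change) could read the pod's own namespace from the mounted service-account file and fall back to it when WATCH_NAMESPACE is empty:

```go
// Hypothetical guard (a sketch, not the actual PR): when the operator is
// installed with AllNamespaces (WATCH_NAMESPACE=""), fall back to the
// namespace the operator pod itself runs in.
package main

import (
	"fmt"
	"os"
	"strings"
)

// The service account namespace file is mounted into every pod.
const namespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveWatchNamespace returns WATCH_NAMESPACE if set, otherwise the
// operator's own namespace, so the cache is never cluster-wide.
func resolveWatchNamespace() (string, error) {
	if ns := os.Getenv("WATCH_NAMESPACE"); ns != "" {
		return ns, nil
	}
	data, err := os.ReadFile(namespaceFile)
	if err != nil {
		return "", fmt.Errorf("WATCH_NAMESPACE unset and own namespace unknown: %w", err)
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	ns, err := resolveWatchNamespace()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("watching namespace:", ns)
}
```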

bukovjanmic commented 2 years ago

I confirm that limiting the operator to a single namespace seems to fix the OOMKill problem.