elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Global Operator fails with OOMKilled on OpenShift #1416

Closed ron1 closed 4 years ago

ron1 commented 5 years ago

Bug Report

What did you do? Deployed ECK Operator 0.9.0-RC3 Global Operator on OCP 3.11

What did you expect to see? ECK Global Operator Pod with Status "Running"

What did you see instead? Under which circumstances? ECK Global Operator Pod with Status "OOMKilled/CrashLoopBackOff"

Environment OCP 3.11.98

Is it possible that the Global Operator is still using the Process Manager which is problematic on CentOS7/RHEL7 kernels?

BTW, the Namespace Operator OOMKilled issue seems to be fixed in this release. Does the Namespace Operator require the Global Operator if only the Basic License is being used?

pebrc commented 5 years ago

The process manager was only used inside Elasticsearch Pods and it has been removed in https://github.com/elastic/cloud-on-k8s/commit/f2b5288f9fbfa2f219f7478259775c47dd221c3a

Does the Namespace Operator require the Global Operator if only the Basic License is being used?

No, it does not, but that might change in the future. The idea behind the global operator was to have some cross-cutting concerns running only there, which would also allow us to restrict the privileges of the namespace operators much more. You can also deploy the operator as a single process that has all roles if you want (that is also the variant we use in the 'quick start' documentation).

ron1 commented 5 years ago

Any ideas what would cause the Global Operator on 0.9.0-RC3 to misbehave on CentOS7/RHEL7 in the same way the old 0.8.0 Namespace Operator did? As I mentioned, the 0.9.0-RC3 Namespace Operator no longer seems to misbehave. Is it a coincidence that the Namespace Operator fix seemed to occur right around the time the Process Manager was removed?

barkbay commented 5 years ago

Could you provide more details about the Pod that is OOMKilled (kubectl get ... -o yaml)? Could you also provide the OOM killer logs, which are available in the kernel log?
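
For readers hitting the same problem, the details requested above could be gathered roughly as sketched below. This is only a sketch: the Pod name and namespace are assumptions based on the default names that appear in this thread, the two function names are hypothetical helpers, and the grep pattern is a guess at typical OOM killer wording that may differ between kernels.

```shell
# Sketch: collect diagnostics for an OOMKilled operator Pod.
# Assumption: POD and NAMESPACE match the default global operator names
# seen in this thread; override them for your own cluster.
POD="${POD:-elastic-global-operator-0}"
NAMESPACE="${NAMESPACE:-elastic-system}"

# Hypothetical helper: Pod manifest plus the last termination reason.
collect_pod_details() {
  # Full manifest, including resource limits and container statuses.
  kubectl get pod "$POD" -n "$NAMESPACE" -o yaml
  # Reason the container was last terminated (OOMKilled in this report).
  kubectl get pod "$POD" -n "$NAMESPACE" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Hypothetical helper: OOM killer messages from the kernel log.
# Run this on the node hosting the Pod; it usually needs root.
collect_oom_logs() {
  dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'
}
```

The block only defines the helpers. Call `collect_pod_details` from a machine with cluster access and `collect_oom_logs` on the node itself; on OpenShift, `oc` accepts the same arguments as `kubectl` here.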

I have been running the latest release candidate of ECK (0.9.0-RC7) for a few hours on OpenShift 3.11 and I can't reproduce your issue.

$ uname -a
Linux k8s-michael-master-01 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ oc get pods --all-namespaces |grep elastic
elastic-namespace-operators         elastic-namespace-operator-0               1/1       Running   0          5h
elastic-system                      elastic-global-operator-0                  1/1       Running   0          5h
elastic                             elasticsearch-sample-es-5tsqghmm79         1/1       Running   0          5h
elastic                             elasticsearch-sample-es-6qk52mz5jk         1/1       Running   0          5h
elastic                             elasticsearch-sample-es-dg4vvpm2mr         1/1       Running   0          5h
elastic                             kibana-sample-kb-97c6b6b8d-lqfd2           1/1       Running   0          5h

ron1 commented 5 years ago

The global operator pod failed to deploy due to a template bug here: https://github.com/elastic/cloud-on-k8s/blob/40e85d85e6403847bcaf6f910843040934646fea/operators/config/operator/global/operator.template.yaml#L39

I fixed the problem by making the following change to the file operator.template.yaml.

Original:

        resources:
          limits:
            cpu: 1
            memory: 100Mi

Revised:

        resources:
          limits:
            cpu: 1
            memory: 2Gi
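
The change above could be applied to a local copy of the template along these lines. This is a minimal sketch, not the project's documented procedure: `TEMPLATE` is a hypothetical scratch file standing in for operator.template.yaml, and 2Gi is simply the value chosen in this comment, not a recommended limit.

```shell
# Sketch: raise the memory limit in a local copy of the operator template.
# Assumption: TEMPLATE is a scratch file that stands in for
# operator.template.yaml; point it at your real copy instead.
TEMPLATE="$(mktemp)"
cat > "$TEMPLATE" <<'EOF'
        resources:
          limits:
            cpu: 1
            memory: 100Mi
EOF

# Rewrite only the 100Mi memory limit; the cpu line is left untouched.
sed 's/memory: 100Mi$/memory: 2Gi/' "$TEMPLATE" > "$TEMPLATE.new" \
  && mv "$TEMPLATE.new" "$TEMPLATE"

# Read the value back to confirm the edit took effect.
NEW_LIMIT="$(sed -n 's/^ *memory: //p' "$TEMPLATE")"
echo "memory limit is now: $NEW_LIMIT"
```

After editing the real template, the operator manifest still has to be regenerated and reapplied for the new limit to reach the running Pod.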

tomqwu commented 4 years ago

The out-of-the-box operator doesn't work on OpenShift. I get the same OOMKilled error and have to change the resources in order for it to run.