Thanks for the report. Out of curiosity:
I've been seeing the same issue over the past few days.
In my case, the increased memory usage seems to be correlated to an increase in the number of nodes being managed by the operator (currently 10).
On average, it takes about 30-40 minutes for the pod to be killed due to OOM errors.
Increasing the limit seems to have resolved things for me as well.
@philrhinehart Can you provide more information about the workload? Is it just one Elasticsearch cluster of 10 nodes managed by the operator? What is the Kubernetes cluster utilization? If it is possible to do so without revealing any sensitive information, can you provide the manifest for Elasticsearch as well?
I am closing this issue for now since our internal testing couldn't reproduce the problem. If anybody else experiences the same problem, please re-open this issue and provide details about the environment.
My elastic operator kept getting OOMKilled too. After removing the resource requests/limits it is sitting at 188MB of memory. One cluster, one node, set up following https://www.elastic.co/guide/en/cloud-on-k8s/1.0/k8s-quickstart.html
@masterkain can you provide more details about your environment? Kubernetes version, self-hosted or cloud, other workloads in the cluster, whether this is a fresh install of ECK, etc.
Happening to me too.
ECK Version: 1.0.0-beta1
15:29:32 ❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-aae39f", GitCommit:"aae39f4697508697bf16c0de4a5687d464f4da81", GitTreeState:"clean", BuildDate:"2019-12-23T08:19:12Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
AWS EKS. The operator is running a single 3-node Elasticsearch cluster, a 1-node APM server, and Kibana.
Containers:
  manager:
    Container ID:   docker://f2b005e92e2423b1ee0b9bb829bf20a60500233007409f114e39e4f9b0744823
    Image:          docker.elastic.co/eck/eck-operator:1.0.0-beta1
    Image ID:       docker-pullable://docker.elastic.co/eck/eck-operator@sha256:1b612a5ae47fb93144d0ab1dea658c94e87e9eedc9449552fabad2205eee3ed8
    Port:           9876/TCP
    Host Port:      0/TCP
    Args:
      manager
      --operator-roles
      all
      --enable-debug-logs=false
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 11 Feb 2020 10:00:58 -0500
      Finished:     Tue, 11 Feb 2020 10:01:00 -0500
    Ready:          False
    Restart Count:  8
    Limits:
      cpu:     1
      memory:  150Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      OPERATOR_NAMESPACE:  elastic-system (v1:metadata.namespace)
      WEBHOOK_SECRET:      webhook-server-secret
      WEBHOOK_PODS_LABEL:  elastic-operator
      OPERATOR_IMAGE:      docker.elastic.co/eck/eck-operator:1.0.0-beta1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from elastic-operator-token-gdnj4 (ro)
If there is anything else I can provide, let me know.
@mcfearsome Multiple changes that reduce memory usage landed in v1.0.0 (especially https://www.elastic.co/guide/en/cloud-on-k8s/1.0-beta/release-highlights-1.0.0-beta1.html#k8s_memory_leak_in_the_eck_process); I would recommend upgrading and/or increasing the memory limit.
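For anyone who prefers to raise the limit in place instead of upgrading, here is a minimal sketch of the relevant part of the operator's pod spec (namespace elastic-system and container name "manager" are taken from this thread). The 1Gi value is only an illustrative choice, not an official recommendation:

# Hypothetical edit to the elastic-operator pod template; adjust the
# values to what your environment actually needs.
spec:
  template:
    spec:
      containers:
        - name: manager
          resources:
            limits:
              cpu: 1
              memory: 1Gi      # raised from the 150Mi limit shown above
            requests:
              cpu: 100m
              memory: 150Mi

Applying an edited copy of the all-in-one manifest, or editing the live object with kubectl edit, should both achieve this.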
Same problem here. I couldn't deploy the CRD because the resulting pod was continually killed by OOM.
Which is strange, because on a 2-node Kubernetes cluster the operator had no issues, while on this new cluster with 7 Kubernetes nodes and more CPU/RAM it gets killed every 30 seconds...
I removed the limits/requests section for now and everything seems to be back to normal.
@Docteur-RS can you provide more details about your Kubernetes environment? Which version of ECK are you using?
@sebgl Using ECK 1.0 on-premise. Kubernetes: 1.16.7. Each Kubernetes node has about 16 GiB of RAM.
I updated the default resources to the following and it did not work either:
resources:
  limits:
    cpu: 1
    memory: 350Mi
  requests:
    cpu: 500m
    memory: 300Mi
I pretty much doubled everything...
In the end I just commented out the whole "resources" section and it fixed the OOM.
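For anyone following along, this is roughly what that looks like in the operator's pod template (a sketch, not the exact manifest). Without a memory limit the container is no longer capped by its own memory cgroup, so the kernel only OOM-kills it under genuine node memory pressure:

# Sketch: resources block disabled in the operator's pod template.
containers:
  - name: manager
    # resources:
    #   limits:
    #     cpu: 1
    #     memory: 350Mi
    #   requests:
    #     cpu: 500m
    #     memory: 300Mi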
Could you give us more information about the OS you are using:
lsb_release -a
or cat /etc/os-release
uname -a
dmesg (the OOM killer output looks like this):
[3458703.013937] Task in ... killed as a result of limit of ....
[3458703.039061] memory: usage x, limit x, failcnt x
[3458703.044979] memory+swap: usage x, limit x, failcnt x
[3458703.051495] kmem: usage x, limit x, failcnt 0
[3458703.058078] Memory cgroup stats for x ... active_file:0KB unevictable:0KB
....
[3458703.135532] oom_reaper: reaped process x (x), now anon-rss:x, file-rss:x, shmem-rss:0kB
@barkbay
Centos 7
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
I applied the limits/requests that failed and these are the logs I got:
kgp -n elastic-system -w
NAME READY STATUS RESTARTS AGE
elastic-operator-0 0/1 OOMKilled 2 55s
Two stack traces that appear multiple times in dmesg:
[767251.824704] Tasks state (memory values in pages):
[767251.825073] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[767251.825763] [ 26660] 0 26660 255 1 32768 0 -998 pause
[767251.826394] [ 27554] 101 27554 169888 94675 966656 0 982 elastic-operato
[767251.827344] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=d4f18256793a4c968094ffdac57fe51a09c439a2e9afe04aefec738b513d8005,mems_allowed=0,oom_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b,task_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b/d4f18256793a4c968094ffdac57fe51a09c439a2e9afe04aefec738b513d8005,task=elastic-operato,pid=27554,uid=101
[767251.830085] Memory cgroup out of memory: Killed process 27554 (elastic-operato) total-vm:679552kB, anon-rss:354896kB, file-rss:23804kB, shmem-rss:0kB, UID:101 pgtables:944kB oom_score_adj:982
[767251.841491] oom_reaper: reaped process 27554 (elastic-operato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[767291.775788] elastic-operato invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=982
[767291.776576] CPU: 2 PID: 28225 Comm: elastic-operato Not tainted 5.5.9-1.el7.elrepo.x86_64 #1
[767291.777262] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[767291.777908] Call Trace:
[767291.778202] dump_stack+0x6d/0x98
[767291.778548] dump_header+0x51/0x210
[767291.778852] oom_kill_process+0x102/0x130
[767291.779194] out_of_memory+0x105/0x510
[767291.779517] mem_cgroup_out_of_memory+0xb9/0xd0
[767291.779888] try_charge+0x756/0x7c0
[767291.780211] ? __alloc_pages_nodemask+0x16c/0x320
[767291.780727] mem_cgroup_try_charge+0x72/0x1e0
[767291.781221] mem_cgroup_try_charge_delay+0x22/0x50
[767291.781750] do_anonymous_page+0x11a/0x650
[767291.782265] handle_pte_fault+0x2a8/0xad0
[767291.782754] __handle_mm_fault+0x4a8/0x680
[767291.783223] ? __switch_to_asm+0x40/0x70
[767291.783639] handle_mm_fault+0xea/0x200
[767291.784010] __do_page_fault+0x225/0x490
[767291.784458] do_page_fault+0x36/0x120
[767291.784845] page_fault+0x3e/0x50
[767291.785220] RIP: 0033:0x46055f
[767291.785572] Code: 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 c5 fd e7 07 c5 fd e7 4f 20 c5 fd e7 57 40 <c5> fd e7 5f 60 48 81 c7 80 00 00 00 48 81 eb 80 00 00 00 77 b5 0f
[767291.787044] RSP: 002b:000000c000847118 EFLAGS: 00010202
[767291.787501] RAX: 0000000007fffe00 RBX: 0000000000bafde0 RCX: 000000c018e2be00
[767291.788085] RDX: 000000000ffffe00 RSI: 000000c01027c020 RDI: 000000c01827bfa0
[767291.788710] RBP: 000000c000847160 R08: 000000c010e2c000 R09: 0000000000000000
[767291.789327] R10: 0000000000000020 R11: 0000000000000202 R12: 0000000000000002
[767291.789937] R13: 00000000025731c0 R14: 000000000045eea0 R15: 0000000000000000
[767291.790745] memory: usage 358400kB, limit 358400kB, failcnt 1512
[767291.791234] memory+swap: usage 358400kB, limit 9007199254740988kB, failcnt 0
[767291.791823] kmem: usage 2744kB, limit 9007199254740988kB, failcnt 0
[767291.792299] Memory cgroup stats for /kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b:
[767291.794894] anon 363929600
[767291.809817] Tasks state (memory values in pages):
[767291.810587] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[767291.811981] [ 26660] 0 26660 255 1 32768 0 -998 pause
[767291.813170] [ 28205] 101 28205 169888 94762 958464 0 982 elastic-operato
[767291.814479] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=fe06b305e97236ed3bceebfcf354d2aed79b729003caade99fd193d605c79407,mems_allowed=0,oom_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b,task_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b/fe06b305e97236ed3bceebfcf354d2aed79b729003caade99fd193d605c79407,task=elastic-operato,pid=28205,uid=101
[767291.818732] Memory cgroup out of memory: Killed process 28205 (elastic-operato) total-vm:679552kB, anon-rss:354792kB, file-rss:24256kB, shmem-rss:0kB, UID:101 pgtables:936kB oom_score_adj:982
[767291.827011] oom_reaper: reaped process 28205 (elastic-operato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Centos 7
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
The default kernel for CentOS 7 is 3.10 (CentOS 8 is bundled with 4.18). Kernel 5.5 was only released in January. Any reason not to use the default one? Kubernetes and container runtimes rely on low-level kernel features (like cgroups). I would not advise using anything other than the kernel provided by default for your distribution.
Hum... if I remember correctly we updated the kernel version because Cilium (our Kubernetes CNI) needed BPF features that were not available in the default kernel we had.
Though I checked the kernel version on the cluster where the operator works correctly:
Linux p5vm7 5.4.12-1.el7.elrepo.x86_64 #1 SMP Tue Jan 14 16:02:20 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
And on the one it's not:
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Small difference, but maybe that's all it takes.
I think you can use CentOS 8 if you want to use Cilium on CentOS. I'm closing this issue because I'm not sure we will be able to help with this kind of configuration (old distro + very recent kernel).
Just wanted to add some metric points here:
ECK 1.0 on GKE: the operator kept getting OOMKilled. On average it tries to use ~140Mi of memory, and I'm now trying to stabilize it with 200Mi of memory and guaranteed QoS.
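For context, Kubernetes assigns the Guaranteed QoS class only when every container's requests equal its limits for both CPU and memory; Burstable pods get a high oom_score_adj (the 982 visible in the dmesg output above), which makes them preferred OOM-kill targets. A minimal sketch of the setup described in this comment, where the CPU figure is an assumed value since only the 200Mi of memory was mentioned:

resources:
  limits:
    cpu: 1          # assumed; requests must equal limits for Guaranteed QoS
    memory: 200Mi
  requests:
    cpu: 1
    memory: 200Mi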
We increased the default memory limits in https://github.com/elastic/cloud-on-k8s/pull/3046, which should be included in the next release, so the operator works out of the box in more environments.
Bug Report
What did you do?
I simply followed the instructions here: https://www.elastic.co/elasticsearch-kubernetes
What did you expect to see? An Elasticsearch node up and running.
What did you see instead? Under which circumstances?
I noticed the elastic-operator pod being OOMKilled:
I noticed the memory limit set for the operator is also small (only 100Mb) https://github.com/elastic/cloud-on-k8s/blob/master/operators/config/operator/all-in-one/operator.template.yaml#L39
Environment
Script version:
https://download.elastic.co/downloads/eck/0.9.0/all-in-one.yaml
Kubernetes 1.13.5
Version information:
https://download.elastic.co/downloads/eck/0.9.0/all-in-one.yaml
EC2 on AWS (not EKS), using Rancher 2.2.2. Kubernetes 1.13.5.
Resource definition:
Logs:
Other notes: Interestingly enough, I used the same operator on a different cluster on DigitalOcean, and there it didn't need more than the 100Mb limit.
I have now increased the limit on my machine to 500M and it works well (I probably could have done with less).
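For completeness, the change described above amounts to roughly this in the downloaded all-in-one.yaml before applying it. The exact field layout of the 0.9.0 template may differ, and the request values shown are assumptions:

# In the elastic-operator StatefulSet inside all-in-one.yaml:
resources:
  limits:
    cpu: 1
    memory: 500Mi   # the reporter's value; the shipped default was much lower
  requests:
    cpu: 100m       # assumed; keep whatever the template ships with
    memory: 100Mi   # assumed
# Then re-apply the manifest:
#   kubectl apply -f all-in-one.yaml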