Thanks for the report. Out of curiosity:
I've been seeing the same issue over the past few days.
In my case, the increased memory usage seems to be correlated to an increase in the number of nodes being managed by the operator (currently 10).
On average, it takes about 30-40 minutes for the pod to be killed due to OOM errors.
Increasing the limit seems to have resolved things for me as well.
@philrhinehart Can you provide more information about the workload? Is it just one Elasticsearch cluster of 10 nodes managed by the operator? What is the Kubernetes cluster utilization? If it is possible to do so without revealing any sensitive information, can you provide the manifest for Elasticsearch as well?
I am closing this issue for now since our internal testing couldn't reproduce the problem. If anybody else experiences the same problem, please re-open this issue and provide details about the environment.
My elastic operator kept getting OOMKilled too. After removing the resource requests/limits it is sitting at 188MB of memory. One cluster, one node, set up following https://www.elastic.co/guide/en/cloud-on-k8s/1.0/k8s-quickstart.html
@masterkain can you provide more details about your environment? Kubernetes version, self-hosted or cloud, other workloads in the cluster, whether this is a fresh install of ECK, etc.
Happening to me too.
ECK Version: 1.0.0-beta1
15:29:32 ❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-aae39f", GitCommit:"aae39f4697508697bf16c0de4a5687d464f4da81", GitTreeState:"clean", BuildDate:"2019-12-23T08:19:12Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
AWS EKS. The operator is running a single 3-node Elasticsearch cluster, a 1-node APM server, and Kibana.
Containers:
  manager:
    Container ID:   docker://f2b005e92e2423b1ee0b9bb829bf20a60500233007409f114e39e4f9b0744823
    Image:          docker.elastic.co/eck/eck-operator:1.0.0-beta1
    Image ID:       docker-pullable://docker.elastic.co/eck/eck-operator@sha256:1b612a5ae47fb93144d0ab1dea658c94e87e9eedc9449552fabad2205eee3ed8
    Port:           9876/TCP
    Host Port:      0/TCP
    Args:
      manager
      --operator-roles
      all
      --enable-debug-logs=false
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 11 Feb 2020 10:00:58 -0500
      Finished:     Tue, 11 Feb 2020 10:01:00 -0500
    Ready:          False
    Restart Count:  8
    Limits:
      cpu:     1
      memory:  150Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      OPERATOR_NAMESPACE:  elastic-system (v1:metadata.namespace)
      WEBHOOK_SECRET:      webhook-server-secret
      WEBHOOK_PODS_LABEL:  elastic-operator
      OPERATOR_IMAGE:      docker.elastic.co/eck/eck-operator:1.0.0-beta1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from elastic-operator-token-gdnj4 (ro)
If there is anything else I can provide, let me know.
@mcfearsome Multiple changes that reduce memory usage landed in v1.0.0 (especially https://www.elastic.co/guide/en/cloud-on-k8s/1.0-beta/release-highlights-1.0.0-beta1.html#k8s_memory_leak_in_the_eck_process); I would recommend upgrading and/or increasing the memory limit.
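For anyone who prefers to raise the limit in place instead of upgrading, here is a minimal sketch of the relevant part of the operator's pod spec (namespace elastic-system and container name "manager" are taken from this thread). The 1Gi value is only an illustrative choice, not an official recommendation:

# Hypothetical edit to the elastic-operator pod template; adjust the
# values to what your environment actually needs.
spec:
  template:
    spec:
      containers:
        - name: manager
          resources:
            limits:
              cpu: 1
              memory: 1Gi      # raised from the 150Mi limit shown above
            requests:
              cpu: 100m
              memory: 150Mi

Applying an edited copy of the all-in-one manifest, or editing the live object with kubectl edit, should both achieve this.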
Same problem here. I couldn't deploy the CRD because the resulting pod was continually killed by OOM.
Which is strange, because on a 2-node Kubernetes cluster the operator had no issues, while on this new cluster with 7 Kubernetes nodes and more CPU/RAM it gets killed every 30 seconds...
I removed the limits/requests section for now and everything seems to be back to normal.
@Docteur-RS can you provide more details about your Kubernetes environment? Which version of ECK are you using?
@sebgl Using ECK 1.0 on-premise. Kubernetes: 1.16.7. Each Kubernetes node has about 16 GiB of RAM.
I updated the default resources to the following and it did not work either:
resources:
  limits:
    cpu: 1
    memory: 350Mi
  requests:
    cpu: 500m
    memory: 300Mi
I pretty much doubled everything...
In the end I just commented out the whole "resources" section and it fixed the OOM.
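For anyone following along, this is roughly what that looks like in the operator's pod template (a sketch, not the exact manifest). Without a memory limit the container is no longer capped by its own memory cgroup, so the kernel only OOM-kills it under genuine node memory pressure:

# Sketch: resources block disabled in the operator's pod template.
containers:
  - name: manager
    # resources:
    #   limits:
    #     cpu: 1
    #     memory: 350Mi
    #   requests:
    #     cpu: 500m
    #     memory: 300Mi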
Could you give us more information about the OS you are using:
lsb_release -a
or cat /etc/os-release
uname -a
dmesg (the OOM killer output looks like this):
[3458703.013937] Task in ... killed as a result of limit of ....
[3458703.039061] memory: usage x, limit x, failcnt x
[3458703.044979] memory+swap: usage x, limit x, failcnt x
[3458703.051495] kmem: usage x, limit x, failcnt 0
[3458703.058078] Memory cgroup stats for x ... active_file:0KB unevictable:0KB
....
[3458703.135532] oom_reaper: reaped process x (x), now anon-rss:x, file-rss:x, shmem-rss:0kB
@barkbay
Centos 7
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
I applied the limits/requests that failed and these are the logs I got:
kgp -n elastic-system -w
NAME READY STATUS RESTARTS AGE
elastic-operator-0 0/1 OOMKilled 2 55s
Two stack traces that appear multiple times in dmesg:
[767251.824704] Tasks state (memory values in pages):
[767251.825073] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[767251.825763] [ 26660] 0 26660 255 1 32768 0 -998 pause
[767251.826394] [ 27554] 101 27554 169888 94675 966656 0 982 elastic-operato
[767251.827344] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=d4f18256793a4c968094ffdac57fe51a09c439a2e9afe04aefec738b513d8005,mems_allowed=0,oom_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b,task_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b/d4f18256793a4c968094ffdac57fe51a09c439a2e9afe04aefec738b513d8005,task=elastic-operato,pid=27554,uid=101
[767251.830085] Memory cgroup out of memory: Killed process 27554 (elastic-operato) total-vm:679552kB, anon-rss:354896kB, file-rss:23804kB, shmem-rss:0kB, UID:101 pgtables:944kB oom_score_adj:982
[767251.841491] oom_reaper: reaped process 27554 (elastic-operato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[767291.775788] elastic-operato invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=982
[767291.776576] CPU: 2 PID: 28225 Comm: elastic-operato Not tainted 5.5.9-1.el7.elrepo.x86_64 #1
[767291.777262] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[767291.777908] Call Trace:
[767291.778202] dump_stack+0x6d/0x98
[767291.778548] dump_header+0x51/0x210
[767291.778852] oom_kill_process+0x102/0x130
[767291.779194] out_of_memory+0x105/0x510
[767291.779517] mem_cgroup_out_of_memory+0xb9/0xd0
[767291.779888] try_charge+0x756/0x7c0
[767291.780211] ? __alloc_pages_nodemask+0x16c/0x320
[767291.780727] mem_cgroup_try_charge+0x72/0x1e0
[767291.781221] mem_cgroup_try_charge_delay+0x22/0x50
[767291.781750] do_anonymous_page+0x11a/0x650
[767291.782265] handle_pte_fault+0x2a8/0xad0
[767291.782754] __handle_mm_fault+0x4a8/0x680
[767291.783223] ? __switch_to_asm+0x40/0x70
[767291.783639] handle_mm_fault+0xea/0x200
[767291.784010] __do_page_fault+0x225/0x490
[767291.784458] do_page_fault+0x36/0x120
[767291.784845] page_fault+0x3e/0x50
[767291.785220] RIP: 0033:0x46055f
[767291.785572] Code: 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 c5 fd e7 07 c5 fd e7 4f 20 c5 fd e7 57 40 <c5> fd e7 5f 60 48 81 c7 80 00 00 00 48 81 eb 80 00 00 00 77 b5 0f
[767291.787044] RSP: 002b:000000c000847118 EFLAGS: 00010202
[767291.787501] RAX: 0000000007fffe00 RBX: 0000000000bafde0 RCX: 000000c018e2be00
[767291.788085] RDX: 000000000ffffe00 RSI: 000000c01027c020 RDI: 000000c01827bfa0
[767291.788710] RBP: 000000c000847160 R08: 000000c010e2c000 R09: 0000000000000000
[767291.789327] R10: 0000000000000020 R11: 0000000000000202 R12: 0000000000000002
[767291.789937] R13: 00000000025731c0 R14: 000000000045eea0 R15: 0000000000000000
[767291.790745] memory: usage 358400kB, limit 358400kB, failcnt 1512
[767291.791234] memory+swap: usage 358400kB, limit 9007199254740988kB, failcnt 0
[767291.791823] kmem: usage 2744kB, limit 9007199254740988kB, failcnt 0
[767291.792299] Memory cgroup stats for /kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b:
[767291.794894] anon 363929600
[767291.809817] Tasks state (memory values in pages):
[767291.810587] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[767291.811981] [ 26660] 0 26660 255 1 32768 0 -998 pause
[767291.813170] [ 28205] 101 28205 169888 94762 958464 0 982 elastic-operato
[767291.814479] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=fe06b305e97236ed3bceebfcf354d2aed79b729003caade99fd193d605c79407,mems_allowed=0,oom_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b,task_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b/fe06b305e97236ed3bceebfcf354d2aed79b729003caade99fd193d605c79407,task=elastic-operato,pid=28205,uid=101
[767291.818732] Memory cgroup out of memory: Killed process 28205 (elastic-operato) total-vm:679552kB, anon-rss:354792kB, file-rss:24256kB, shmem-rss:0kB, UID:101 pgtables:936kB oom_score_adj:982
[767291.827011] oom_reaper: reaped process 28205 (elastic-operato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Centos 7
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
The default kernel for CentOS 7 is 3.10 (CentOS 8 is bundled with 4.18). Kernel 5.5 was only released in January. Any reason not to use the default one? Kubernetes and container runtimes rely on low-level kernel features (like cgroups). I would not advise using anything other than the kernel provided by default for your distribution.
Hum... if I remember correctly we updated the kernel version because Cilium (our Kubernetes CNI) needed BPF features that were not available in the default kernel we had.
Though I checked the kernel version on the cluster where the operator works correctly:
Linux p5vm7 5.4.12-1.el7.elrepo.x86_64 #1 SMP Tue Jan 14 16:02:20 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
And on the one it's not:
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Small difference, but maybe that's all it takes.
I think you can use CentOS 8 if you want to use Cilium on CentOS. I'm closing this issue because I'm not sure we will be able to help with this kind of configuration (old distro + very recent kernel).
Just wanted to add some metric points here:
ECK 1.0 on GKE: the operator kept getting OOMKilled. On average it tries to use ~140Mi of memory, and I'm now trying to stabilize it with 200Mi of memory and guaranteed QoS.
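For context, Kubernetes assigns the Guaranteed QoS class only when every container's requests equal its limits for both CPU and memory; Burstable pods get a high oom_score_adj (the 982 visible in the dmesg output above), which makes them preferred OOM-kill targets. A minimal sketch of the setup described in this comment, where the CPU figure is an assumed value since only the 200Mi of memory was mentioned:

resources:
  limits:
    cpu: 1          # assumed; requests must equal limits for Guaranteed QoS
    memory: 200Mi
  requests:
    cpu: 1
    memory: 200Mi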
We increased the default memory limits in https://github.com/elastic/cloud-on-k8s/pull/3046, which should be included in the next release, so the operator works out of the box in more environments.
Bug Report
What did you do?
I simply followed the instructions here: https://www.elastic.co/elasticsearch-kubernetes
What did you expect to see? An Elasticsearch node up and running.
What did you see instead? Under which circumstances?
I noticed the elastic-operator pod being OOMKilled:
I noticed the memory limit set for the operator is also small (only 100Mb) https://github.com/elastic/cloud-on-k8s/blob/master/operators/config/operator/all-in-one/operator.template.yaml#L39
Environment
Script version:
https://download.elastic.co/downloads/eck/0.9.0/all-in-one.yaml
Kubernetes 1.13.5
Version information:
https://download.elastic.co/downloads/eck/0.9.0/all-in-one.yaml
EC2 on AWS (not EKS), using Rancher 2.2.2. Kubernetes 1.13.5.
Resource definition:
Logs:
Other notes: Interestingly enough, I used the same operator on a different cluster on DigitalOcean, and there it didn't need more than the 100Mb limit.
I have now increased the limit on my machine to 500M and it works well (I probably could have done with less).
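For completeness, the change described above amounts to roughly this in the downloaded all-in-one.yaml before applying it. The exact field layout of the 0.9.0 template may differ, and the request values shown are assumptions:

# In the elastic-operator StatefulSet inside all-in-one.yaml:
resources:
  limits:
    cpu: 1
    memory: 500Mi   # the reporter's value; the shipped default was much lower
  requests:
    cpu: 100m       # assumed; keep whatever the template ships with
    memory: 100Mi   # assumed
# Then re-apply the manifest:
#   kubectl apply -f all-in-one.yaml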