eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

Huge che-gateway RAM consumption (potential memory leak) #20606

Closed · ibuziuk closed 2 years ago

ibuziuk commented 3 years ago

Describe the bug

Che Gateway RAM consumption on production clusters is 5 GB and 12 GB


The gateway container consumes more than 6 GB of RAM vs. only 73 MB for the configbump container

Che version

7.34

Steps to reproduce

N/A

Expected behavior

RAM consumption is around 200-500 MB for the pod

Runtime

OpenShift

Screenshots

PROMQL

sum(container_memory_working_set_bytes{container!="", namespace='codeready-workspaces-operator'}) by (container)


Installation method

OperatorHub

Environment

Dev Sandbox (workspaces.openshift.com)

Eclipse Che Logs

configbump logs are pretty laconic

{"level":"info","ts":1632895846.6313307,"logger":"controller-runtime.controller","caller":"zapr@v0.1.0/zapr.go:69","msg":"Starting EventSource","controller":"config-bump","source":"kind source: /, Kind="}
{"level":"info","ts":1632895847.0319622,"logger":"controller-runtime.controller","caller":"zapr@v0.1.0/zapr.go:69","msg":"Starting Controller","controller":"config-bump"}
{"level":"info","ts":1632895847.13215,"logger":"controller-runtime.controller","caller":"zapr@v0.1.0/zapr.go:69","msg":"Starting workers","controller":"config-bump","worker count":1}

gateway logs have been shared internally

Additional context

No response

sparkoo commented 3 years ago

What is the gateway image please?

ibuziuk commented 3 years ago

@sparkoo the reproducer uses the productized image registry.redhat.io/codeready-workspaces/traefik-rhel8@sha256:6704bd086f0d971ecedc1dd6dc7a90429231fdfa86579e742705b31cbedbd8b2

sparkoo commented 3 years ago

Looks like it's based on the older Traefik 2.3.2. We're now using 2.5.0, and 2.5.3 has already been released upstream. This may be hard, as we will need to reproduce it, ideally on the latest upstream version, because the Traefik maintainers will tell us "update to latest first, then talk to us".

I'm interested whether restarting the pod lowers the memory consumption, or whether it quickly jumps back up to this high value.

Also, I think we will need to start monitoring the Traefik gateway to better understand what it is doing in our deployments: https://doc.traefik.io/traefik/observability/metrics/overview/
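
For reference, a minimal sketch of what enabling Traefik's built-in Prometheus metrics could look like in the gateway's static configuration - the entrypoint name and port below are illustrative assumptions, not what the Che gateway currently ships with:

```yaml
# Traefik v2 static configuration (illustrative values, not the actual Che gateway config)
entryPoints:
  metrics:
    address: ":8082"             # hypothetical dedicated port for /metrics

metrics:
  prometheus:
    entryPoint: metrics          # serve Prometheus metrics on the entrypoint above
    addEntryPointsLabels: true   # per-entrypoint request metrics
    addServicesLabels: true      # per-service request metrics
```

With something like this in place, the gateway's memory growth could be correlated with the request and connection metrics scraped by the cluster's Prometheus.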

ibuziuk commented 3 years ago

> ideally on the latest upstream version, because the Traefik maintainers will tell us "update to latest first, then talk to us".

ack, we are going to update to CRW 2.12 in the short term and will keep monitoring

> I'm interested whether restarting the pod lowers the memory consumption, or whether it quickly jumps back up to this high value.

restarting the pod lowers the memory consumption, however it grows slowly but steadily, e.g. after the restart on October 7th the pod is consuming almost 800 MB atm

ibuziuk commented 3 years ago

@sparkoo @skabashnyuk the issue is still there on CRW 2.12.1


skabashnyuk commented 3 years ago

@ibuziuk yeah, the picture in https://github.com/eclipse/che/issues/20606#issuecomment-959758445 is unpleasant. Can you set up some limits, let's say 4 GB, so Kubernetes will restart the pod by itself? Meanwhile, we will work on understanding why it is happening.
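
For reference, roughly what that could look like on the gateway container spec - only the 4Gi limit comes from the suggestion above, the request value is an assumption:

```yaml
# fragment of the gateway container spec (illustrative)
resources:
  requests:
    memory: 512Mi   # assumed request, not specified in this thread
  limits:
    memory: 4Gi     # suggested limit; exceeding it gets the container OOM-killed and restarted
```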

ibuziuk commented 3 years ago

@skabashnyuk not sure how it is possible to set that up on a running instance powered by OLM (for the operator pod it is possible to patch the CSV, but I'm not sure about the operands). Currently, I just restarted the pod - it looks like the gateway pod is the only one without container limits.

gazarenkov commented 2 years ago

As agreed with all stakeholders, let's start with setting container limits and see if it is a real memory-leak problem or just the way GC works.

sparkoo commented 2 years ago

I've merged the PR to set the gateway container resources: https://github.com/eclipse-che/che-operator/pull/1276. I'm keeping this issue open, but I'm not actively working on it atm. I want to discuss with @ibuziuk what the current state is on the affected cluster, and then we may either close it or dig deeper.

ibuziuk commented 2 years ago

thanks, I guess the limits will land in the product only in 2.15, so we can wait until production is running against the version of the product with the correct limits and monitor. +1 for not closing the issue until then

ibuziuk commented 2 years ago

the current state with 2.15.2 is that the gateway keeps restarting with OOM roughly once per day


Memory usage details for the last 2 days: [screenshot]

che-bot commented 2 years ago

Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.

Mark the issue as fresh with /remove-lifecycle stale in a new comment.

If this issue is safe to close now please do so.

Moderators: Add lifecycle/frozen label to avoid stale mode.