Closed: ibuziuk closed this issue 2 years ago
What is the gateway image please?
@sparkoo the reproducer is a product one: registry.redhat.io/codeready-workspaces/traefik-rhel8@sha256:6704bd086f0d971ecedc1dd6dc7a90429231fdfa86579e742705b31cbedbd8b2
Looks like it's based on an older Traefik, 2.3.2. We're now using 2.5.0, and 2.5.3 has already been released upstream. This may be hard, as we will need to reproduce it, ideally on the latest upstream version, because the Traefik maintainers will tell us "update to latest first, then talk to us".
I'm interested whether restarting the pod lowers the memory consumption, or whether it quickly jumps back up to this high value.
Also, I think we will need to start monitoring the Traefik gateway to better understand what it is doing in our deployments: https://doc.traefik.io/traefik/observability/metrics/overview/
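For reference, enabling Traefik's built-in Prometheus endpoint is a small change to the static configuration; a minimal sketch is below (how this file would be mounted into the Che gateway pod is not shown and is an assumption):

```yaml
# traefik.yml (static configuration) -- sketch of enabling the Prometheus metrics endpoint
entryPoints:
  metrics:
    address: ":8082"        # dedicated entrypoint for scraping, port is a placeholder

metrics:
  prometheus:
    entryPoint: metrics
    # per-entrypoint and per-service labels help correlate memory growth
    # with specific routes/workspaces
    addEntryPointsLabels: true
    addServicesLabels: true
```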
> ideally on the latest upstream version, because the Traefik maintainers will tell us "update to latest first, then talk to us".
ack, we are going to update to CRW 2.12 in the short term and will keep monitoring
> I'm interested whether restarting the pod lowers the memory consumption, or whether it quickly jumps back up to this high value.
restarting the pod lowers the memory consumption; however, it grows slowly but steadily, e.g. after the restart on October 7th the pod is already consuming almost 800 MB atm
@sparkoo @skabashnyuk the issue is still there on CRW 2.12.1
@ibuziuk yeah, the picture in https://github.com/eclipse/che/issues/20606#issuecomment-959758445 is unpleasant. Can you set up some limits, let's say 4G, so Kubernetes will restart the pod by itself? Meanwhile, we will work on understanding why it is happening.
@skabashnyuk not sure how it is possible to set this up on a running instance powered by OLM (for the operator pod it is possible to patch the CSV, but I'm not sure about the operands). Currently, I just restarted the pod; it looks like the gateway pod is the only one without container limits.
As agreed with all stakeholders, let's start by setting container limits and see whether it is a real memory leak or just the way the GC works.
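For illustration, this is the kind of resources stanza it implies for the gateway deployment's containers; the values below are placeholders (the 4Gi limit follows the suggestion above), not necessarily what the operator ships:

```yaml
# che-gateway deployment (excerpt) -- illustrative values only
containers:
  - name: gateway
    resources:
      requests:
        memory: 512Mi
        cpu: 100m
      limits:
        memory: 4Gi        # past this the container is OOM-killed and restarted by Kubernetes
        cpu: 500m
  - name: configbump
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 256Mi
```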
I've merged the PR to set the gateway container resources: https://github.com/eclipse-che/che-operator/pull/1276. I'm keeping this issue open, but I'm not actively working on it atm. I want to discuss with @ibuziuk what the current state on the affected cluster is, and then we can either close it or dig deeper.
thanks, I guess the limits will only land in the product in 2.15, so we can wait until production is running against the version of the product with the correct limits and monitor. +1 for not closing the issue until then
the current state with 2.15.2 is that the gateway keeps restarting with OOM ~ once per day
Memory usage details for the last 2 days:
Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.
Mark the issue as fresh with /remove-lifecycle stale in a new comment. If this issue is safe to close now, please do so.
Moderators: add the lifecycle/frozen label to avoid stale mode.
Describe the bug
Che Gateway RAM consumption on production clusters is 5 and 12 GB. The `gateway` container consumes more than 6 GB of RAM vs only 73 MB for the `configbump` container.
Che version
7.34
Steps to reproduce
N/A
Expected behavior
RAM consumption is around 200-500 MB for the pod
Runtime
OpenShift
Screenshots
PROMQL
sum(container_memory_working_set_bytes{container!="", namespace='codeready-workspaces-operator'}) by (container)
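As a follow-up to the metrics discussion above, a hedged sketch of a PrometheusRule built on the same metric that would flag the gateway container before it reaches the sizes reported here (the container label, threshold, and duration are assumptions, not values used in production):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: che-gateway-memory
  namespace: codeready-workspaces-operator
spec:
  groups:
    - name: che-gateway
      rules:
        - alert: CheGatewayHighMemory
          # working set of the gateway container above 2 GiB for 30 minutes
          expr: |
            sum(container_memory_working_set_bytes{container="gateway",
              namespace="codeready-workspaces-operator"}) > 2 * 1024 * 1024 * 1024
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "che gateway container working set has been above 2Gi for 30 minutes"
```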
Installation method
OperatorHub
Environment
Dev Sandbox (workspaces.openshift.com)
Eclipse Che Logs
configbump logs are pretty laconic
gateway logs have been shared internally
Additional context
No response