canonical / istio-operators

Charmed Istio
istio ingressgateway's envoy process is taking more memory than defined in the charm's pod manifest #376

Open nishant-dash opened 8 months ago

nishant-dash commented 8 months ago

Bug Description

I am running into an issue where the istio-ingressgateway-workload pod/container is crashlooping because it keeps getting OOM-killed.

istio-ingressgateway-workload-5dcdfb989-d52q2          1/1     Running   683 (7m23s ago)   46d

I manually patched the deployment to use 2Gi instead of 1Gi, and after a few hours of monitoring, memory usage has been increasing continuously but very slowly. As of writing this, it has gone from 1019280 -> 1032556 -> 1056748, and in the few hours it has been running it has never stopped increasing.
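For anyone wanting to apply the same workaround, the manual patch can be sketched roughly as below. This is only an illustration: the deployment name is inferred from the pod name above, while the container name and namespace are placeholders that are not stated in this report and would need to be checked against the actual deployment.

```python
import json

# Inferred from the pod name istio-ingressgateway-workload-5dcdfb989-d52q2.
DEPLOYMENT = "istio-ingressgateway-workload"
# Hypothetical values, NOT taken from the report; adjust to the real deployment.
CONTAINER = "istio-proxy"
NAMESPACE = "kubeflow"


def memory_limit_patch(container: str, limit: str) -> dict:
    """Build a strategic merge patch that sets one container's memory limit."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "resources": {"limits": {"memory": limit}},
                        }
                    ]
                }
            }
        }
    }


patch = memory_limit_patch(CONTAINER, "2Gi")
# Print the kubectl command rather than running it, so it can be reviewed first.
print(
    f"kubectl -n {NAMESPACE} patch deployment {DEPLOYMENT} "
    f"--patch '{json.dumps(patch)}'"
)
```

Raising the limit this way only delays the OOM kill if usage keeps growing, so it is a stopgap rather than a fix.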

To Reproduce

Hard to say, since it's a complicated deployment that has evolved over months and has a lot of workload on it.

Environment

App                        Version                         Status   Scale  Charm                    Channel             Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b           active       1  admission-webhook        1.7/stable          205  
argo-controller            res:oci-image@669ebd5           active       1  argo-controller          3.3/stable          236                  no       
argo-server                res:oci-image@576d038           active       1  argo-server              3.3/stable          185                  no       
dex-auth                                                   active       1  dex-auth                 2.31/stable         224  
grafana-agent-k8s          0.32.1                          waiting      1  grafana-agent-k8s        latest/stable        38  
istio-ingressgateway                                       active       1  istio-gateway            1.16/stable         551  
istio-pilot                                                active       1  istio-pilot              1.16/stable         551  
jupyter-controller         res:oci-image@1167186           active       1  jupyter-controller       1.7/stable          607                  no       
jupyter-ui                 .../9lw7s63ewtlyew486jjn1ez...  active       1  jupyter-ui                                    25  
katib-controller           res:oci-image@111495a           active       1  katib-controller         0.15/stable         282  
katib-db                   mariadb/server:
katib-db-manager           res:oci-image@16b33a5           active       1  katib-db-manager         0.15/stable         253  
katib-ui                   res:oci-image@c7dc04a           active       1  katib-ui                 0.15/stable         267  
kfp-api                    res:oci-image@bf747d5           active       1  kfp-api                  2.0-alpha.7/stable  935  
kfp-db                     mariadb/server:
kfp-persistence            res:oci-image@ebed770           active       1  kfp-persistence          2.0-alpha.7/stable  939                  no       
kfp-profile-controller     res:oci-image@aa75b0c           active       1  kfp-profile-controller   2.0-alpha.7/stable  899  
kfp-schedwf                res:oci-image@2cb9087           active       1  kfp-schedwf              2.0-alpha.7/stable  952                  no       
kfp-ui                     res:oci-image@ae72602           active       1  kfp-ui                   2.0-alpha.7/stable  934  
kfp-viewer                 res:oci-image@899e25f           active       1  kfp-viewer               2.0-alpha.7/stable  964                  no       
kfp-viz                    res:oci-image@ffaf37e           active       1  kfp-viz                  2.0-alpha.7/stable  889  
knative-eventing                                           active       1  knative-eventing         1.8/stable          224  
knative-operator                                           active       1  knative-operator         1.8/stable          199  
knative-serving                                            active       1  knative-serving          1.8/stable          224  
kserve-controller                                          active       1  kserve-controller        0.
kubeflow-dashboard         res:oci-image@6fe6eec           active       1  kubeflow-dashboard       1.7/stable          307  
kubeflow-profiles          res:profile-image@cfd6935       active       1  kubeflow-profiles        1.7/stable          269  
kubeflow-roles                                             active       1  kubeflow-roles           1.7/stable          113  
kubeflow-volumes           res:oci-image@d261609           active       1  kubeflow-volumes         1.7/stable          178  
metacontroller-operator                                    active       1  metacontroller-operator  2.0/stable          117  
minio                      res:oci-image@1755999           active       1  minio                    ckf-1.7/stable      186  
namespace-node-affinity                                    active       1  namespace-node-affinity  0.1/beta              5  
oidc-gatekeeper            res:oci-image@6b720b8           active       1  oidc-gatekeeper          ckf-1.7/stable      176  
seldon-controller-manager  res:oci-image@eb811b6           active       1  seldon-core              1.15/stable         354  
tensorboard-controller     res:oci-image@c52f7c2           active       1  tensorboard-controller   1.7/stable          156  
tensorboards-web-app       res:oci-image@929f55b           active       1  tensorboards-web-app     1.7/stable          158  
training-operator                                          active       1  training-operator        1.6/stable          215

The jupyter-ui charm is a custom charm based on the regular jupyter charm revision that kf 1.7/stable tracked a few months ago, with a modified spawner UI config.yaml. (I have also intentionally hidden the addresses.)

Relevant Log Output

from container logs

2024-01-29T11:30:33.035914Z     warn    Envoy may have been out of memory killed. Check memory usage and limits.
2024-01-29T11:30:33.036001Z     error   Envoy exited with error: signal: killed

from pod description

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    0
  Started:      Mon, 29 Jan 2024 11:37:38 +0000
  Finished:     Mon, 29 Jan 2024 11:41:11 +0000

Additional Context

No response

syncronize-issues-to-jira[bot] commented 8 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5262.

This message was autogenerated

nishant-dash commented 8 months ago

This still continues to rise; it is currently at 1103120 KB.
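To put the reported readings in perspective, here is a small sketch of the arithmetic. It assumes all four readings are in the same unit and that "KB" here means KiB (so that 2Gi equals 2097152 of them); neither assumption is confirmed in the thread, and no timestamps were given, so no growth rate per hour can be derived.

```python
# Readings reported in this issue, assumed to share one unit (KiB).
samples_kib = [1019280, 1032556, 1056748, 1103120]

# The patched limit of 2Gi expressed in KiB.
LIMIT_KIB = 2 * 1024 * 1024

# Growth between consecutive readings.
deltas = [later - earlier for earlier, later in zip(samples_kib, samples_kib[1:])]

# Total growth since the first reading and remaining room under the limit.
total_growth = samples_kib[-1] - samples_kib[0]
headroom = LIMIT_KIB - samples_kib[-1]

print("step deltas (KiB):", deltas)
print("total growth (KiB):", total_growth)
print("headroom to 2Gi (KiB):", headroom)
print(f"growth since first reading: {100 * total_growth / samples_kib[0]:.1f}%")
```

Each step is larger than the previous one, but without sample times that says nothing certain about whether the growth is accelerating.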