canonical / cos-lite-bundle

https://charmhub.io/cos-lite
Apache License 2.0

Prometheus charm comes up and then goes down after relations from another model are added. #103

Closed: amc94 closed this issue 5 months ago

amc94 commented 7 months ago

Bug Description

After Solutions QA successfully deploys the cos layer, we deploy another layer such as kubernetes or openstack. When cos-proxy is related to prometheus, prometheus seems to go into an error state. Often juju status says 'installing agent' and the unit has the message 'crash loop backoff: back-off 5m0s restarting failed container=prometheus pod=prometheus-0_cos'.

Some failed runs:
https://solutions.qa.canonical.com/testruns/80f369b2-cf62-4eea-9aa8-79d6ce619ab7
https://solutions.qa.canonical.com/testruns/b2d5136c-032b-444e-bc63-38676f812450
https://solutions.qa.canonical.com/testruns/123275ec-4ee3-48b3-869d-3a6021611897

Logs:
https://oil-jenkins.canonical.com/artifacts/80f369b2-cf62-4eea-9aa8-79d6ce619ab7/index.html
https://oil-jenkins.canonical.com/artifacts/b2d5136c-032b-444e-bc63-38676f812450/index.html
https://oil-jenkins.canonical.com/artifacts/123275ec-4ee3-48b3-869d-3a6021611897/index.html

To Reproduce

On top of MAAS we bootstrap a Juju controller, deploy microk8s v1.29 and COS on latest/stable, and then either an openstack layer or a kubernetes-maas layer.

Environment

Both of these runs were on KVMs.

Relevant log output

All runs have this in their output:
unit-prometheus-0: 2024-03-18 18:11:33 ERROR juju.worker.uniter pebble poll failed for container "prometheus": failed to get pebble info: cannot obtain system details: cannot communicate with server: Get "http://localhost/v1/system-info": dial unix /charm/containers/prometheus/pebble.socket: connect: connection refused

In a manual deployment I saw the following traceback before the connection-refused error; sorry, I don't have the logs for this one:
Traceback (most recent call last):
  File "./src/charm.py", line 1074, in <module>
    main(PrometheusCharm)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/main.py", line 444, in main
    charm = charm_class(framework)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 292, in wrap_init
    original_init(self, framework, *args, **kwargs)
  File "./src/charm.py", line 165, in __init__
    self._update_cert()
  File "/var/lib/juju/agents/unit-prometheus-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 538, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "./src/charm.py", line 494, in _update_cert
    self.container.exec(["update-ca-certificates", "--fresh"]).wait()
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1464, in wait
    exit_code = self._wait()
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1474, in _wait
    change = self._client.wait_change(self._change_id, timeout=timeout)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1992, in wait_change
    return self._wait_change_using_wait(change_id, timeout)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 2013, in _wait_change_using_wait
    return self._wait_change(change_id, this_timeout)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 2027, in _wait_change
    resp = self._request('GET', f'/v1/changes/{change_id}/wait', query)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1754, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1789, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 326, in http_open
    return self.do_open(_UnixSocketConnection, req,  # type:ignore
  File "/usr/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

Additional context

The main bug is prometheus falling back into an 'installing agent' state after it has already been set up. I'll keep adding test runs that I come across with this error.

Abuelodelanada commented 7 months ago

Hi @amc94

Let me try to understand the situation.

Are you able to reproduce the same behaviour using edge instead of stable?

sed-i commented 7 months ago

In charm code we call self.container.exec(["update-ca-certificates", "--fresh"]).wait() behind a can_connect guard.
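
For context, a minimal sketch of that pattern (illustrative only, not the exact charm code; the method and call shapes follow the traceback above):

# Sketch of the guard described above (assumed shape, not the charm's exact
# code). Pebble can still go away between the can_connect() check and the
# exec() call, which is why the exec can fail even behind the guard.
def _update_cert(self):
    if not self.container.can_connect():
        return
    self.container.exec(["update-ca-certificates", "--fresh"]).wait()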

It is one of those cases that we deemed "ok to go into error state".

We often see pebble exceptions after the can_connect guard when testing on a slow VM (although this is the first time I've seen http.client.RemoteDisconnected).

But the crash loop backoff is curious.

Is that a transient error? In the logs (1, 2, 3) it is active/idle.

amc94 commented 7 months ago

Hi, I tried edge instead of stable and managed to run into it again. Screenshots attached: juju status, the cos-proxy logs, and the telegraf monitoring of cos-proxy.

It's not necessarily two more layers, as seen in the first run where only a landscape layer is deployed.

That juju status output was collected 5 hours before the end of that run, i.e. around when the cos layer finished deployment; the later output shows:

Unit           Workload  Agent  Address     Ports      Message
controller/0*  active    idle   10.1.216.4  37017/TCP  
Model  Controller            Cloud/Region              Version  SLA          Timestamp
cos    foundations-microk8s  microk8s_cloud/localhost  3.1.7    unsupported  17:06:51Z

App           Version  Status   Scale  Charm                     Channel  Rev  Address         Exposed  Message
alertmanager  0.26.0   active       2  alertmanager-k8s          stable   101  10.152.183.99   no       
avalanche              active       2  avalanche-k8s             edge      39  10.152.183.56   no       
ca                     active       1  self-signed-certificates  edge     117  10.152.183.227  no       
catalogue              active       1  catalogue-k8s             stable    33  10.152.183.89   no       
external-ca            active       1  self-signed-certificates  edge     117  10.152.183.212  no       
grafana       9.5.3    active       1  grafana-k8s               stable   105  10.152.183.116  no       
loki          2.9.4    active       1  loki-k8s                  stable   118  10.152.183.232  no       
prometheus    2.49.1   waiting      1  prometheus-k8s            stable   170  10.152.183.187  no       installing agent
traefik       2.10.5   active       1  traefik-k8s               stable   169  10.246.167.216  no       

Unit             Workload     Agent      Address      Ports  Message
alertmanager/0*  active       idle       10.1.81.16          
alertmanager/1   active       idle       10.1.216.9          
avalanche/0*     active       idle       10.1.81.11          
avalanche/1      active       idle       10.1.216.6          
ca/0*            active       idle       10.1.81.12          
catalogue/0*     active       idle       10.1.81.13          
external-ca/0*   active       idle       10.1.216.7          
grafana/0*       active       idle       10.1.216.10         
loki/0*          active       idle       10.1.89.5           
prometheus/0*    maintenance  executing  10.1.81.17          Configuring Prometheus
traefik/0*       active       idle       10.1.81.15          

Offer         Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager  alertmanager  alertmanager-k8s  101  0/0        karma-dashboard       karma_dashboard          provider
grafana       grafana       grafana-k8s       105  1/1        grafana-dashboard     grafana_dashboard        requirer
loki          loki          loki-k8s          118  1/1        logging               loki_push_api            provider
prometheus    prometheus    prometheus-k8s    170  2/2        metrics-endpoint      prometheus_scrape        requirer
                                                              receive-remote-write  prometheus_remote_write  provider

In pods.txt in the cos crashdump it shows: prometheus-0 1/2 CrashLoopBackOff 42 (34s ago) 5h46m

Also, sorry about the less-than-beautiful screenshots.

sed-i commented 7 months ago

@amc94 from the screenshots it looks like prometheus was in error for about 40 seconds and then eventually active/idle? Can you confirm whether this is transient or persistent?

It would also be handy to see the output of describe pod, to see the reason for the crash loop backoff:

kubectl -n cos describe pod prometheus-0
amc94 commented 7 months ago
Name:             prometheus-0
Namespace:        cos
Priority:         0
Service Account:  prometheus
Node:             microk8s-27-3-3/10.246.167.163
Start Time:       Thu, 21 Mar 2024 15:09:31 +0000
Labels:           app.kubernetes.io/name=prometheus
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=prometheus-7ff58f989c
                  statefulset.kubernetes.io/pod-name=prometheus-0
Annotations:      cni.projectcalico.org/containerID: c1bd838033801c0a6112899cd335f3c7859d545f8541e73be7936d2a58c2800b
                  cni.projectcalico.org/podIP: 10.1.81.8/32
                  cni.projectcalico.org/podIPs: 10.1.81.8/32
                  controller.juju.is/id: 5e202d63-f30a-41b1-8e96-023b50669e08
                  juju.is/version: 3.3.3
                  model.juju.is/id: 883d2661-9ec5-4f40-878f-38e0b778205c
                  unit.juju.is/id: prometheus/0
Status:           Running
IP:               10.1.81.8
IPs:
  IP:           10.1.81.8
Controlled By:  StatefulSet/prometheus
Init Containers:
  charm-init:
    Container ID:  containerd://0ed257779317430360e5a618330e69228ef2b3fa72e1e91717ac9d2cc4966a0d
    Image:         public.ecr.aws/juju/jujud-operator:3.3.3
    Image ID:      public.ecr.aws/juju/jujud-operator@sha256:0c48818b8aceb3a2c98cf0a79ae472a51d3ad74e217f348b5d948ab22cdf5937
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/containeragent
    Args:
      init
      --containeragent-pebble-dir
      /containeragent/pebble
      --charm-modified-version
      0
      --data-dir
      /var/lib/juju
      --bin-dir
      /charm/bin
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 21 Mar 2024 15:09:40 +0000
      Finished:     Thu, 21 Mar 2024 15:09:40 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      prometheus-application-config  Secret  Optional: false
    Environment:
      JUJU_CONTAINER_NAMES:  prometheus
      JUJU_K8S_POD_NAME:     prometheus-0 (v1:metadata.name)
      JUJU_K8S_POD_UUID:      (v1:metadata.uid)
    Mounts:
      /charm/bin from charm-data (rw,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /containeragent/pebble from charm-data (rw,path="containeragent/pebble")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Containers:
  charm:
    Container ID:  containerd://14d81c28503399b3cacde0f93a58dce331beb6ba5c769d47f264447b5c5b5cf0
    Image:         public.ecr.aws/juju/charm-base:ubuntu-20.04
    Image ID:      public.ecr.aws/juju/charm-base@sha256:2c3ca53095187fc456bb84b939a69cb1fadb829aaee1c5f200b7d42f1e75a304
    Port:          <none>
    Host Port:     <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --http
      :38812
      --verbose
    State:          Running
      Started:      Thu, 21 Mar 2024 15:09:41 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness:      http-get http://:38812/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Startup:        http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAMES:  prometheus
      HTTP_PROBE_PORT:       3856
    Mounts:
      /charm/bin from charm-data (ro,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/lib/juju/storage/database/0 from prometheus-database-5b4ad243 (rw)
      /var/lib/pebble/default from charm-data (rw,path="containeragent/pebble")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
  prometheus:
    Container ID:  containerd://7bc1b456c12525a0a4c52aa9d0fc8a9cd50962e083572811735bcd04590b4ac6
    Image:         registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
    Image ID:      registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
    Port:          <none>
    Host Port:     <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 21 Mar 2024 22:40:34 +0000
      Finished:     Thu, 21 Mar 2024 22:41:30 +0000
    Ready:          False
    Restart Count:  57
    Limits:
      cpu:     250m
      memory:  209715200
    Requests:
      cpu:      250m
      memory:   200Mi
    Liveness:   http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness:  http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAME:  prometheus
      PEBBLE_SOCKET:        /charm/container/pebble.socket
    Mounts:
      /charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
      /charm/container from charm-data (rw,path="charm/containers/prometheus")
      /var/lib/prometheus from prometheus-database-5b4ad243 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  prometheus-database-5b4ad243:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-database-5b4ad243-prometheus-0
    ReadOnly:   false
  charm-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-bgxjs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Warning  BackOff  3m51s (x1194 over 5h18m)  kubelet  Back-off restarting failed container prometheus in pod prometheus-0_cos(e46453e4-4594-49ad-8a5a-d425dad7e920)
amc94 commented 7 months ago

@sed-i it's persistent; it hits active/idle for a short time after a restart.

sed-i commented 7 months ago

Thanks @amc94, we have another hint - prometheus is being OOMKilled:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled

Any chance prometheus has accumulated a large WAL that doesn't fit into memory? (Could you attach the output of juju config avalanche?) You could check the WAL size with:

juju ssh --container prometheus prometheus/0 du -hs /var/lib/prometheus/wal

This type of failure could be more obvious if you apply resource limits to the pod:

juju config prometheus cpu=2 memory=4Gi
amc94 commented 7 months ago
application: avalanche
application-config: 
  juju-application-path: 
    default: /
    description: the relative http path used to access an application
    source: default
    type: string
    value: /
  juju-external-hostname: 
    description: the external hostname of an exposed application
    source: unset
    type: string
  kubernetes-ingress-allow-http: 
    default: false
    description: whether to allow HTTP traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-ingress-class: 
    default: nginx
    description: the class of the ingress controller to be used by the ingress resource
    source: default
    type: string
    value: nginx
  kubernetes-ingress-ssl-passthrough: 
    default: false
    description: whether to passthrough SSL traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-ingress-ssl-redirect: 
    default: false
    description: whether to redirect SSL traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-service-annotations: 
    description: a space separated set of annotations to add to the service
    source: unset
    type: attrs
  kubernetes-service-external-ips: 
    description: list of IP addresses for which nodes in the cluster will also accept
      traffic
    source: unset
    type: string
  kubernetes-service-externalname: 
    description: external reference that kubedns or equivalent will return as a CNAME
      record
    source: unset
    type: string
  kubernetes-service-loadbalancer-ip: 
    description: LoadBalancer will get created with the IP specified in this field
    source: unset
    type: string
  kubernetes-service-loadbalancer-sourceranges: 
    description: traffic through the load-balancer will be restricted to the specified
      client IPs
    source: unset
    type: string
  kubernetes-service-target-port: 
    description: name or number of the port to access on the pods targeted by the
      service
    source: unset
    type: string
  kubernetes-service-type: 
    description: determines how the Service is exposed
    source: unset
    type: string
  trust: 
    default: false
    description: Does this application have access to trusted credentials
    source: user
    type: bool
    value: true
charm: avalanche-k8s
settings: 
  label_count: 
    default: 10
    description: Number of labels per-metric.
    source: default
    type: int
    value: 10
  labelname_length: 
    default: 5
    description: Modify length of label names.
    source: default
    type: int
    value: 5
  metric_count: 
    default: 500
    description: Number of metrics to serve.
    source: user
    type: int
    value: 10
  metric_interval: 
    default: 3.6e+07
    description: |
      Change __name__ label values every {interval} seconds. Avalanche's CLI default value is 120, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
    source: default
    type: int
    value: 3.6e+07
  metricname_length: 
    default: 5
    description: Modify length of metric names.
    source: default
    type: int
    value: 5
  series_count: 
    default: 10
    description: Number of series per-metric.
    source: user
    type: int
    value: 2
  series_interval: 
    default: 3.6e+07
    description: |
      Change series_id label values every {interval} seconds. Avalanche's CLI default value is 60, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
    source: default
    type: int
    value: 3.6e+07
  value_interval: 
    default: 30
    description: Change series values every {interval} seconds.
    source: default
    type: int
    value: 30

16M /var/lib/prometheus/wal

sed-i commented 7 months ago

Yep, 500*10 = 5000 values every 30sec is not a high load at all, and the WAL reflects it. Can we dig a bit deeper? Could you share the output of:

amc94 commented 7 months ago

journalctl was empty for both. The pod status shows:

{
  "conditions": [
    {"lastProbeTime": null, "lastTransitionTime": "2024-03-22T07:14:26Z", "status": "True", "type": "Initialized"},
    {"lastProbeTime": null, "lastTransitionTime": "2024-03-22T13:52:11Z", "message": "containers with unready status: [prometheus]", "reason": "ContainersNotReady", "status": "False", "type": "Ready"},
    {"lastProbeTime": null, "lastTransitionTime": "2024-03-22T13:52:11Z", "message": "containers with unready status: [prometheus]", "reason": "ContainersNotReady", "status": "False", "type": "ContainersReady"},
    {"lastProbeTime": null, "lastTransitionTime": "2024-03-22T07:14:13Z", "status": "True", "type": "PodScheduled"}
  ],
  "containerStatuses": [
    {"containerID": "containerd://b97ff807f8b8738db2c91851d21deb317448ab489a9c2b81d161630c448fc20a",
     "image": "public.ecr.aws/juju/charm-base:ubuntu-20.04",
     "imageID": "public.ecr.aws/juju/charm-base@sha256:accafa4a09fea590ba0c5baba90fec90e6c51136fe772695e3724b3d8c879dd2",
     "lastState": {}, "name": "charm", "ready": true, "restartCount": 0, "started": true,
     "state": {"running": {"startedAt": "2024-03-22T07:14:26Z"}}},
    {"containerID": "containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd",
     "image": "sha256:d09e269a1213ea7586369dfd16611f33823897871731d01588e1096e2c146614",
     "imageID": "registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd",
     "lastState": {"terminated": {"containerID": "containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd", "exitCode": 137, "finishedAt": "2024-03-22T13:52:10Z", "reason": "OOMKilled", "startedAt": "2024-03-22T13:51:21Z"}},
     "name": "prometheus", "ready": false, "restartCount": 48, "started": false,
     "state": {"waiting": {"message": "back-off 5m0s restarting failed container=prometheus pod=prometheus-0_cos(1513187a-9472-491c-a5d5-065665d3a8b4)", "reason": "CrashLoopBackOff"}}}
  ],
  "hostIP": "10.246.164.182",
  "initContainerStatuses": [
    {"containerID": "containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd",
     "image": "public.ecr.aws/juju/jujud-operator:3.3.3",
     "imageID": "public.ecr.aws/juju/jujud-operator@sha256:2921a3ee54d7f7f7847a8e8bc9a132b1deb40ed32c37098694df68b9e1a6808b",
     "lastState": {}, "name": "charm-init", "ready": true, "restartCount": 0, "started": false,
     "state": {"terminated": {"containerID": "containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd", "exitCode": 0, "finishedAt": "2024-03-22T07:14:24Z", "reason": "Completed", "startedAt": "2024-03-22T07:14:24Z"}}}
  ],
  "phase": "Running",
  "podIP": "10.1.240.201",
  "podIPs": [{"ip": "10.1.240.201"}],
  "qosClass": "Burstable",
  "startTime": "2024-03-22T07:14:14Z"
}

sed-i commented 7 months ago

Really odd to see "reason":"OOMKilled" and "restartCount":48 with such a small ingestion load. Anything noteworthy from prometheus itself?

 kubectl -n cos logs prometheus-0 -c prometheus
amc94 commented 7 months ago

@sed-i We've currently stopped deploying cos-proxy, so prometheus isn't hitting this issue. Could it be that cos-proxy was writing enough data in a single go to cause prometheus to hit OOM?

sed-i commented 7 months ago

(Technically, cos-proxy doesn't send metrics; cos-proxy sends scrape job specs over relation data to prometheus, and prometheus does the scraping.) It's possible that there are a lot of metrics to scrape, but I somehow doubt you hit that in a testing env.
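
For illustration, a scrape job spec of the kind passed over relation data might look roughly like this (a sketch only; the job name and targets are made up and the exact relation schema isn't reproduced here; the field names follow Prometheus scrape configuration):

# Hypothetical scrape job spec carried over relation data; prometheus then
# scrapes the listed targets itself.
scrape_job = {
    "job_name": "cos-proxy-node-exporter",   # assumed name
    "metrics_path": "/metrics",
    "static_configs": [
        {"targets": ["10.246.167.10:9100"]},  # example target
    ],
}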

It is much more likely that loki gets overloaded. When both prometheus and loki consume a lot of resources, I've seen the OOM-kill algorithm select prometheus over loki.

From the Jenkins logs you shared I couldn't spot the bundle YAMLs related to the cos charms. Would you be able to link them here?

amc94 commented 7 months ago

Thank you for explaining. The bundle file for openstack

lucabello commented 5 months ago

Have you seen this error recently?

amc94 commented 5 months ago

No, it has not.