Hi @amc94
Let me try to understand the situation.
Are you able to reproduce the same behaviour using edge instead of stable?
In charm code we call self.container.exec(["update-ca-certificates", "--fresh"]).wait() behind a can_connect guard.
It is one of those cases that we deemed "ok to go into error state".
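For reference, a minimal sketch of that pattern, assuming the usual ops/Pebble API (the helper name and logger are illustrative, not the charm's actual code):

```python
import logging

from ops.model import Container

logger = logging.getLogger(__name__)


def refresh_ca_certificates(container: Container) -> None:
    """Illustrative helper; in the charm this logic sits behind self.container."""
    # Guard: do nothing if the Pebble socket in the workload container is not reachable yet.
    if not container.can_connect():
        logger.debug("Pebble not ready; skipping update-ca-certificates")
        return
    # Blocks until the command finishes. If Pebble drops the connection between the
    # can_connect() check and the exec() call (e.g. the http.client.RemoteDisconnected
    # seen here), the resulting exception is left unhandled and the unit goes into an
    # error state, i.e. the "ok to go into error state" case mentioned above.
    container.exec(["update-ca-certificates", "--fresh"]).wait()
```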
We often see pebble exceptions after the can_connect guard when testing on a slow VM (although this is the first time I see http.client.RemoteDisconnected).
But the crash loop backoff is curious.
Is that a transient error? In the logs (1, 2, 3) it is active/idle.
Hi, I tried edge instead of stable and managed to run into it again. Attached screenshots show juju status, the cos-proxy logs, and the telegraf monitoring cos-proxy.
It's not necessarily two more layers, as seen in the first run where only a landscape layer is deployed.
That juju log output was collected 5 hours before the end of that run, so by the time the cos layer finished deployment, the later output shows:
Unit Workload Agent Address Ports Message
controller/0* active idle 10.1.216.4 37017/TCP
Model Controller Cloud/Region Version SLA Timestamp
cos foundations-microk8s microk8s_cloud/localhost 3.1.7 unsupported 17:06:51Z
App Version Status Scale Charm Channel Rev Address Exposed Message
alertmanager 0.26.0 active 2 alertmanager-k8s stable 101 10.152.183.99 no
avalanche active 2 avalanche-k8s edge 39 10.152.183.56 no
ca active 1 self-signed-certificates edge 117 10.152.183.227 no
catalogue active 1 catalogue-k8s stable 33 10.152.183.89 no
external-ca active 1 self-signed-certificates edge 117 10.152.183.212 no
grafana 9.5.3 active 1 grafana-k8s stable 105 10.152.183.116 no
loki 2.9.4 active 1 loki-k8s stable 118 10.152.183.232 no
prometheus 2.49.1 waiting 1 prometheus-k8s stable 170 10.152.183.187 no installing agent
traefik 2.10.5 active 1 traefik-k8s stable 169 10.246.167.216 no
Unit Workload Agent Address Ports Message
alertmanager/0* active idle 10.1.81.16
alertmanager/1 active idle 10.1.216.9
avalanche/0* active idle 10.1.81.11
avalanche/1 active idle 10.1.216.6
ca/0* active idle 10.1.81.12
catalogue/0* active idle 10.1.81.13
external-ca/0* active idle 10.1.216.7
grafana/0* active idle 10.1.216.10
loki/0* active idle 10.1.89.5
prometheus/0* maintenance executing 10.1.81.17 Configuring Prometheus
traefik/0* active idle 10.1.81.15
Offer Application Charm Rev Connected Endpoint Interface Role
alertmanager alertmanager alertmanager-k8s 101 0/0 karma-dashboard karma_dashboard provider
grafana grafana grafana-k8s 105 1/1 grafana-dashboard grafana_dashboard requirer
loki loki loki-k8s 118 1/1 logging loki_push_api provider
prometheus prometheus prometheus-k8s 170 2/2 metrics-endpoint prometheus_scrape requirer
receive-remote-write prometheus_remote_write provider
And in pods.txt in the cos crashdump it shows: prometheus-0 1/2 CrashLoopBackOff 42 (34s ago) 5h46m
Also, sorry about the less-than-beautiful screenshots.
@amc94 from the screenshot it looks like prometheus was in error for about 40 sec and then eventually active/idle? Can you confirm whether this is transient or persistent?
It would also be handy to see the output of describe pod to see the reason for the crash loop backoff:
kubectl -n cos describe pod prometheus-0
Name: prometheus-0
Namespace: cos
Priority: 0
Service Account: prometheus
Node: microk8s-27-3-3/10.246.167.163
Start Time: Thu, 21 Mar 2024 15:09:31 +0000
Labels: app.kubernetes.io/name=prometheus
apps.kubernetes.io/pod-index=0
controller-revision-hash=prometheus-7ff58f989c
statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: cni.projectcalico.org/containerID: c1bd838033801c0a6112899cd335f3c7859d545f8541e73be7936d2a58c2800b
cni.projectcalico.org/podIP: 10.1.81.8/32
cni.projectcalico.org/podIPs: 10.1.81.8/32
controller.juju.is/id: 5e202d63-f30a-41b1-8e96-023b50669e08
juju.is/version: 3.3.3
model.juju.is/id: 883d2661-9ec5-4f40-878f-38e0b778205c
unit.juju.is/id: prometheus/0
Status: Running
IP: 10.1.81.8
IPs:
IP: 10.1.81.8
Controlled By: StatefulSet/prometheus
Init Containers:
charm-init:
Container ID: containerd://0ed257779317430360e5a618330e69228ef2b3fa72e1e91717ac9d2cc4966a0d
Image: public.ecr.aws/juju/jujud-operator:3.3.3
Image ID: public.ecr.aws/juju/jujud-operator@sha256:0c48818b8aceb3a2c98cf0a79ae472a51d3ad74e217f348b5d948ab22cdf5937
Port: <none>
Host Port: <none>
Command:
/opt/containeragent
Args:
init
--containeragent-pebble-dir
/containeragent/pebble
--charm-modified-version
0
--data-dir
/var/lib/juju
--bin-dir
/charm/bin
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 21 Mar 2024 15:09:40 +0000
Finished: Thu, 21 Mar 2024 15:09:40 +0000
Ready: True
Restart Count: 0
Environment Variables from:
prometheus-application-config Secret Optional: false
Environment:
JUJU_CONTAINER_NAMES: prometheus
JUJU_K8S_POD_NAME: prometheus-0 (v1:metadata.name)
JUJU_K8S_POD_UUID: (v1:metadata.uid)
Mounts:
/charm/bin from charm-data (rw,path="charm/bin")
/charm/containers from charm-data (rw,path="charm/containers")
/containeragent/pebble from charm-data (rw,path="containeragent/pebble")
/var/lib/juju from charm-data (rw,path="var/lib/juju")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Containers:
charm:
Container ID: containerd://14d81c28503399b3cacde0f93a58dce331beb6ba5c769d47f264447b5c5b5cf0
Image: public.ecr.aws/juju/charm-base:ubuntu-20.04
Image ID: public.ecr.aws/juju/charm-base@sha256:2c3ca53095187fc456bb84b939a69cb1fadb829aaee1c5f200b7d42f1e75a304
Port: <none>
Host Port: <none>
Command:
/charm/bin/pebble
Args:
run
--http
:38812
--verbose
State: Running
Started: Thu, 21 Mar 2024 15:09:41 +0000
Ready: True
Restart Count: 0
Liveness: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Readiness: http-get http://:38812/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
Startup: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Environment:
JUJU_CONTAINER_NAMES: prometheus
HTTP_PROBE_PORT: 3856
Mounts:
/charm/bin from charm-data (ro,path="charm/bin")
/charm/containers from charm-data (rw,path="charm/containers")
/var/lib/juju from charm-data (rw,path="var/lib/juju")
/var/lib/juju/storage/database/0 from prometheus-database-5b4ad243 (rw)
/var/lib/pebble/default from charm-data (rw,path="containeragent/pebble")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
prometheus:
Container ID: containerd://7bc1b456c12525a0a4c52aa9d0fc8a9cd50962e083572811735bcd04590b4ac6
Image: registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
Image ID: registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
Port: <none>
Host Port: <none>
Command:
/charm/bin/pebble
Args:
run
--create-dirs
--hold
--http
:38813
--verbose
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 21 Mar 2024 22:40:34 +0000
Finished: Thu, 21 Mar 2024 22:41:30 +0000
Ready: False
Restart Count: 57
Limits:
cpu: 250m
memory: 209715200
Requests:
cpu: 250m
memory: 200Mi
Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
Environment:
JUJU_CONTAINER_NAME: prometheus
PEBBLE_SOCKET: /charm/container/pebble.socket
Mounts:
/charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
/charm/container from charm-data (rw,path="charm/containers/prometheus")
/var/lib/prometheus from prometheus-database-5b4ad243 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
prometheus-database-5b4ad243:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-database-5b4ad243-prometheus-0
ReadOnly: false
charm-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-bgxjs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m51s (x1194 over 5h18m) kubelet Back-off restarting failed container prometheus in pod prometheus-0_cos(e46453e4-4594-49ad-8a5a-d425dad7e920)
@sed-i it's persistent; it hits active/idle for a short time after a restart.
Thanks @amc94, we have another hint - prometheus is being OOMKilled:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Any chance prometheus has accumulated a large WAL that doesn't fit into memory? (Could you attach the output of juju config avalanche?)
You could check with:
juju ssh --container prometheus prometheus/0 du -hs /var/lib/prometheus/wal
This type of failure could be more obvious if you apply resource limits to the pod:
juju config prometheus cpu=2 memory=4Gi
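If it helps cross-check, here is a small sketch (assuming the kubernetes Python client is installed and a kubeconfig for the microk8s cluster is at hand; not something the charm itself does) for confirming which requests/limits actually landed on the pod, since the OOMKill threshold comes from the container's memory limit (about 200Mi in the describe output above):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
pod = client.CoreV1Api().read_namespaced_pod("prometheus-0", "cos")
for container in pod.spec.containers:
    # The "prometheus" container is the one being OOMKilled once its working set
    # exceeds the memory limit shown here.
    print(container.name, container.resources.requests, container.resources.limits)
```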
application: avalanche
application-config:
  juju-application-path:
    default: /
    description: the relative http path used to access an application
    source: default
    type: string
    value: /
  juju-external-hostname:
    description: the external hostname of an exposed application
    source: unset
    type: string
  kubernetes-ingress-allow-http:
    default: false
    description: whether to allow HTTP traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-ingress-class:
    default: nginx
    description: the class of the ingress controller to be used by the ingress resource
    source: default
    type: string
    value: nginx
  kubernetes-ingress-ssl-passthrough:
    default: false
    description: whether to passthrough SSL traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-ingress-ssl-redirect:
    default: false
    description: whether to redirect SSL traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-service-annotations:
    description: a space separated set of annotations to add to the service
    source: unset
    type: attrs
  kubernetes-service-external-ips:
    description: list of IP addresses for which nodes in the cluster will also accept
      traffic
    source: unset
    type: string
  kubernetes-service-externalname:
    description: external reference that kubedns or equivalent will return as a CNAME
      record
    source: unset
    type: string
  kubernetes-service-loadbalancer-ip:
    description: LoadBalancer will get created with the IP specified in this field
    source: unset
    type: string
  kubernetes-service-loadbalancer-sourceranges:
    description: traffic through the load-balancer will be restricted to the specified
      client IPs
    source: unset
    type: string
  kubernetes-service-target-port:
    description: name or number of the port to access on the pods targeted by the
      service
    source: unset
    type: string
  kubernetes-service-type:
    description: determines how the Service is exposed
    source: unset
    type: string
  trust:
    default: false
    description: Does this application have access to trusted credentials
    source: user
    type: bool
    value: true
charm: avalanche-k8s
settings:
  label_count:
    default: 10
    description: Number of labels per-metric.
    source: default
    type: int
    value: 10
  labelname_length:
    default: 5
    description: Modify length of label names.
    source: default
    type: int
    value: 5
  metric_count:
    default: 500
    description: Number of metrics to serve.
    source: user
    type: int
    value: 10
  metric_interval:
    default: 3.6e+07
    description: |
      Change __name__ label values every {interval} seconds. Avalanche's CLI default value is 120, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
    source: default
    type: int
    value: 3.6e+07
  metricname_length:
    default: 5
    description: Modify length of metric names.
    source: default
    type: int
    value: 5
  series_count:
    default: 10
    description: Number of series per-metric.
    source: user
    type: int
    value: 2
  series_interval:
    default: 3.6e+07
    description: |
      Change series_id label values every {interval} seconds. Avalanche's CLI default value is 60, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
    source: default
    type: int
    value: 3.6e+07
  value_interval:
    default: 30
    description: Change series values every {interval} seconds.
    source: default
    type: int
    value: 30
16M /var/lib/prometheus/wal
Yep, 500*10 = 5000 values every 30sec is not a high load at all, and the WAL reflects it. Can we dig a bit deeper? Could you share the output of:
journalctl | grep eviction
journalctl --no-pager -kqg 'killed process' -o verbose --output-fields=MESSAGE
kubectl get pod prometheus-0 -o=jsonpath='{.status}' -n cos
journalctl was empty for both
{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T07:14:26Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T13:52:11Z","message":"containers with unready status: [prometheus]","reason":"ContainersNotReady","status":"False","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T13:52:11Z","message":"containers with unready status: [prometheus]","reason":"ContainersNotReady","status":"False","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T07:14:13Z","status":"True","type":"PodScheduled"}],"containerStatuses":[{"containerID":"containerd://b97ff807f8b8738db2c91851d21deb317448ab489a9c2b81d161630c448fc20a","image":"public.ecr.aws/juju/charm-base:ubuntu-20.04","imageID":"public.ecr.aws/juju/charm-base@sha256:accafa4a09fea590ba0c5baba90fec90e6c51136fe772695e3724b3d8c879dd2","lastState":{},"name":"charm","ready":true,"restartCount":0,"started":true,"state":{"running":{"startedAt":"2024-03-22T07:14:26Z"}}},{"containerID":"containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd","image":"sha256:d09e269a1213ea7586369dfd16611f33823897871731d01588e1096e2c146614","imageID":"registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd","lastState":{"terminated":{"containerID":"containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd","exitCode":137,"finishedAt":"2024-03-22T13:52:10Z","reason":"OOMKilled","startedAt":"2024-03-22T13:51:21Z"}},"name":"prometheus","ready":false,"restartCount":48,"started":false,"state":{"waiting":{"message":"back-off 5m0s restarting failed container=prometheus pod=prometheus-0_cos(1513187a-9472-491c-a5d5-065665d3a8b4)","reason":"CrashLoopBackOff"}}}],"hostIP":"10.246.164.182","initContainerStatuses":[{"containerID":"containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd","image":"public.ecr.aws/juju/jujud-operator:3.3.3","imageID":"public.ecr.aws/juju/jujud-operator@sha256:2921a3ee54d7f7f7847a8e8bc9a132b1deb40ed32c37098694df68b9e1a6808b","lastState":{},"name":"charm-init","ready":true,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd","exitCode":0,"finishedAt":"2024-03-22T07:14:24Z","reason":"Completed","startedAt":"2024-03-22T07:14:24Z"}}}],"phase":"Running","podIP":"10.1.240.201","podIPs":[{"ip":"10.1.240.201"}],"qosClass":"Burstable","startTime":"2024-03-22T07:14:14Z"}
Really odd to see "reason":"OOMKilled" and "restartCount":48 with such a small ingestion load.
Anything noteworthy from prometheus itself?
kubectl -n cos logs prometheus-0 -c prometheus
@sed-i We've currently stopped deploying cos-proxy so prometheus isn't hitting this issue. Could it be that cos-proxy was writing enough data in a single go that it caused prometheus to hit OOM?
(Technically, cos-proxy doesn't send metrics; cos-proxy sends scrape job specs over relation data to prometheus, and prometheus does the scraping.) It's possible that there are a lot of metrics to scrape, but I somehow doubt you hit that in a testing env.
It is much more likely that loki gets overloaded. When both prom and loki consume a lot of resources, I've seen the oomkill algo selecting prometheus over loki.
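To illustrate the first point, a rough, hypothetical sketch of the kind of scrape-job spec that travels over the relation (the relation-data key names and the job/target values here are assumptions for illustration, not taken from cos-proxy's code; the job fields follow the Prometheus scrape-config format):

```python
import json

# Hypothetical scrape-job spec; prometheus itself does the scraping against these targets.
scrape_jobs = [
    {
        "job_name": "telegraf",
        "metrics_path": "/metrics",
        "static_configs": [
            {"targets": ["192.0.2.10:9103"], "labels": {"juju_application": "telegraf"}}
        ],
    }
]

# The spec is serialized into application relation data; no metric samples are sent,
# only the description of what prometheus should scrape.
relation_app_data = {"scrape_jobs": json.dumps(scrape_jobs)}
print(relation_app_data)
```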
From the jenkins logs you shared I couldn't spot the bundle yamls that are related to the cos charms. Would you be able to link them here?
Thank you for explaining. The bundle file for openstack
Have you seen this error recently?
No, it has not.
Bug Description
After Solutions QA successfully deploys the cos layer, we deploy another layer such as kubernetes or openstack. When cos-proxy relates to prometheus, prometheus seems to go into an error state. Often juju status says 'installing agent' and the unit has the message 'crash loop backoff: back-off 5m0s restarting failed container=prometheus pod=prometheus-0_cos'.
some failed runs: https://solutions.qa.canonical.com/testruns/80f369b2-cf62-4eea-9aa8-79d6ce619ab7 https://solutions.qa.canonical.com/testruns/b2d5136c-032b-444e-bc63-38676f812450 https://solutions.qa.canonical.com/testruns/123275ec-4ee3-48b3-869d-3a6021611897
logs: https://oil-jenkins.canonical.com/artifacts/80f369b2-cf62-4eea-9aa8-79d6ce619ab7/index.html https://oil-jenkins.canonical.com/artifacts/b2d5136c-032b-444e-bc63-38676f812450/index.html https://oil-jenkins.canonical.com/artifacts/123275ec-4ee3-48b3-869d-3a6021611897/index.html
To Reproduce
On top of MAAS we bootstrap a juju controller, deploy microk8s v1.29 and COS on latest/stable, and then deploy either an openstack layer or a kubernetes maas layer.
Environment
Both of these runs were on KVMs.
Relevant log output
Additional context
The main bug is prometheus falling back into a state of 'installing agent' after it has already been set up. I'll keep adding test runs that I come across that have this error.