Closed DnPlas closed 9 months ago
On track/2.0, the following failures were observed:
Model Controller Cloud/Region Version SLA Timestamp
test-charm-yjgo github-pr-77e6a-microk8s microk8s/localhost 2.9.44 unsupported 11:52:17Z
App Version Status Scale Charm Channel Rev Address Exposed Message
grafana-k8s 9.2.1 active 1 grafana-k8s latest/stable 81 10.152.183.115 no
kfp-api waiting 1 kfp-api 0 10.152.183.6 no installing agent
kfp-db mariadb/server:10.3 active 1 charmed-osm-mariadb-k8s latest/stable 35 10.152.183.232 no ready
kfp-viz res:oci-image@3de6f3c blocked 0/1 kfp-viz 2.0/stable 476 10.152.183.34 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
minio res:oci-image@1755999 blocked 0/1 minio ckf-1.7/stable 186 10.152.183.121 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
mysql-k8s 8.0.32-0ubuntu0.22.04.2 waiting 1 mysql-k8s 8.0/edge 88 10.152.183.231 no waiting for units to settle down
prometheus-k8s 2.33.5 active 1 prometheus-k8s latest/stable 103 10.152.183.141 no
prometheus-scrape-config-k8s n/a active 1 prometheus-scrape-config-k8s latest/stable 39 10.152.183.123 no
Unit Workload Agent Address Ports Message
grafana-k8s/0* active idle 10.1.76.158
kfp-api/0* blocked idle 10.1.76.137 Please add required database relation: eg. relational-db
kfp-db/0* active idle 10.1.76.145 3306/TCP ready
kfp-viz/0 unknown lost 10.1.76.146 8888/TCP agent lost, see 'juju show-status-log kfp-viz/0'
minio/0 unknown lost 10.1.76.149 9000/TCP,9001/TCP agent lost, see 'juju show-status-log minio/0'
mysql-k8s/0* error idle 10.1.76.160 hook failed: "database-relation-changed"
prometheus-k8s/0* active idle 10.1.76.157
prometheus-scrape-config-k8s/0* active idle 10.1.76.153
Thanks @i-chvets, I was exactly looking for this kind of log:
minio res:oci-image@1755999 blocked 0/1 minio ckf-1.7/stable 186 10.152.183.121 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
That is a symptom of the machine running out of resources. We shall investigate a bit more.
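For anyone chasing the disk-pressure taint above, here is a minimal sketch (an assumption-laden illustration, not part of our CI: it assumes the kubernetes Python client and a kubeconfig for the MicroK8s cluster, e.g. from microk8s config) that confirms whether the node is actually reporting DiskPressure:

```python
# Sketch: list each node's DiskPressure condition and taints to confirm
# whether the scheduler really sees disk pressure on the runner's node.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    print(node.metadata.name)
    for cond in node.status.conditions or []:
        if cond.type == "DiskPressure":
            print(f"  DiskPressure={cond.status} reason={cond.reason} message={cond.message}")
    for taint in node.spec.taints or []:
        print(f"  taint {taint.key}={taint.value}:{taint.effect}")
```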
Something feels a bit funny here. AFAIK these VMs should consistently get the same disk space, so it feels odd that we'd have an intermittent disk space error. But I don't have a better idea of what is going on.
Other components can also experience similar issues:
kfp-viz res:oci-image@3de6f3c blocked 0/1 kfp-viz 2.0/stable 476 10.152.183.34 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
minio res:oci-image@1755999 blocked 0/1 minio ckf-1.7/stable 186 10.152.183.121 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
It is not really a VM disk issue, but storage in MicroK8s.
I want a cookie for that :)
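For the storage-in-MicroK8s theory, a quick sketch (paths are assumptions based on MicroK8s defaults for the hostpath-storage addon; adjust if your storage class points elsewhere) that reports free space on the filesystems MicroK8s actually writes to:

```python
# Sketch: report free space on MicroK8s' default hostpath storage location,
# the snap's common data dir, and the root filesystem for comparison.
import shutil

for path in (
    "/var/snap/microk8s/common/default-storage",  # assumed hostpath-storage default
    "/var/snap/microk8s/common",
    "/",
):
    try:
        usage = shutil.disk_usage(path)
        print(f"{path}: free {usage.free / 2**30:.1f} GiB of {usage.total / 2**30:.1f} GiB")
    except FileNotFoundError:
        print(f"{path}: not present on this machine")
```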
Another random error in KFP:
kfp-profile-controller python:3.7 error 1 kfp-profile-controller 0 no creating or updating custom resources: getting custom resources: attempt count exceeded: getting custom resource defi...
This is our charm that tries to apply K8s resources and fails after many attempts. I will add more logs when they are uploaded. The test is still going and failing all over the place.
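For context on "attempt count exceeded", here is a hedged sketch of the kind of bounded retry that message hints at (not the charm's actual code; the CRD name is only for illustration). Re-raising the last underlying error instead of a generic "attempt count exceeded" would make these CI logs much easier to read:

```python
# Sketch: poll for a CRD with exponential backoff and surface the real
# underlying exception on final failure (reraise=True).
from kubernetes import client, config
from tenacity import retry, stop_after_attempt, wait_exponential

config.load_kube_config()
ext = client.ApiextensionsV1Api()

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=30), reraise=True)
def get_crd(name: str):
    return ext.read_custom_resource_definition(name)

crd = get_crd("profiles.kubeflow.org")  # illustrative CRD name
print(crd.metadata.name)
```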
Latest in KFP:
minio/0* error idle 10.1.170.223 9000/TCP,9001/TCP unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc...
Corresponding log: https://pastebin.canonical.com/p/MbXsNbz6RY/
Interestingly enough, I ran into something very similar today in the dex-auth-operator repo. Look at this CI execution.
I came across this error here with argo-server. Attaching any logs I find related but if you think there are more, let me know. Error:
argo-server/0* error idle 10.1.21.142 2746/TCP unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/hs4bjttgix7e2kf318...
juju status logs-unit-argo-server-0
04 Aug 2023 11:15:15Z juju-unit executing running config-changed hook
04 Aug 2023 11:15:18Z workload maintenance Setting pod spec
04 Aug 2023 11:15:21Z juju-unit executing running start hook
04 Aug 2023 11:15:24Z juju-unit idle
04 Aug 2023 11:17:25Z juju-unit error OCI image pull error: rpc error: code = Unknown desc = failed to pull and unpack image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.jujucharms.com/v2/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image/blobs/sha256:164db2618c02555da834fe4893b2922dba49f830ff886a4084ad515a2365a19e: 502 Proxy Error
04 Aug 2023 11:17:25Z workload error OCI image pull error: rpc error: code = Unknown desc = failed to pull and unpack image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.jujucharms.com/v2/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image/blobs/sha256:164db2618c02555da834fe4893b2922dba49f830ff886a4084ad515a2365a19e: 502 Proxy Error
04 Aug 2023 11:17:38Z juju-unit error unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2"
04 Aug 2023 11:17:38Z workload error unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2"
Couldn't find anything that could be useful in juju debug-log, but let me know if you'd like to investigate yourself.
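In case it helps reproduce outside CI, a rough probe of the failing blob (URL copied verbatim from the status log above; the registry may require an auth token, so a 401 here is possible and still distinguishable from the 502 Proxy Error the unit saw):

```python
# Sketch: re-request the blob that failed with "502 Proxy Error" to see
# whether the registry is still misbehaving from this network location.
import requests

BLOB_URL = (
    "https://registry.jujucharms.com/v2/charm/"
    "hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image/blobs/"
    "sha256:164db2618c02555da834fe4893b2922dba49f830ff886a4084ad515a2365a19e"
)

resp = requests.get(BLOB_URL, stream=True, timeout=30)
print(resp.status_code, resp.reason)  # a 502 would point at the registry, not the runner
resp.close()
```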
Just bumped into this for a different charm in a VM of my own running MicroK8s:
juju show-status mlflow-minio
Model Controller Cloud/Region Version SLA Timestamp
kubeflow uk8s-controller microk8s/localhost 2.9.44 unsupported 08:28:37Z
App Version Status Scale Charm Channel Rev Address Exposed Message
mlflow-minio res:oci-image@1755999 waiting 1 minio ckf-1.7/edge 186 10.152.183.144 no
Unit Workload Agent Address Ports Message
mlflow-minio/0* error idle 10.1.154.150 9000/TCP,9001/TCP unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc...
So this is neither KFP nor GH runner specific.
I've seen this failure in some of our Juju CI before (just simple bootstrap/smoke tests). Probably something to do with K8s storage on GH runners, rather than an issue with the specific charm.
@barrettj12 it is so strange because we are actually using an action to increase the storage in our GH runners, plus for some folks this has happened on beefier machines. The strangest part is that this seems to resolve after a while in some cases.
Came across this when deploying CKF on MicroK8s on an AWS EC2 instance of type t3.2xlarge, using an Ubuntu image and an 80GB volume. This happened for the jupyter-controller charm.
Nothing interesting in juju debug-log --replay --include application-jupyter-controller
First part of the results of kubectl logs jupyter-controller-776fcbf4bc-hh5kf -n kubeflow. I don't see anything related here either. Is there a chance this pod got recreated? I ran juju resolved jupyter-controller/0 before getting the logs.
kubectl describe node ip-172-31-18-179
@NohaIhab this may affect the CKF 1.8 release, we may want to prioritise a fix for the issue.
cc: @kimwnasptd
This error message suggests there is an error when pulling the image from the registry.jujucharms.com container registry, but this information has to be confirmed because these errors are sometimes linked to low storage.
Are we sure this is linked to low storage? From my experience, ImagePullBackOff has happened only when the image that we want to pull doesn't exist at all. If that's the case, could it have something to do with:
1. how our GH actions are bundling images to `registry.jujucharms.com` when we publish a charm
2. how our CI is publishing/deploying the charms it runs?
Yes, let's prioritise this bug from next pulse. The first thing I would check is to indeed make sure that the ImagePullBackOff bug doesn't happen due to low storage.
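One way to make that check concrete (a hedged sketch only; the kubeflow namespace and the kubernetes Python client are assumptions) is to read the waiting reason and message straight from the failing pods' status, which separates a genuinely missing image (e.g. "not found" / "manifest unknown") from transient registry failures like the 502 Proxy Error above, and from node-level disk pressure:

```python
# Sketch: print the container waiting reason/message for pods stuck in
# ImagePullBackOff/ErrImagePull, so the underlying cause is visible in CI logs.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("kubeflow").items:
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting
        if waiting and waiting.reason in ("ImagePullBackOff", "ErrImagePull"):
            print(f"{pod.metadata.name}/{cs.name}: {waiting.reason}: {waiting.message}")
```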
Just FYI, I have submitted an issue to the Snap Store issue tracker (limited access) to see if they can help on the registry side as the message suggests it could be something on that side.
Just another bump. This issue is happening constantly in a deployment of KF on GCP:
kserve-controller/0* error idle 10.1.45.231 unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/9sv4hhxzp6mbtgn0yn...
kubeflow-profiles/0* error idle 10.1.45.233 unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/4jn95wlrsfo1po9wfu...
katib-controller/0* error idle 10.1.45.223 443/TCP,8080/TCP crash loop backoff: back-off 5m0s restarting failed container=katib-controller pod=katib-controller-596dfcf6f6-6n8b4_...
I saw something interesting in another PR, where I saw ImagePullBackOff errors:
https://github.com/canonical/kfp-operators/actions/runs/5979649899/job/16224169467?pr=269
It happened in the KFP API tests, which deploy the:
- kfp-api charm from a local charm file, and use the upstream docker image directly
- kfp-viz charm from CharmHub
- minio charm from CharmHub
What ended up happening is the kfp-viz and minio charms were initially getting ImagePullBackOff errors, but then ended up becoming active.
Specifically these were the relevant (pytest) logs:
INFO juju.model:model.py:2618 Waiting for model:
kfp-api/0 [allocating] waiting: agent initializing
kfp-viz/0 [allocating] waiting: installing agent
kfp-db/0 [allocating] waiting: installing agent
minio/0 [allocating] waiting: installing agent
...
INFO juju.model:model.py:2618 Waiting for model:
kfp-viz/0 [idle] active:
minio/0 [idle] active:
...
INFO juju.model:model.py:2618 Waiting for model:
kfp-viz/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:3de6f3cae504f087ecdae39063d6b0557d8ce20e4948ef0a4f8afba6c3e3cf27"
minio/0 [idle] active:
...
INFO juju.model:model.py:2618 Waiting for model:
kfp-viz/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:3de6f3cae504f087ecdae39063d6b0557d8ce20e4948ef0a4f8afba6c3e3cf27"
minio/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc22iyjz1q9l9g0sx5b8j/oci-image@sha256:1755999849a392bdf00b778705f4cf5c1c971a1ef55a17e9075e56f8d58bdc2f"
...
INFO juju.model:model.py:2618 Waiting for model:
kfp-viz/0 [idle] active:
minio/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc22iyjz1q9l9g0sx5b8j/oci-image@sha256:1755999849a392bdf00b778705f4cf5c1c971a1ef55a17e9075e56f8d58bdc2f"
...
INFO juju.model:model.py:2618 Waiting for model:
minio/0 [idle] active:
PASSED
@barrettj12 I think this means that the problem is not disk space, since they end up fetching the images, but rather some other transient error we need to understand. Do you think otherwise? Do you have any suggestions on how we could tackle and debug this further?
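If these really are transient registry errors, one option on the test side (a sketch against python-libjuju / pytest-operator, offered as a possibility rather than a decision) would be to let the wait tolerate temporary error states instead of failing the run on the first ImagePullBackOff, while the status and timeout still fail the test if the units never recover:

```python
# Sketch: a more tolerant wait in an integration test, assuming
# pytest-operator's ops_test fixture.
import pytest

@pytest.mark.abort_on_fail
async def test_deploy(ops_test):
    await ops_test.model.wait_for_idle(
        status="active",
        raise_on_error=False,    # do not abort on a transient ImagePullBackOff
        raise_on_blocked=False,
        timeout=60 * 30,
    )
```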
According to the snap store team (comment here), the issue "could be related to load spikes on the VMs running the registry." The team has increased the resources on those machines to prevent the issue and we should stop running into it. I'm leaving this issue open until we stop seeing the error, will check back in a week.
The snap store team mentioned this could be the solution. Since the issue is not present anymore in the Charmed Kubeflow repos' CI, we can close it. Feel free to re-open if this is still an issue.
Thank you for reporting us your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5220.
This message was autogenerated
It seems like there is an intermittent issue when deploying charms from Charmhub in GH runners, which fail to deploy with the status message: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/4jn95wlrsfo1po9wfu...
This error message suggests there is an error when pulling the image from the registry.jujucharms.com container registry, but this has to be confirmed because these errors are sometimes linked to low storage.
Important to note: this error goes away after re-running the CI. As mentioned before, it seems like an intermittent issue, but it has happened consistently across other PRs at least once.
Environment
This has happened in GH runners during integration test execution, but also on machines with more resources and better network settings. The error is not related to one charm in particular, it has happened to different ones (minio, kubeflow-volumes, etc.)
Possible causes
It can be either that the registry is not working correctly OR that we are running out of storage in the GH runners.
Examples of CIs where this behaviour has been observed
1) https://github.com/canonical/kfp-operators/actions/runs/5716319711/job/15514631356#step:8:38
2) https://github.com/canonical/kfp-operators/actions/runs/5726957265/job/15518414482#step:8:28
3) https://github.com/canonical/kfp-operators/actions/runs/5730163441/job/15533052295#step:5:77