canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

CI runs fail with charms in error: `ImagePullBackOff": Back-off pulling image "registry.jujucharms.com` #655

Closed DnPlas closed 9 months ago

DnPlas commented 1 year ago

It seems like there is an intermittent issue when deploying charms from Charmhub in GH runners, which fail to deploy with the status message: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/4jn95wlrsfo1po9wfu....

This error message suggests a failure when pulling the image from the registry.jujucharms.com container registry, but that still has to be confirmed, because these errors are sometimes linked to low storage on the node.

Importantly, this error goes away after re-running the CI. As mentioned before, it seems like an intermittent issue, but it has happened at least once across several other PRs.

Environment

This has happened in GH runners during integration test execution, but also on machines with more resources and better network settings. The error is not related to one charm in particular; it has happened to different ones (minio, kubeflow-volumes, etc.).

Possible causes

Either the registry is not working correctly, or we are running out of storage in the GH runners.
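A rough way to tell these two apart when the failure shows up (a sketch, assuming a MicroK8s-backed runner; `minio-0` in the `kubeflow` namespace is just an example of an affected pod):

```
# 1) Storage hypothesis: does the kubelet report disk pressure, and how full is the runner's disk?
microk8s kubectl describe node | grep -E "Taints|DiskPressure"
df -h /

# 2) Registry hypothesis: what exact pull error shows up in the pod events?
microk8s kubectl describe pod minio-0 -n kubeflow | grep -A 15 "^Events:"
```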

Examples of CIs where this behaviour has been observed

1) https://github.com/canonical/kfp-operators/actions/runs/5716319711/job/15514631356#step:8:38
2) https://github.com/canonical/kfp-operators/actions/runs/5726957265/job/15518414482#step:8:28
3) https://github.com/canonical/kfp-operators/actions/runs/5730163441/job/15533052295#step:5:77

i-chvets commented 1 year ago

On track/2.0, the following failures were observed:

```
Model Controller Cloud/Region Version SLA Timestamp
test-charm-yjgo github-pr-77e6a-microk8s microk8s/localhost 2.9.44 unsupported 11:52:17Z
App Version Status Scale Charm Channel Rev Address Exposed Message
grafana-k8s 9.2.1 active 1 grafana-k8s latest/stable 81 10.152.183.115 no
kfp-api waiting 1 kfp-api 0 10.152.183.6 no installing agent
kfp-db mariadb/server:10.3 active 1 charmed-osm-mariadb-k8s latest/stable 35 10.152.183.232 no ready
kfp-viz res:oci-image@3de6f3c blocked 0/1 kfp-viz 2.0/stable 476 10.152.183.34 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
minio res:oci-image@1755999 blocked 0/1 minio ckf-1.7/stable 186 10.152.183.121 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
mysql-k8s 8.0.32-0ubuntu0.22.04.2 waiting 1 mysql-k8s 8.0/edge 88 10.152.183.231 no waiting for units to settle down
prometheus-k8s 2.33.5 active 1 prometheus-k8s latest/stable 103 10.152.183.141 no
prometheus-scrape-config-k8s n/a active 1 prometheus-scrape-config-k8s latest/stable 39 10.152.183.123 no
Unit Workload Agent Address Ports Message
grafana-k8s/0* active idle 10.1.76.158
kfp-api/0* blocked idle 10.1.76.137 Please add required database relation: eg. relational-db
kfp-db/0* active idle 10.1.76.145 3306/TCP ready
kfp-viz/0 unknown lost 10.1.76.146 8888/TCP agent lost, see 'juju show-status-log kfp-viz/0'
minio/0 unknown lost 10.1.76.149 9000/TCP,9001/TCP agent lost, see 'juju show-status-log minio/0'
mysql-k8s/0* error idle 10.1.76.160 hook failed: "database-relation-changed"
prometheus-k8s/0* active idle 10.1.76.157
prometheus-scrape-config-k8s/0* active idle 10.1.76.153
```
DnPlas commented 1 year ago

Thanks @i-chvets, I was exactly looking for this kind of log minio res:oci-image@1755999 blocked 0/1 minio ckf-1.7/stable 186 10.152.183.121 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes .... That is a symptom of the machine running out of resources. We shall investigate a bit more.

ca-scribner commented 1 year ago

Something feels a bit funny here. afaik these VMs should consistently get the same disk space, so it feels odd that we'd have an intermittent disk space error. But I don't have a better idea of what is going on

i-chvets commented 1 year ago

Other components can also experience similar issues:

```
kfp-viz                       res:oci-image@3de6f3c    blocked    0/1  kfp-viz                       2.0/stable      476  10.152.183.34   no       0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
minio                         res:oci-image@1755999    blocked    0/1  minio                         ckf-1.7/stable  186  10.152.183.121  no       0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
```
i-chvets commented 1 year ago

Something feels a bit funny here. afaik these VMs should consistently get the same disk space, so it feels odd that we'd have an intermittent disk space error. But I don't have a better idea of what is going on

It is not really a VM disk issue, but storage in MicroK8s.
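To be concrete, what usually matters here is the filesystem backing MicroK8s' containerd image store rather than the VM disk as a whole; the kubelet taints the node with node.kubernetes.io/disk-pressure once free space drops below its eviction threshold (roughly 10-15% by default). A minimal sketch for checking this, assuming the default path of a snap-installed MicroK8s:

```
# Disk usage of the filesystem that holds MicroK8s' containerd images/layers
df -h /var/snap/microk8s/common/var/lib/containerd

# What containerd currently has cached inside MicroK8s
microk8s ctr images ls | head -n 20
```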

kimwnasptd commented 1 year ago

but storage in MicroK8s.

I want a cookie for that :)

i-chvets commented 1 year ago

Another random error in KFP:

kfp-profile-controller   python:3.7             error        1  kfp-profile-controller                    0                 no       creating or updating custom resources: getting custom resources: attempt count exceeded: getting custom resource defi...

This is our charm trying to apply K8s resources and failing after many attempts. I will add more logs when they are uploaded; the test is still going and failing all over the place.

i-chvets commented 1 year ago

Latest in KFP:

minio/0*                 error        idle   10.1.170.223  9000/TCP,9001/TCP  unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc...

Corresponding log: https://pastebin.canonical.com/p/MbXsNbz6RY/

DnPlas commented 1 year ago

Interestingly enough, I ran into something very similar today in the dex-auth-operator repo. Look at this CI execution.

orfeas-k commented 1 year ago

I came across this error here with argo-server. Attaching any related logs I could find, but if you think there are more, let me know. Error:

argo-server/0*           error        idle        10.1.21.142  2746/TCP  unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/hs4bjttgix7e2kf318...

juju status logs for unit-argo-server-0:

```
04 Aug 2023 11:15:15Z  juju-unit  executing    running config-changed hook
04 Aug 2023 11:15:18Z  workload   maintenance  Setting pod spec
04 Aug 2023 11:15:21Z  juju-unit  executing    running start hook
04 Aug 2023 11:15:24Z  juju-unit  idle
04 Aug 2023 11:17:25Z  juju-unit  error        OCI image pull error: rpc error: code = Unknown desc = failed to pull and unpack image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.jujucharms.com/v2/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image/blobs/sha256:164db2618c02555da834fe4893b2922dba49f830ff886a4084ad515a2365a19e: 502 Proxy Error
04 Aug 2023 11:17:25Z  workload   error        OCI image pull error: rpc error: code = Unknown desc = failed to pull and unpack image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.jujucharms.com/v2/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image/blobs/sha256:164db2618c02555da834fe4893b2922dba49f830ff886a4084ad515a2365a19e: 502 Proxy Error
04 Aug 2023 11:17:38Z  juju-unit  error        unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2"
04 Aug 2023 11:17:38Z  workload   error        unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image@sha256:576d03880b4d608b00607902be8f52692e2b8d40f9fdc21992b65447a93614c2"
```

Couldn't find anything useful in juju debug-log, but let me know if you'd like to investigate yourself.
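Since the status log shows a 502 Proxy Error from the registry, a quick check would be to request the same blob directly and see what the proxy returns. A sketch using the exact blob URL from the log above; an anonymous request may well come back as 401/403 rather than the blob itself, but a 502 would still point at the registry side:

```
# Probe the blob that failed to pull; we only care about the HTTP status code here
curl -sS -o /dev/null -w "%{http_code}\n" \
  "https://registry.jujucharms.com/v2/charm/hs4bjttgix7e2kf3188j168vjafe2mfmr8m16/oci-image/blobs/sha256:164db2618c02555da834fe4893b2922dba49f830ff886a4084ad515a2365a19e"
```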

phoevos commented 1 year ago

Just bumped into this for a different charm in a VM of my own running MicroK8s:

```
juju show-status mlflow-minio
Model     Controller       Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s-controller  microk8s/localhost  2.9.44   unsupported  08:28:37Z

App           Version                Status   Scale  Charm  Channel       Rev  Address         Exposed  Message
mlflow-minio  res:oci-image@1755999  waiting      1  minio  ckf-1.7/edge  186  10.152.183.144  no

Unit             Workload  Agent  Address       Ports              Message
mlflow-minio/0*  error     idle   10.1.154.150  9000/TCP,9001/TCP  unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc...
```

So this is neither KFP nor GH runner specific.

barrettj12 commented 1 year ago

Thanks @i-chvets, I was exactly looking for this kind of log minio res:oci-image@1755999 blocked 0/1 minio ckf-1.7/stable 186 10.152.183.121 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes .... That is a symptom of the machine running out of resources. We shall investigate a bit more.

I've seen this failure in some of our Juju CI before (just simple bootstrap/smoke tests). Probably something to do with K8s storage on GH runners, rather than an issue with the specific charm.

DnPlas commented 1 year ago

Thanks @i-chvets, I was exactly looking for this kind of log minio res:oci-image@1755999 blocked 0/1 minio ckf-1.7/stable 186 10.152.183.121 no 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes .... That is a symptom of the machine running out of resources. We shall investigate a bit more.

I've seen this failure in some of our Juju CI before (just simple bootstrap/smoke tests). Probably something to do with K8s storage on GH runners, rather than an issue with the specific charm.

@barrettj12 it is so strange because we are actually using an action to increase the storage in our GH runners, plus for some folks this has happened on beefier machines. The strangest part is that in some cases this seems to resolve itself after a while.
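For context, the storage-related part of that CI setup is essentially a cleanup step along these lines (a sketch of a typical free-disk-space step on Ubuntu GH runners, not the exact action we use; the removed paths are the usual large preinstalled toolchains):

```
# Remove large preinstalled toolchains the Kubeflow CI does not need
sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc
# Reclaim space from unused Docker images/layers
sudo docker system prune --all --force
# Show what we ended up with
df -h /
```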

orfeas-k commented 1 year ago

Came across this when deploying CKF on MicroK8s running on an AWS EC2 instance of type t3.2xlarge, using an Ubuntu image and an 80GB volume. This happened for the jupyter-controller charm.

Nothing interesting in juju debug-log --replay --include application-jupyter-controller

Logs ``` application-jupyter-controller: 08:03:40 INFO juju.cmd running jujud [2.9.44 9b1d995577fda4545f9d3f227633eaa3948af7ea gc go1.20.5] application-jupyter-controller: 08:03:40 DEBUG juju.cmd args: []string{"/var/lib/juju/tools/jujud", "caasoperator", "--application-name=jupyter-controller", "--debug"} application-jupyter-controller: 08:03:40 DEBUG juju.agent read agent config, format "2.0" application-jupyter-controller: 08:03:40 INFO juju.worker.upgradesteps upgrade steps for 2.9.44 have already been run. application-jupyter-controller: 08:03:40 INFO juju.cmd.jujud caas operator application-jupyter-controller start (2.9.44 [gc]) application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "caas-units-manager" manifold worker started at 2023-08-08 08:03:40.752105275 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "agent" manifold worker started at 2023-08-08 08:03:40.752150301 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "caas-units-manager" manifold worker completed successfully application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "clock" manifold worker started at 2023-08-08 08:03:40.752323674 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.introspection introspection worker listening on "@jujud-application-jupyter-controller" application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "api-config-watcher" manifold worker started at 2023-08-08 08:03:40.75281497 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "upgrade-steps-gate" manifold worker started at 2023-08-08 08:03:40.752843541 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.introspection stats worker now serving application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "caas-units-manager" manifold worker started at 2023-08-08 08:03:40.762761058 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.apicaller connecting with old password application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "upgrade-steps-flag" manifold worker started at 2023-08-08 08:03:40.76417535 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.api looked up controller-service.controller-microk8s-localhost.svc.cluster.local -> [10.152.183.99] application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "migration-fortress" manifold worker started at 2023-08-08 08:03:40.775571726 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.api successfully dialed "wss://controller-service.controller-microk8s-localhost.svc.cluster.local:17070/model/bb9a51f1-9044-49dc-8174-307e1943c792/api" application-jupyter-controller: 08:03:40 INFO juju.api connection established to "wss://controller-service.controller-microk8s-localhost.svc.cluster.local:17070/model/bb9a51f1-9044-49dc-8174-307e1943c792/api" application-jupyter-controller: 08:03:40 INFO juju.worker.apicaller [bb9a51] "application-jupyter-controller" successfully connected to "controller-service.controller-microk8s-localhost.svc.cluster.local:17070" application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "api-caller" manifold worker started at 2023-08-08 08:03:40.781765193 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "caas-units-manager" manifold worker completed successfully application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "caas-units-manager" manifold worker started at 2023-08-08 08:03:40.79101777 
+0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "upgrade-steps-runner" manifold worker started at 2023-08-08 08:03:40.791113897 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "upgrade-steps-runner" manifold worker completed successfully application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "migration-minion" manifold worker started at 2023-08-08 08:03:40.793410802 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "log-sender" manifold worker started at 2023-08-08 08:03:40.793487571 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "upgrader" manifold worker started at 2023-08-08 08:03:40.793555676 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "migration-inactive-flag" manifold worker started at 2023-08-08 08:03:40.794859219 +0000 UTC application-jupyter-controller: 08:03:40 INFO juju.worker.migrationminion migration phase is now: NONE application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "charm-dir" manifold worker started at 2023-08-08 08:03:40.806019266 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "proxy-config-updater" manifold worker started at 2023-08-08 08:03:40.806143122 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "api-address-updater" manifold worker started at 2023-08-08 08:03:40.806223093 +0000 UTC application-jupyter-controller: 08:03:40 DEBUG juju.worker.logger initial log config: "=DEBUG" application-jupyter-controller: 08:03:40 DEBUG juju.worker.dependency "logging-config-updater" manifold worker started at 2023-08-08 08:03:40.80632682 +0000 UTC application-jupyter-controller: 08:03:40 INFO juju.worker.logger logger worker started application-jupyter-controller: 08:03:40 INFO juju.worker.caasupgrader abort check blocked until version event received application-jupyter-controller: 08:03:40 DEBUG juju.worker.caasupgrader current agent binary version: 2.9.44 application-jupyter-controller: 08:03:40 INFO juju.worker.caasupgrader unblocking abort check application-jupyter-controller: 08:03:40 DEBUG juju.worker.logger reconfiguring logging from "=DEBUG" to "=INFO" application-jupyter-controller: 08:03:40 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: "" application-jupyter-controller: 08:03:40 INFO juju.worker.caasoperator.charm downloading ch:amd64/focal/jupyter-controller-607 from API server application-jupyter-controller: 08:03:40 INFO juju.downloader downloading from ch:amd64/focal/jupyter-controller-607 application-jupyter-controller: 08:03:41 INFO juju.downloader download complete ("ch:amd64/focal/jupyter-controller-607") application-jupyter-controller: 08:03:41 INFO juju.downloader download verified ("ch:amd64/focal/jupyter-controller-607") application-jupyter-controller: 08:04:06 INFO juju.worker.caasoperator operator "jupyter-controller" started application-jupyter-controller: 08:04:06 INFO juju.worker.caasoperator.runner start "jupyter-controller/0" application-jupyter-controller: 08:04:06 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-jupyter-controller-0 application-jupyter-controller: 08:04:06 INFO juju.worker.leadership jupyter-controller/0 promoted to leadership of jupyter-controller application-jupyter-controller: 08:04:06 INFO 
juju.worker.caasoperator.uniter.jupyter-controller/0 unit "jupyter-controller/0" started application-jupyter-controller: 08:04:06 INFO juju.worker.caasoperator.uniter.jupyter-controller/0 resuming charm install application-jupyter-controller: 08:04:06 INFO juju.worker.caasoperator.uniter.jupyter-controller/0.charm downloading ch:amd64/focal/jupyter-controller-607 from API server application-jupyter-controller: 08:04:06 INFO juju.downloader downloading from ch:amd64/focal/jupyter-controller-607 application-jupyter-controller: 08:04:06 INFO juju.downloader download complete ("ch:amd64/focal/jupyter-controller-607") application-jupyter-controller: 08:04:06 INFO juju.downloader download verified ("ch:amd64/focal/jupyter-controller-607") application-jupyter-controller: 08:04:37 INFO juju.worker.caasoperator.uniter.jupyter-controller/0 hooks are retried true application-jupyter-controller: 08:04:37 INFO juju.worker.caasoperator.uniter.jupyter-controller/0 found queued "install" hook application-jupyter-controller: 08:04:38 INFO unit.jupyter-controller/0.juju-log Running legacy hooks/install. application-jupyter-controller: 08:04:39 WARNING unit.jupyter-controller/0.juju-log 0 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion. ```

First part of the results of kubectl logs jupyter-controller-776fcbf4bc-hh5kf -n kubeflow. I don't see anything related here either. Is there a chance this pod got recreated? I ran juju resolved jupyter-controller-0 before getting the logs (see the sketch after the node description below for one way to check).

``` Defaulted container "jupyter-controller" out of: jupyter-controller, juju-pod-init (init) I0808 08:37:01.614048 1 request.go:665] Waited for 1.038423544s due to client-side throttling, not priority and fairness, request: GET:https://10.152.183.1:443/apis/install.istio.io/v1alpha1?timeout=32s 1.6914838221678545e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"} 1.6914838221683755e+09 INFO setup starting manager 1.6914838221701238e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"} 1.6914838221701345e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"} 1.6914838221702387e+09 INFO controller.notebook Starting EventSource {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "source": "kind source: *v1beta1.Notebook"} 1.6914838221702564e+09 INFO controller.Culler Starting EventSource {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "source": "kind source: *v1beta1.Notebook"} 1.691483822170275e+09 INFO controller.notebook Starting EventSource {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "source": "kind source: *v1.StatefulSet"} 1.6914838221702797e+09 INFO controller.Culler Starting Controller {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook"} 1.6914838221702847e+09 INFO controller.notebook Starting EventSource {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "source": "kind source: *v1.Service"} 1.6914838221702945e+09 INFO controller.notebook Starting EventSource {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "source": "kind source: *unstructured.Unstructured"} 1.691483822170307e+09 INFO controller.notebook Starting EventSource {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "source": "kind source: *v1.Pod"} 1.6914838221703138e+09 INFO controller.notebook Starting EventSource {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "source": "kind source: *v1.Event"} 1.691483822170319e+09 INFO controller.notebook Starting Controller {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook"} 1.6914838222729275e+09 INFO controller.Culler Starting workers {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "worker count": 1} 1.691483822273092e+09 INFO controller.notebook Starting workers {"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "worker count": 1} ```

kubectl describe node ip-172-31-18-179

``` Name: ip-172-31-18-179 Roles: Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-172-31-18-179 kubernetes.io/os=linux microk8s.io/cluster=true node.kubernetes.io/microk8s-controlplane=microk8s-controlplane Annotations: node.alpha.kubernetes.io/ttl: 0 projectcalico.org/IPv4Address: 172.31.18.179/20 projectcalico.org/IPv4VXLANTunnelAddr: 10.1.224.0 volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Tue, 08 Aug 2023 07:58:25 +0000 Taints: Unschedulable: false Lease: HolderIdentity: ip-172-31-18-179 AcquireTime: RenewTime: Tue, 08 Aug 2023 09:20:26 +0000 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- NetworkUnavailable False Tue, 08 Aug 2023 07:59:20 +0000 Tue, 08 Aug 2023 07:59:20 +0000 CalicoIsUp Calico is running on this node MemoryPressure False Tue, 08 Aug 2023 09:16:51 +0000 Tue, 08 Aug 2023 07:58:25 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Tue, 08 Aug 2023 09:16:51 +0000 Tue, 08 Aug 2023 07:58:25 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Tue, 08 Aug 2023 09:16:51 +0000 Tue, 08 Aug 2023 07:58:25 +0000 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Tue, 08 Aug 2023 09:16:51 +0000 Tue, 08 Aug 2023 07:58:57 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled Addresses: InternalIP: 172.31.18.179 Hostname: ip-172-31-18-179 Capacity: cpu: 8 ephemeral-storage: 81106868Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 32512104Ki pods: 110 Allocatable: cpu: 8 ephemeral-storage: 80058292Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 32409704Ki pods: 110 System Info: Machine ID: ec20d13942c55ebeb59a971d6829ba1e System UUID: ec20d139-42c5-5ebe-b59a-971d6829ba1e Boot ID: 7aa6f4de-bff9-4c9b-8e9d-553aff575f4d Kernel Version: 5.19.0-1025-aws OS Image: Ubuntu 22.04.2 LTS Operating System: linux Architecture: amd64 Container Runtime Version: containerd://1.5.13 Kubelet Version: v1.24.16-2+40f0d5a1bef7b4 Kube-Proxy Version: v1.24.16-2+40f0d5a1bef7b4 Non-terminated Pods: (83 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- kube-system calico-node-flmn4 250m (3%) 0 (0%) 0 (0%) 0 (0%) 81m kube-system calico-kube-controllers-74959db457-2mnhq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 81m kube-system coredns-66bcf65bb8-mr7mg 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 81m metallb-system controller-5d468955f-r2tqg 100m (1%) 100m (1%) 100Mi (0%) 100Mi (0%) 80m metallb-system speaker-rzdhj 100m (1%) 100m (1%) 100Mi (0%) 100Mi (0%) 80m ingress nginx-ingress-microk8s-controller-mlv4h 0 (0%) 0 (0%) 0 (0%) 0 (0%) 80m kube-system hostpath-provisioner-78cb89d65b-g2h8f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 80m controller-microk8s-localhost controller-0 0 (0%) 0 (0%) 3Gi (9%) 3Gi (9%) 78m controller-microk8s-localhost modeloperator-6645d9855d-5pknp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 78m kubeflow modeloperator-6cd9dccc8d-2vrwt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 78m kubeflow admission-webhook-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 77m kubeflow argo-controller-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 77m kubeflow argo-server-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 77m kubeflow jupyter-controller-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 76m kubeflow katib-controller-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 76m kubeflow istio-ingressgateway-0 0 (0%) 0 (0%) 0 (0%) 0 
(0%) 77m kubeflow admission-webhook-5dbbc5587f-tv7rp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 76m kubeflow dex-auth-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 77m kubeflow istiod-58687f9d68-5ls6r 500m (6%) 0 (0%) 2Gi (6%) 0 (0%) 75m kubeflow jupyter-ui-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 76m kubeflow katib-db-manager-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 76m kubeflow istio-pilot-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 77m kubeflow katib-ui-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 75m kubeflow argo-server-794d85d985-lgddb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 76m kubeflow knative-eventing-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 75m kubeflow kubeflow-roles-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m kubeflow kfp-viewer-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 73m kubeflow knative-serving-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m kubeflow kfp-viz-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 72m kubeflow kfp-profile-controller-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 72m kubeflow kfp-schedwf-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 72m kubeflow kfp-ui-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 72m kubeflow kfp-persistence-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71m kubeflow oidc-gatekeeper-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71m kubeflow tensorboards-web-app-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71m kubeflow kubeflow-volumes-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71m kubeflow minio-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71m kubeflow tensorboard-controller-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71m kubeflow metacontroller-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m kubeflow istio-ingressgateway-workload-5dcdfb989-92qd8 10m (0%) 2 (25%) 40Mi (0%) 1Gi (3%) 75m kubeflow katib-db-0 0 (0%) 0 (0%) 4Gi (12%) 4Gi (12%) 76m kubeflow kfp-db-0 0 (0%) 0 (0%) 4Gi (12%) 4Gi (12%) 75m kubeflow katib-controller-788c9557d8-xfn6n 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m kubeflow kubeflow-dashboard-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m kubeflow seldon-controller-manager-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 73m kubeflow metacontroller-operator-charm-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 73m kubeflow training-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 73m kubeflow kfp-viewer-57d8c7ff85-4fvkw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 72m kubeflow kfp-api-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 75m kubeflow kfp-viz-674f44759f-7kn4j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 72m kubeflow kfp-schedwf-846dcf5597-9fjt6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71m kubeflow tensorboards-web-app-75fd7f7844-7ckz5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 70m kubeflow kubeflow-volumes-6d6d4df675-bvq77 0 (0%) 0 (0%) 0 (0%) 0 (0%) 70m kubeflow tensorboard-controller-74cc66698c-2qrhp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 70m kubeflow minio-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 70m kubeflow argo-controller-7bb96878f8-jh5h9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 69m kubeflow kfp-profile-controller-6fbc6d4569-k2gw8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 69m kubeflow jupyter-controller-776fcbf4bc-hh5kf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 75m knative-serving autoscaler-bc7d6c9c9-87k8f 100m (1%) 1 (12%) 100Mi (0%) 1000Mi (3%) 67m knative-serving activator-5f6b4bf5c8-f2j6z 300m (3%) 1 (12%) 60Mi (0%) 600Mi (1%) 67m knative-serving controller-687d88ff56-t4n6g 100m (1%) 1 (12%) 100Mi (0%) 1000Mi (3%) 67m knative-serving domain-mapping-69cc86d8d5-lfq5g 30m (0%) 300m (3%) 40Mi (0%) 400Mi (1%) 67m knative-serving domainmapping-webhook-65dfdd9b96-27kwq 100m (1%) 500m (6%) 100Mi (0%) 500Mi (1%) 67m knative-serving webhook-587cdd8dd7-k5z5p 100m (1%) 500m (6%) 100Mi (0%) 500Mi (1%) 67m knative-serving autoscaler-hpa-6469fbb6cd-cfzkt 30m (0%) 300m (3%) 40Mi (0%) 400Mi (1%) 67m kubeflow knative-operator-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m knative-eventing eventing-webhook-7d5b577c94-tjfnd 100m (1%) 
200m (2%) 50Mi (0%) 200Mi (0%) 66m knative-serving net-istio-controller-5fc4cc65f7-k6h2s 30m (0%) 300m (3%) 40Mi (0%) 400Mi (1%) 66m knative-serving net-istio-webhook-6c5b7cbdd5-9bggq 20m (0%) 200m (2%) 20Mi (0%) 200Mi (0%) 66m knative-eventing imc-controller-769d8b7f66-d6v9b 0 (0%) 0 (0%) 0 (0%) 0 (0%) 66m knative-eventing imc-dispatcher-55979cf74b-z68h9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 66m knative-eventing mt-broker-filter-56b5d6d697-2dmzp 100m (1%) 0 (0%) 100Mi (0%) 0 (0%) 66m knative-eventing mt-broker-ingress-5c4d45dfd6-5fhj6 100m (1%) 0 (0%) 100Mi (0%) 0 (0%) 66m knative-eventing mt-broker-controller-66b756f8bb-z5kjv 100m (1%) 0 (0%) 100Mi (0%) 0 (0%) 66m knative-eventing eventing-controller-7f448655c8-7d9fr 100m (1%) 0 (0%) 100Mi (0%) 0 (0%) 67m kubeflow kfp-persistence-54b57dc756-chzg8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 62m kubeflow kserve-controller-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m kubeflow kubeflow-profiles-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 74m kubeflow kfp-ui-6d79ddcf79-bws8p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 62m kubeflow oidc-gatekeeper-697dc55959-bz2dq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 38m kubeflow-user ml-pipeline-ui-artifact-5675b8f595-8b9d6 110m (1%) 2100m (26%) 198Mi (0%) 1524Mi (4%) 36m kubeflow-user ml-pipeline-visualizationserver-5568776585-9bxc5 150m (1%) 2500m (31%) 328Mi (1%) 2Gi (6%) 36m kubeflow-user vscode-0 1100m (13%) 3200m (40%) 8320Mi (26%) 11381663334400m (34%) 35m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 3730m (46%) 15300m (191%) memory 23518Mi (74%) 33852647014400m (102%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) Events: ```
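To answer the pod-recreation question above, something like this should show whether the pod (and its containers) survived the juju resolved call. A sketch using the pod name from the logs above; the commands are standard kubectl/juju queries:

```
# When was the pod created and how often did its containers restart?
microk8s kubectl get pod jupyter-controller-776fcbf4bc-hh5kf -n kubeflow \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}{.status.containerStatuses[*].restartCount}{"\n"}'

# Image pulls, back-offs and kills recorded for that pod
microk8s kubectl get events -n kubeflow \
  --field-selector involvedObject.name=jupyter-controller-776fcbf4bc-hh5kf \
  --sort-by=.lastTimestamp

# Juju's own view of the unit's status history
juju show-status-log jupyter-controller/0
```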
DnPlas commented 1 year ago

@NohaIhab this may affect the CKF 1.8 release, so we may want to prioritise a fix for this issue.

cc: @kimwnasptd

kimwnasptd commented 1 year ago

This error message suggests there is an error when pulling the image from the registry.jujucharms.com container registry, but this information has to be confirmed because these are sometimes linked to low storage.

Are we sure this is linked to low storage? In my experience, ImagePullBackOff has only happened when the image we want to pull doesn't exist at all. If that's the case, could it have something to do with

  1. how our GH actions are bundling images to registry.jujucharms.com when we publish a charm
  2. how our CI is running publishing/deploying the charms to run

??

Yes, let's prioritise this bug from the next pulse. The first thing I would check is indeed whether the ImagePullBackOff errors happen due to low storage.
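For (1), a cheap sanity check is to ask Charmhub directly whether the OCI resource the charm references is published at all. A sketch; minio and oci-image are example charm/resource names from this thread, and charmcraft may need to be logged in for these queries:

```
# Which resources does Charmhub have for the charm, and which revisions exist?
charmcraft resources minio
charmcraft resource-revisions minio oci-image
```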

DnPlas commented 1 year ago

This error message suggests there is an error when pulling the image from the registry.jujucharms.com container registry, but this information has to be confirmed because these are sometimes linked to low storage.

Are we sure this is linked to low storage? In my experience, ImagePullBackOff has only happened when the image we want to pull doesn't exist at all. If that's the case, could it have something to do with

1. how our GH actions are bundling images to `registry.jujucharms.com` when we publish a charm

2. how our CI is running publishing/deploying the charms to run

??

Yes, let's prioritise this bug from the next pulse. The first thing I would check is indeed whether the ImagePullBackOff errors happen due to low storage.

Just FYI, I have submitted an issue to the Snap Store issue tracker (limited access) to see if they can help on the registry side, since the error message suggests it could be something on their end.

i-chvets commented 1 year ago

Just another bump. This issue is happening constantly in a deployment of KF on GCP:

```
kserve-controller/0*          error        idle   10.1.45.231                     unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/9sv4hhxzp6mbtgn0yn...
kubeflow-profiles/0*          error        idle   10.1.45.233                     unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/4jn95wlrsfo1po9wfu...
katib-controller/0*           error        idle   10.1.45.223  443/TCP,8080/TCP   crash loop backoff: back-off 5m0s restarting failed container=katib-controller pod=katib-controller-596dfcf6f6-6n8b4_...
```
kimwnasptd commented 1 year ago

I saw something interesting in another PR with ImagePullBackOff errors: https://github.com/canonical/kfp-operators/actions/runs/5979649899/job/16224169467?pr=269

It happened in the KFP API tests. What ended up happening is that the kfp-viz and minio charms were initially getting ImagePullBackOff errors, but then ended up becoming active.

Specifically these were the relevant (pytest) logs:

INFO     juju.model:model.py:2618 Waiting for model:
  kfp-api/0 [allocating] waiting: agent initializing
  kfp-viz/0 [allocating] waiting: installing agent
  kfp-db/0 [allocating] waiting: installing agent
  minio/0 [allocating] waiting: installing agent
...
INFO     juju.model:model.py:2618 Waiting for model:
  kfp-viz/0 [idle] active: 
  minio/0 [idle] active: 
...
INFO     juju.model:model.py:2618 Waiting for model:
  kfp-viz/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:3de6f3cae504f087ecdae39063d6b0557d8ce20e4948ef0a4f8afba6c3e3cf27"
  minio/0 [idle] active: 
...
INFO     juju.model:model.py:2618 Waiting for model:
  kfp-viz/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/c2o31yht1y825t6n49mwko4wyel0rracnrjn5/oci-image@sha256:3de6f3cae504f087ecdae39063d6b0557d8ce20e4948ef0a4f8afba6c3e3cf27"
  minio/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc22iyjz1q9l9g0sx5b8j/oci-image@sha256:1755999849a392bdf00b778705f4cf5c1c971a1ef55a17e9075e56f8d58bdc2f"
...
INFO     juju.model:model.py:2618 Waiting for model:
  kfp-viz/0 [idle] active: 
  minio/0 [idle] error: unknown container reason "ImagePullBackOff": Back-off pulling image "registry.jujucharms.com/charm/81j63o4a2ldarn1umc22iyjz1q9l9g0sx5b8j/oci-image@sha256:1755999849a392bdf00b778705f4cf5c1c971a1ef55a17e9075e56f8d58bdc2f"
...
INFO     juju.model:model.py:2618 Waiting for model:
  minio/0 [idle] active: 
PASSED

@barrettj12 I think this means that the problem is not disk space, since they end up fetching the images, but rather some other transient error we need to understand. Do you think otherwise? Do you have any suggestions on how we could tackle and debug this further?
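As a stop-gap while we figure out the transient failures, the CI could retry units stuck in ImagePullBackOff, since the pulls do eventually succeed. A rough sketch; the jq filter, retry count and sleep interval are all arbitrary choices, not an agreed approach:

```
# Clear the error state on any unit whose workload message mentions ImagePullBackOff,
# so Juju/kubelet retry the image pull; give up after a few rounds.
for attempt in 1 2 3; do
  units=$(juju status --format=json | jq -r '
    .applications[].units? // {} | to_entries[]
    | select((.value["workload-status"].message // "") | test("ImagePullBackOff"))
    | .key')
  [ -z "$units" ] && break
  for unit in $units; do
    juju resolved "$unit" || true
  done
  sleep 60
done
```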

DnPlas commented 1 year ago

According to the snap store team (comment here), the issue "could be related to load spikes on the VMs running the registry." The team has increased the resources on those machines to prevent the issue and we should stop running into it. I'm leaving this issue open until we stop seeing the error, will check back in a week.

DnPlas commented 9 months ago

According to the snap store team (comment here), the issue "could be related to load spikes on the VMs running the registry." The team has increased the resources on those machines to prevent the issue and we should stop running into it. I'm leaving this issue open until we stop seeing the error, will check back in a week.

The snap store team mentioned this could be the solution. Since the issue is no longer present in the Charmed Kubeflow repos' CI, we can close it. Feel free to re-open if this is still an issue.

syncronize-issues-to-jira[bot] commented 9 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5220.

This message was autogenerated