canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

`kfp-v2` bundle tests fail during recurring run #602

Closed: orfeas-k closed this issue 3 days ago

orfeas-k commented 1 week ago

Bug Description

Intermittently, as seen in #601, tests fail during `test_create_and_monitor_recurring_run`:

AssertionError: assert 'FAILED' == 'SUCCEEDED'

  - SUCCEEDED
  + FAILED

Looking at the juju status, it looks like the node could be running out of disk space:

Model     Controller                Cloud/Region        Version  SLA          Timestamp
kubeflow  github-pr-38fa2-microk8s  microk8s/localhost  3.4.6    unsupported  14:34:17Z

App                      Version                  Status   Scale  Charm                    Channel       Rev  Address         Exposed  Message
argo-controller                                   active       1  argo-controller          latest/edge   596  10.152.183.140  no
envoy                                             active       1  envoy                    latest/edge   307  10.152.183.161  no       
istio-ingressgateway                              active       1  istio-gateway            latest/edge  1287  10.152.183.151  no       
istio-pilot                                       active       1  istio-pilot              latest/edge  1235  10.152.183.45   no       
kfp-api                                           active       1  kfp-api                                  0  10.152.183.241  no       
kfp-db                   8.0.37-0ubuntu0.22.04.3  active       1  mysql-k8s                8.0/stable    180  10.152.183.89   no
kfp-metadata-writer                               active       1  kfp-metadata-writer                      0  10.152.183.114  no
kfp-persistence                                   active       1  kfp-persistence                          0  10.152.183.247  no       
kfp-profile-controller                            active       1  kfp-profile-controller                   0  10.152.183.139  no       
kfp-schedwf                                       active       1  kfp-schedwf                              0  10.152.183.214  no       
kfp-ui                                            active       1  kfp-ui                                   0  10.152.183.248  no       
kfp-viewer                                        active       1  kfp-viewer                               0  10.152.183.170  no       
kfp-viz                                           active       1  kfp-viz                                  0  10.152.183.20   no       
kubeflow-profiles                                 active       1  kubeflow-profiles        latest/edge   456  10.152.183.58   no       
kubeflow-roles                                    active       1  kubeflow-roles           latest/edge   264  10.152.183.57   no       
metacontroller-operator                           active       1  metacontroller-operator  latest/edge   349  10.152.183.185  no       
minio                    res:oci-image@220b31a    blocked    0/1  minio                    latest/edge   376  10.152.183.226  no       0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ...
mlmd                                              active       1  mlmd                     latest/edge   243  10.152.183.225  no       

Unit                        Workload  Agent  Address      Ports          Message
argo-controller/0*          active    idle   10.1.211.74
envoy/0*                    active    idle   10.1.211.75                 
istio-ingressgateway/0*     active    idle   10.1.211.76                 
istio-pilot/0*              active    idle   10.1.211.77                 
kfp-api/0*                  active    idle   10.1.211.78                 
kfp-db/0*                   active    idle   10.1.211.82                 Primary
kfp-metadata-writer/0*      active    idle   10.1.211.81                 
kfp-persistence/0*          active    idle   10.1.211.84                 
kfp-profile-controller/0*   active    idle   10.1.211.85                 
kfp-schedwf/0*              active    idle   10.1.211.86
kfp-ui/0*                   active    idle   10.1.211.88                 
kfp-viewer/0*               active    idle   10.1.211.89                 
kfp-viz/0*                  active    idle   10.1.211.90                 
kubeflow-profiles/0*        active    idle   10.1.211.93                 
kubeflow-roles/0*           active    idle   10.1.211.91                 
metacontroller-operator/0*  active    idle   10.1.211.92                 
minio/0*                    unknown   lost   10.1.211.97  9000-9001/TCP  agent lost, see 'juju show-status-log minio/0'
mlmd/0*                     active    idle   10.1.211.95

To Reproduce

Rerun CI from PR https://github.com/canonical/kfp-operators/pull/583

Environment

Juju 3.4.6, MicroK8s 1.29

Relevant Log Output

____________________ test_create_and_monitor_recurring_run _____________________
Traceback (most recent call last):
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 341, in from_call
    result: TResult | None = func()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 242, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/threadexception.py", line 92, in pytest_runtest_call
    yield from thread_exception_runtest_hook()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/threadexception.py", line 68, in thread_exception_runtest_hook
    yield
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/unraisableexception.py", line 95, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/unraisableexception.py", line 70, in unraisable_exception_runtest_hook
    yield
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/logging.py", line 848, in pytest_runtest_call
    yield from self._runtest_for(item, "call")
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/logging.py", line 831, in _runtest_for
    yield
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/capture.py", line 879, in pytest_runtest_call
    return (yield)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 122, in _multicall
    teardown.throw(exception)  # type: ignore[union-attr]
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/skipping.py", line 257, in pytest_runtest_call
    return (yield)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 174, in pytest_runtest_call
    item.runtest()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/python.py", line 1627, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/python.py", line 159, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pytest_asyncio/plugin.py", line 529, in inner
    _loop.run_until_complete(task)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/runner/work/kfp-operators/kfp-operators/tests/integration/test_kfp_functional_v2.py", line 204, in test_create_and_monitor_recurring_run
    assert monitor_response.state == "SUCCEEDED"
AssertionError: assert 'FAILED' == 'SUCCEEDED'

  - SUCCEEDED
  + FAILED

Additional Context

No response

syncronize-issues-to-jira[bot] commented 1 week ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6575.

This message was autogenerated

orfeas-k commented 5 days ago

Adding a `df -h` before every test showed that this is a disk space issue:

# df -h before test_create_and_monitor_recurring_run
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   72G  1.4G  99% /
devtmpfs        7.9G     0  7.9G   0% /dev
tmpfs           7.9G  4.0K  7.9G   1% /dev/shm
tmpfs           1.6G  3.2M  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sdb15      105M  6.1M   99M   6% /boot/efi
/dev/loop0       64M   64M     0 100% /snap/core20/2379
/dev/loop1       39M   39M     0 100% /snap/snapd/21759
/dev/loop2       92M   92M     0 100% /snap/lxd/29619
/dev/sda1        74G   28K   70G   1% /mnt
tmpfs           1.6G  4.0K  1.6G   1% /run/user/1001
/dev/loop3      105M  105M     0 100% /snap/core/17200
/dev/loop4       74M   74M     0 100% /snap/core22/1663
/dev/loop5      105M  105M     0 100% /snap/lxd/30130
tmpfs           1.0M     0  1.0M   0% /var/snap/lxd/common/ns
/dev/loop6       95M   95M     0 100% /snap/juju/28491
/dev/loop7       28M   28M     0 100% /snap/charm/712
/dev/loop8       59M   59M     0 100% /snap/charmcraft/4914
/dev/loop9      256K  256K     0 100% /snap/jq/6
/dev/loop10     1.5M  1.5M     0 100% /snap/juju-bundle/25
/dev/loop11      13M   13M     0 100% /snap/juju-crashdump/271
/dev/loop12     163M  163M     0 100% /snap/microk8s/7396
/dev/loop13      13M   13M     0 100% /snap/kubectl/3446
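
For context, the output above comes from a `df -h` check run before each test. A minimal sketch of one way to wire that in with an autouse pytest fixture (illustrative, not necessarily the exact change in our CI):

    # conftest.py (sketch): print disk usage before every test so the CI logs
    # show when the runner starts running low on space.
    import subprocess

    import pytest

    @pytest.fixture(autouse=True)
    def log_disk_usage(request):
        # Tag the df -h output with the name of the test that is about to run.
        output = subprocess.run(["df", "-h"], capture_output=True, text=True).stdout
        print(f"# df -h before {request.node.name}\n{output}")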

This also explains why the minio unit is blocked after the tests complete:

minio/0*                    blocked      idle                   9000-9001/TCP  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ..

with its pod being pending and the node having a NoSchedule taint due to disk-pressure.
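
The taint itself can be confirmed from the node spec, for example with a small `sh`-based helper like the one we use for cleanup further below (plain `kubectl get nodes -o yaml` works just as well):

    import sh

    # Print every node's taints; the disk-pressure NoSchedule taint shows up here.
    print(sh.kubectl.get("nodes", "-o", "jsonpath={.items[*].spec.taints}"))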

node yaml (cached images list trimmed for readability)

```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 10.134.241.1/24
    projectcalico.org/IPv4VXLANTunnelAddr: 10.1.84.64
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2024-11-20T07:48:12Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: fv-az1423-406
    kubernetes.io/os: linux
    microk8s.io/cluster: "true"
    node.kubernetes.io/microk8s-controlplane: microk8s-controlplane
  name: fv-az1423-406
  resourceVersion: "10757"
  uid: b2941984-057f-4273-9150-698e1a075152
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    timeAdded: "2024-11-20T08:48:26Z"
status:
  addresses:
  - address: 10.1.0.22
    type: InternalIP
  - address: fv-az1423-406
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: 74978040Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16272920Ki
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 76026616Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16375320Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2024-11-20T07:48:33Z"
    lastTransitionTime: "2024-11-20T07:48:33Z"
    message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2024-11-20T08:48:26Z"
    lastTransitionTime: "2024-11-20T07:48:12Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2024-11-20T08:48:26Z"
    lastTransitionTime: "2024-11-20T08:48:26Z"
    message: kubelet has disk pressure
    reason: KubeletHasDiskPressure
    status: "True"
    type: DiskPressure
  - lastHeartbeatTime: "2024-11-20T08:48:26Z"
    lastTransitionTime: "2024-11-20T07:48:12Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2024-11-20T08:48:26Z"
    lastTransitionTime: "2024-11-20T07:48:24Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  # images: trimmed here; the node caches 36 images (gcr.io/ml-pipeline components,
  # registry.jujucharms.com charm OCI images, istio, calico and microk8s add-on images)
  nodeInfo:
    architecture: amd64
    bootID: a161f57c-5039-401c-a14e-89269da45048
    containerRuntimeVersion: containerd://1.6.28
    kernelVersion: 5.15.0-1074-azure
    kubeProxyVersion: v1.29.10
    kubeletVersion: v1.29.10
    machineID: b9ad7627e68e4eef8db7b98d5285b8f4
    operatingSystem: linux
    osImage: Ubuntu Core 20
    systemUUID: aa574722-ca0b-1044-8122-c325b7bbaf99
```

We plan to refactor our tests soon and build each charm in a separate runner, which will eliminate the issue of the runner running out of space. In the meantime, we will work around this by deleting the LXD instances once build and deployment have completed. This will be achieved with a snippet based on https://discourse.charmhub.io/t/how-to-quickly-clean-unused-lxd-instances-from-charmcraft-pack/15975:

    import jq
    import sh

    # List all LXD instances in the charmcraft project as JSON
    lxc_instances = sh.lxc.list(project="charmcraft", format="json")
    # Keep only the instances whose name starts with "charmcraft-"
    lxc_instances_charmcraft = jq.compile('.[] | select(.name | startswith("charmcraft-")) | .name').input_text(str(lxc_instances)).all()
    for instance in lxc_instances_charmcraft:
        print(f"Deleting {instance}")
        sh.lxc.delete(instance, project="charmcraft")

Implementing the above actually freed 11 GB:

df -h before test_create_and_monitor_recurring_run ASSERT
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   62G   12G  85% /

which resulted in the tests passing :tada:

To avoid making this behaviour the default, we will also introduce a flag to enable/disable it, as sketched below.
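
A possible shape for that flag (a sketch; the option name is illustrative):

    # conftest.py (sketch): opt-out flag for the LXD cleanup step.
    def pytest_addoption(parser):
        parser.addoption(
            "--keep-charmcraft-instances",
            action="store_true",
            default=False,
            help="Do not delete charmcraft LXD instances after the charms are built.",
        )

The cleanup helper can then check `request.config.getoption("--keep-charmcraft-instances")` before deleting anything.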

EDIT: After discussions with @NohaIhab, we decided to move to an approach that uses `charmcraft clean` for deletion, which is more deterministic and less error prone, given that the script above made some assumptions (e.g. the `startswith("charmcraft")` part).
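
A sketch of the `charmcraft clean` approach (the charm directory list is illustrative; `charmcraft clean` removes the project's own build instances, so no name matching is needed):

    import sh

    # Let charmcraft remove its own LXD build instances for each charm,
    # instead of guessing instance names from `lxc list`.
    for charm_dir in ("charms/kfp-api", "charms/kfp-ui"):  # illustrative paths
        sh.charmcraft.clean(_cwd=charm_dir)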

orfeas-k commented 3 days ago

Closed by #616