Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6575.
(This message was autogenerated.)
Adding a `df -h` before every test showed that this is a disk space issue (output from [this CI run](https://github.com/canonical/kfp-operators/actions/runs/11909109360/job/33193929888#step:6:201)):
```
# df -h before test_create_and_monitor_recurring_run
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   72G  1.4G  99% /
devtmpfs        7.9G     0  7.9G   0% /dev
tmpfs           7.9G  4.0K  7.9G   1% /dev/shm
tmpfs           1.6G  3.2M  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sdb15      105M  6.1M   99M   6% /boot/efi
/dev/loop0       64M   64M     0 100% /snap/core20/2379
/dev/loop1       39M   39M     0 100% /snap/snapd/21759
/dev/loop2       92M   92M     0 100% /snap/lxd/29619
/dev/sda1        74G   28K   70G   1% /mnt
tmpfs           1.6G  4.0K  1.6G   1% /run/user/1001
/dev/loop3      105M  105M     0 100% /snap/core/17200
/dev/loop4       74M   74M     0 100% /snap/core22/1663
/dev/loop5      105M  105M     0 100% /snap/lxd/30130
tmpfs           1.0M     0  1.0M   0% /var/snap/lxd/common/ns
/dev/loop6       95M   95M     0 100% /snap/juju/28491
/dev/loop7       28M   28M     0 100% /snap/charm/712
/dev/loop8       59M   59M     0 100% /snap/charmcraft/4914
/dev/loop9      256K  256K     0 100% /snap/jq/6
/dev/loop10     1.5M  1.5M     0 100% /snap/juju-bundle/25
/dev/loop11      13M   13M     0 100% /snap/juju-crashdump/271
/dev/loop12     163M  163M     0 100% /snap/microk8s/7396
/dev/loop13      13M   13M     0 100% /snap/kubectl/3446
```
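For reference, here is a minimal sketch of how such a per-test `df -h` log can be wired in with an autouse pytest fixture, assuming the integration tests run under pytest (the fixture name and exact placement are illustrative, not the actual change):

```python
import subprocess

import pytest


@pytest.fixture(autouse=True)
def log_disk_usage(request):
    """Print `df -h` before each test so disk pressure is visible in the CI logs."""
    output = subprocess.run(["df", "-h"], capture_output=True, text=True).stdout
    print(f"# df -h before {request.node.name}\n{output}")
```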
This also explains why the `minio` unit is blocked after the tests complete:

```
minio/0*  blocked  idle  9000-9001/TCP  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes ..
```

with its pod being `Pending` and the node having a `NoSchedule` taint due to disk pressure.
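The taint can be confirmed directly on the node. A small sketch using the same `sh` library, assuming a `kubectl` binary or alias is available on the runner (with MicroK8s this may need to be `microk8s kubectl` instead):

```python
import sh

# Print the taints on every node; a disk-pressure taint shows up with
# key "node.kubernetes.io/disk-pressure" and effect "NoSchedule".
print(sh.kubectl("get", "nodes", "-o", "jsonpath={.items[*].spec.taints}"))
```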
We plan to refactor our tests soon and build each charm in a separate runner, which will eliminate the issue of the runner running out of space. In the meantime, we will work around this by deleting the LXD instances after the build and deployment have completed. This will be achieved with a snippet based on https://discourse.charmhub.io/t/how-to-quickly-clean-unused-lxd-instances-from-charmcraft-pack/15975:
```python
import jq
import sh

# List the LXD instances in charmcraft's "charmcraft" project as JSON
lxc_instances = str(sh.lxc.list(project="charmcraft", format="json"))
# Keep only the instances whose name starts with "charmcraft-"
lxc_instances_charmcraft = jq.compile('.[] | select(.name | startswith("charmcraft-")) | .name').input_text(lxc_instances).all()
for instance in lxc_instances_charmcraft:
    print(f"Deleting {instance}")
    sh.lxc.delete(instance, project="charmcraft")
```
Implementing the above actually freed 11 GB:
```
df -h before test_create_and_monitor_recurring_run ASSERT
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   62G   12G  85% /
```
which resulted in the tests passing :tada:
To avoid making this behaviour the default, we will also introduce a flag to enable/disable it, as sketched below.
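As a sketch of what that flag could look like (the variable and function names here are hypothetical, not the final interface), the cleanup could simply be gated on an environment variable set by the CI workflow:

```python
import os


def maybe_clean_charmcraft_instances() -> None:
    """Run the LXC cleanup only when explicitly enabled by the CI workflow."""
    # DELETE_CHARMCRAFT_INSTANCES is a hypothetical opt-in flag, not an existing variable.
    if os.environ.get("DELETE_CHARMCRAFT_INSTANCES", "false").lower() != "true":
        return
    # ... run the deletion snippet from above ...
```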
EDIT:
After discussions with @NohaIhab, we decided to move to an approach that uses `charmcraft clean` for the deletion, which is more deterministic and less error-prone, given that the script above made some assumptions, e.g. the name filter:

```python
lxc_instances_charmcraft = jq.compile('.[] | select(.name | startswith("charmcraft-")) | .name').input_text(lxc_instances).all()
```
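As a rough sketch of that direction (the charm directory paths and the use of the `sh` library here are illustrative): `charmcraft clean` removes the build environments for the project in a given directory, so no assumptions about instance names are needed:

```python
import sh

# Placeholder list of charm directories; the real list would be derived from the
# repository layout or the CI matrix.
CHARM_DIRS = ["charms/kfp-api", "charms/kfp-ui"]

for charm_dir in CHARM_DIRS:
    # `charmcraft clean` deletes the build instances belonging to this charm project.
    sh.charmcraft.clean(_cwd=charm_dir)
```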
Closed by #616
Bug Description
Intermittently with #601, tests fail during `test_create_and_monitor_recurring_run`.
Looking at the juju status, it looks like the node could be out of space.
To Reproduce
Rerun CI from PR https://github.com/canonical/kfp-operators/pull/583
Environment
Juju 3.4.6, MicroK8s 1.29
Relevant Log Output
Additional Context
No response