Closed: DnPlas closed this issue 5 months ago
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5300.
This message was autogenerated
There's a similar issue filed in the action's repo: https://github.com/easimon/maximize-build-space/issues/38
Looking at the runner's storage with the different ubuntu image version:
image version 20240126.1.0 (the older one):
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 54G 30G 65% /
devtmpfs 7.9G 0 7.9G 0% /dev
tmpfs 7.9G 4.0K 7.9G 1% /dev/shm
tmpfs 1.6G 1.2M 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
/dev/loop0 64M 64M 0 100% /snap/core20/2015
/dev/loop1 41M 41M 0 100% /snap/snapd/20290
/dev/sda15 105M 6.1M 99M 6% /boot/efi
/dev/loop2 92M 92M 0 100% /snap/lxd/24061
/dev/sdb1 63G 4.1G 56G 7% /mnt
tmpfs 1.6G 0 1.6G 0% /run/user/1001
image version 20240131.1.0 (the current one):
Filesystem Size Used Avail Use% Mounted on
/dev/root 73G 54G 19G 75% /
devtmpfs 7.9G 0 7.9G 0% /dev
tmpfs 7.9G 4.0K 7.9G 1% /dev/shm
tmpfs 1.6G 1.2M 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
/dev/loop0 64M 64M 0 100% /snap/core20/2105
/dev/sdb15 105M 6.1M 99M 6% /boot/efi
/dev/loop2 41M 41M 0 100% /snap/snapd/20671
/dev/loop1 92M 92M 0 100% /snap/lxd/24061
/dev/sda1 74G 4.1G 66G 6% /mnt
tmpfs 1.6G 0 1.6G 0% /run/user/1001
it seems that the distribution of space on the runner has changed, where:

- / decreased from a total of 84G to 73G, i.e. a decrease of 11G
- /mnt increased from a total of 63G to 74G, i.e. an increase of 11G

so the action is no longer able to free 40G on the root filesystem, as specified in the input root-reserve-mb. In the past the action was able to get the available space on root from 30G to 40G, so now it should be able to get root from 19G to 29G by setting the input root-reserve-mb to 29696 (the input is in MB). Note that the freed-up space at the end should not be affected, because the extra space on the temp disk will be utilized.

I tested with root-reserve-mb set to 29696 in this run. It would be ideal if we could make the input dynamic, but I'm not sure that's feasible.
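For reference, a minimal sketch of what the adjusted step could look like in a workflow (the version tag and the remove-* inputs are assumptions, not taken from our workflows):

    - name: Maximise GH runner space
      uses: easimon/maximize-build-space@v8   # assumed version tag
      with:
        # reserve ~29G on /: 29 * 1024 MB = 29696
        root-reserve-mb: 29696
        remove-dotnet: 'true'     # assumed; reclaims preinstalled software
        remove-android: 'true'
        remove-haskell: 'true'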
Thanks for looking into it @NohaIhab! I think the approach is right: we need to decrease the root-reserve-mb number to fit the new storage size. Let's do a cannon run for all the affected repositories.
I still see the issue, e.g. in https://github.com/canonical/kfp-operators/pull/416
It looks like there is not enough disk space even after @NohaIhab's PR. I debugged by SSHing into the worker after the failed tests; disk space is almost completely exhausted:
runner@fv-az572-42:~/work/kfp-operators/kfp-operators$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 76026616 74390372 1619860 99% /
devtmpfs 8183156 0 8183156 0% /dev
tmpfs 8187672 4 8187668 1% /dev/shm
tmpfs 1637536 3212 1634324 1% /run
tmpfs 5120 0 5120 0% /run/lock
tmpfs 8187672 0 8187672 0% /sys/fs/cgroup
/dev/sdb15 106858 6186 100673 6% /boot/efi
/dev/loop0 65536 65536 0 100% /snap/core20/2182
/dev/loop1 40064 40064 0 100% /snap/snapd/21184
/dev/loop2 94080 94080 0 100% /snap/lxd/24061
/dev/sda1 76829444 71714284 1166720 99% /mnt
tmpfs 1637532 0 1637532 0% /run/user/1001
/dev/mapper/buildvg-buildlv 76661516 264724 76380408 1% /home/runner/work/kfp-operators/kfp-operators
/dev/loop5 106496 106496 0 100% /snap/core/16928
/dev/loop6 76032 76032 0 100% /snap/core22/1122
/dev/loop7 152192 152192 0 100% /snap/lxd/27049
tmpfs 1024 0 1024 0% /var/snap/lxd/common/ns
/dev/loop8 93568 93568 0 100% /snap/juju/25751
/dev/loop9 256 256 0 100% /snap/jq/6
/dev/loop10 28032 28032 0 100% /snap/charm/712
/dev/loop11 29312 29312 0 100% /snap/charmcraft/2453
/dev/loop12 1536 1536 0 100% /snap/juju-bundle/25
/dev/loop13 12544 12544 0 100% /snap/juju-crashdump/271
/dev/loop14 57088 57088 0 100% /snap/core18/2812
/dev/loop15 167552 167552 0 100% /snap/microk8s/6575
/dev/loop16 12288 12288 0 100% /snap/kubectl/3206
I can also see pods not being scheduled because of:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m13s default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
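For completeness, one way to confirm the taint directly on the node (a sketch, assuming microk8s' bundled kubectl, as used in our CI):

    # Show the node's taints; the kubelet sets node.kubernetes.io/disk-pressure
    # automatically when disk usage crosses its eviction threshold
    microk8s kubectl describe node | grep -A 2 Taints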
Did more digging and the main problem is the images in microk8s that we use in the kfp-bundle tests, which take up 15GB of disk space. Some of these images are deployed multiple times as pods.
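To see where that space goes on a live runner, a rough sketch (the path is microk8s' default containerd location; adjust if the snap layout differs):

    # List images pulled by microk8s' containerd, with their sizes
    microk8s ctr images ls

    # Total space used by the containerd content store
    sudo du -sh /var/snap/microk8s/common/var/lib/containerd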
In investigating canonical/kfp-operators#426 I noticed that easimon/maximize-build-space no longer frees up nearly as much space as it used to. This probably happened at the same time as the GH runner change discussed in this issue.
FWIW, I've found that the jlumbroso/free-disk-space action with default settings works better, leaving the runner with ~45GB free after execution. An example of this is in kfp's tests.
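For illustration, using it with default settings is a single step, along these lines (the step name and the @main ref are assumptions; pinning a release is safer):

    - name: Free disk space
      # the defaults already remove the large preinstalled toolchains and
      # images, so no 'with:' inputs are strictly needed
      uses: jlumbroso/free-disk-space@main   # assumed ref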
I haven't seen this issue anymore, and we have changed a lot of CIs around. I'll close it because all our CIs that use the maximise runner space action are no longer failing (see all the attached PRs and commits).
Bug Description
Running the automated integration tests in the CI on a PR is not possible, as the Maximise GH runner space step is failing with the following message:

To Reproduce
Create a PR in any of the Charmed Kubeflow owned repositories where the aforementioned step is used.
Environment
CI environment.
Relevant Log Output
Besides the one provided already, you can refer to this error
Affected repositories (from PRs)