canonical / bundle-kubeflow

Charmed Kubeflow

`Maximise GH runner space` step failing in some repositories' CIs #813

Closed DnPlas closed 5 months ago

DnPlas commented 9 months ago

Bug Description

Running the automated integration tests in the CI on a PR is not possible, as the `Maximise GH runner space` step is failing with the following message:

Unmounting and removing swap file.
Creating LVM Volume.
  Creating LVM PV on root fs.
fallocate: invalid length value specified
Error: Process completed with exit code 1.
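
For context, the failing step comes from the easimon/maximize-build-space action, which (as far as I can tell from the log above) creates an LVM PV on the root filesystem by fallocating a file presumably sized as the available space on / minus the `root-reserve-mb` input; if the reserve exceeds what is actually available, the computed length is invalid and fallocate aborts. A minimal sketch of how such a step is typically wired up; the version pin, runner label, and the 40G reserve value are assumptions based on the discussion below:

```yaml
# Hypothetical workflow excerpt; version pin and input values are assumptions.
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - name: Maximise GH runner space
        uses: easimon/maximize-build-space@v8
        with:
          # If this reserve is larger than the space actually available on /,
          # the action's fallocate call fails with "invalid length value specified".
          root-reserve-mb: 40960
```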

To Reproduce

Create a PR in any of the Charmed Kubeflow owned repositories where the aforementioned step is used.

Environment

CI environment.

Relevant Log Output

Besides the one provided already, you can refer to this error

Affected repositories (from PRs)

syncronize-issues-to-jira[bot] commented 9 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5300.

This message was autogenerated

NohaIhab commented 9 months ago

There's a similar issue filed in the action's repo https://github.com/easimon/maximize-build-space/issues/38

NohaIhab commented 9 months ago

Comparing the runner's storage across the two Ubuntu image versions. Image version 20240126.1.0 (the older one):

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   54G   30G  65% /
devtmpfs        7.9G     0  7.9G   0% /dev
tmpfs           7.9G  4.0K  7.9G   1% /dev/shm
tmpfs           1.6G  1.2M  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/loop0       64M   64M     0 100% /snap/core20/2015
/dev/loop1       41M   41M     0 100% /snap/snapd/20290
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/loop2       92M   92M     0 100% /snap/lxd/24061
/dev/sdb1        63G  4.1G   56G   7% /mnt
tmpfs           1.6G     0  1.6G   0% /run/user/1001

image version 20240131.1.0 (the current one):

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   54G   19G  75% /
devtmpfs        7.9G     0  7.9G   0% /dev
tmpfs           7.9G  4.0K  7.9G   1% /dev/shm
tmpfs           1.6G  1.2M  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/loop0       64M   64M     0 100% /snap/core20/2105
/dev/sdb15      105M  6.1M   99M   6% /boot/efi
/dev/loop2       41M   41M     0 100% /snap/snapd/20671
/dev/loop1       92M   92M     0 100% /snap/lxd/24061
/dev/sda1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G     0  1.6G   0% /run/user/1001

it seems that the distribution of space on the runner has changed, where:

- the root filesystem shrank from 84G to 73G, leaving only 19G available instead of 30G
- the temp disk mounted at /mnt grew from 63G to 74G

so the action is no longer able to keep 40G free on the root filesystem, as specified in the root-reserve-mb input. In the past the action could bring the available space on root from 30G up to 40G, so now it should be able to bring root from 19G to about 29G, which means setting the root-reserve-mb input to 29696 (the input is in MB).

Note that the total freed-up space at the end should not be affected, because the extra space on the temp disk will be utilized.
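
In workflow terms the fix is just lowering the reserve to fit the new layout. A rough sketch of the adjusted step, assuming the same action; the version pin is an assumption:

```yaml
- name: Maximise GH runner space
  uses: easimon/maximize-build-space@v8
  with:
    # 29696 MB is roughly what the root fs can still provide on the
    # 20240131.1.0 image (19G available plus ~10G freed by the action).
    root-reserve-mb: 29696
```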

NohaIhab commented 9 months ago

I tested with root-reserve-mb set to 29696 in this run

it would be ideal if we could make the input dynamic, but I'm not sure if that's feasible
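
One way this could work, sketched below, is to compute the value in a preceding step from `df` output and pass it to the action as a step output. The step names, the `reserve` id, and the 1024 MB safety margin are hypothetical, and this does not account for the extra space the action itself frees on root:

```yaml
# Hypothetical sketch: derive root-reserve-mb from the space actually
# available on / instead of hard-coding it.
- name: Compute root reserve
  id: reserve
  run: |
    # Available space on / in MB, minus a 1024 MB safety margin.
    avail_mb=$(df -BM --output=avail / | tail -1 | tr -dc '0-9')
    echo "root_reserve_mb=$((avail_mb - 1024))" >> "$GITHUB_OUTPUT"

- name: Maximise GH runner space
  uses: easimon/maximize-build-space@v8
  with:
    root-reserve-mb: ${{ steps.reserve.outputs.root_reserve_mb }}
```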

DnPlas commented 9 months ago

Thanks for looking into it @NohaIhab! I think the approach is right; we need to decrease the root-reserve-mb number to fit the new storage size. Let's do a cannon run for all the affected repositories.

misohu commented 7 months ago

I still see the issue e.g. in https://github.com/canonical/kfp-operators/pull/416

It looks like there is not enough disk space even after the @NohaIhab PR. I debugged by SSHing into the worker after the failed tests; disk space is almost completely exhausted:

runner@fv-az572-42:~/work/kfp-operators/kfp-operators$ df
Filesystem                  1K-blocks     Used Available Use% Mounted on
/dev/root                    76026616 74390372   1619860  99% /
devtmpfs                      8183156        0   8183156   0% /dev
tmpfs                         8187672        4   8187668   1% /dev/shm
tmpfs                         1637536     3212   1634324   1% /run
tmpfs                            5120        0      5120   0% /run/lock
tmpfs                         8187672        0   8187672   0% /sys/fs/cgroup
/dev/sdb15                     106858     6186    100673   6% /boot/efi
/dev/loop0                      65536    65536         0 100% /snap/core20/2182
/dev/loop1                      40064    40064         0 100% /snap/snapd/21184
/dev/loop2                      94080    94080         0 100% /snap/lxd/24061
/dev/sda1                    76829444 71714284   1166720  99% /mnt
tmpfs                         1637532        0   1637532   0% /run/user/1001
/dev/mapper/buildvg-buildlv  76661516   264724  76380408   1% /home/runner/work/kfp-operators/kfp-operators
/dev/loop5                     106496   106496         0 100% /snap/core/16928
/dev/loop6                      76032    76032         0 100% /snap/core22/1122
/dev/loop7                     152192   152192         0 100% /snap/lxd/27049
tmpfs                            1024        0      1024   0% /var/snap/lxd/common/ns
/dev/loop8                      93568    93568         0 100% /snap/juju/25751
/dev/loop9                        256      256         0 100% /snap/jq/6
/dev/loop10                     28032    28032         0 100% /snap/charm/712
/dev/loop11                     29312    29312         0 100% /snap/charmcraft/2453
/dev/loop12                      1536     1536         0 100% /snap/juju-bundle/25
/dev/loop13                     12544    12544         0 100% /snap/juju-crashdump/271
/dev/loop14                     57088    57088         0 100% /snap/core18/2812
/dev/loop15                    167552   167552         0 100% /snap/microk8s/6575
/dev/loop16                     12288    12288         0 100% /snap/kubectl/3206

I can also see pods not being scheduled because of the node's disk-pressure taint:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m13s  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

I did more digging, and the main problem is the images in microk8s that we use in the kfp bundle tests, which take about 15 GB of disk space. Some of these images are deployed multiple times as pods.
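
For anyone reproducing this, a debug step along these lines (hypothetical, not part of the current workflows) can be dropped into the job to confirm where the space goes; `microk8s ctr images ls` lists the images containerd has pulled along with their sizes:

```yaml
- name: Debug disk usage
  if: failure()
  run: |
    # Overall filesystem usage on the runner.
    df -h
    # Images pulled into microk8s' containerd, with their sizes.
    sudo microk8s ctr images ls
```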

ca-scribner commented 7 months ago

While investigating canonical/kfp-operators#426 I noticed that the easimon/maximize-build-space action no longer frees up nearly as much space as it used to:

Prior to Jan 2024:

log snippet from the disk space step:

Run echo "Memory and swap:"
  echo "Memory and swap:"
  free
  echo
  swapon --show
  echo
  echo "Available storage:"
  df -h
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
Memory and swap:
              total        used        free      shared  buff/cache   available
Mem:       16375356      689700    13991684       34480     1693972    15297100
Swap:       4194300           0     4194300

NAME      TYPE      SIZE USED PRIO
/dev/dm-0 partition   4G   0B   -2

Available storage:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   44G   40G  52% /
...

After Jan 2024:

log snippet from the disk space step:

Run echo "Memory and swap:"
Memory and swap:
              total        used        free      shared  buff/cache   available
Mem:       16375356      707068    14054144       34340     1614144    15280072
Swap:       4194300           0     4194300

NAME      TYPE      SIZE USED PRIO
/dev/dm-0 partition   4G   0B   -2

Available storage:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   44G   29G  60% /
...

This probably happened at the same time as the GH runner image change discussed in this issue.

ca-scribner commented 7 months ago

FWIW, I've found that the jlumbroso/free-disk-space action with default settings works better, leaving the runner with ~45 GB free after execution. An example of this is in kfp's tests.
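
For reference, using it with its defaults is a one-liner in the workflow; the pinned ref below is an assumption rather than what the kfp tests actually pin:

```yaml
- name: Free disk space
  # Runs with the action's default settings (removes preinstalled
  # toolchains, docker images, swap, etc. to free space on /).
  uses: jlumbroso/free-disk-space@main
```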

DnPlas commented 5 months ago

I haven't seen this issue lately, and we have changed a lot of the CIs around. I will close it because all our CIs that use the maximise runner space action are no longer failing (see all the attached PRs and commits).