actions / runner-images

GitHub Actions runner images
MIT License
9.17k stars 2.84k forks

Ubuntu images are running out of disk space #709

Closed jneira closed 3 years ago

jneira commented 4 years ago

Describe the bug Jobs running in Ubuntu-16.04 images are starting to throw errors related to low disk space

Area for Triage: Servers

Question, Bug, or Feature?: Bug

Virtual environments affected

Expected behavior The disk should have enough free space for custom user software needed for building.

Actual behavior

  1. We can take this build as an example: https://dev.azure.com/jneira/haskell-ide-engine/_build/results?buildId=758
  2. At the start, after only checking out the project, disk usage (df -h) is:
    Filesystem      Size  Used Avail Use% Mounted on
    udev            3.4G     0  3.4G   0% /dev
    tmpfs           695M  8.9M  686M   2% /run
    /dev/sda1        84G   72G   12G  86% /
    tmpfs           3.4G  8.0K  3.4G   1% /dev/shm
    tmpfs           5.0M     0  5.0M   0% /run/lock
    tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
    /dev/loop0       40M   40M     0 100% /snap/hub/43
    /dev/loop1       94M   94M     0 100% /snap/core/8935
    /dev/sda15      105M  3.6M  101M   4% /boot/efi
    /dev/sdb1        14G   35M   13G   1% /mnt

    Just before the error due to not enough disk space:

    Filesystem      Size  Used Avail Use% Mounted on
    udev            3.4G     0  3.4G   0% /dev
    tmpfs           695M  8.9M  686M   2% /run
    /dev/sda1        84G   80G  3.6G  96% /
    tmpfs           3.4G  8.0K  3.4G   1% /dev/shm
    tmpfs           5.0M     0  5.0M   0% /run/lock
    tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
    /dev/loop0       40M   40M     0 100% /snap/hub/43
    /dev/loop1       94M   94M     0 100% /snap/core/8935
    /dev/sda15      105M  3.6M  101M   4% /boot/efi
    /dev/sdb1        14G   35M   13G   1% /mnt

    The concrete error is:

    Unpacking GHC into /home/vsts/.stack/programs/x86_64-linux/ghc-8.8.1.temp/ ...
    Configuring GHC ...
    Installing GHC ...
    /home/vsts/.stack/programs/x86_64-linux/ghc-8.8.1/lib/ghc-8.8.1/Cabal-3.0.0.0/.copyFile74954-1209.tmp: copyFile: resource exhausted (No space left on device)
    make[1]: *** [install_packages] Error 1
    make: *** [install] Error 2
    Received ExitFailure 2 when running
    Raw command: /usr/bin/make install
    Run from: /home/vsts/.stack/programs/x86_64-linux/ghc-8.8.1.temp/ghc-8.8.1/

    But the specific command is not important: any command that needs disk space will fail
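A diagnostic sketch that could be dropped into a build like the one above: log free space on every mount before and after expensive steps so it is obvious which filesystem fills up. The ~/.stack path is the Haskell tool cache from this report; it is only checked if present.

```shell
# Log free space on all mounts; run before and after expensive build steps.
df -h
# Size of the Haskell tool cache, if it exists (harmless elsewhere).
du -sh "$HOME/.stack" 2>/dev/null || true
```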

maxim-lobanov commented 4 years ago

Hello @jneira , it is expected behavior. Based on the official documentation, hosted agents provide at least 10 GB of storage for your source and build outputs. At the beginning of your build, there are 12 GB of free space.

As a possible workaround, you can remove part of the pre-installed software at runtime. For example, these commands will release 5+ GB of free space:

sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
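A hedged sketch of a cleanup step built around the two commands above, with a before/after measurement. The helper names free_kb and maybe_rm are invented here; the paths are the ones suggested in this thread, and removing a toolchain your build actually needs will of course break it.

```shell
#!/usr/bin/env bash
set -u

free_kb() {
  # Kilobytes available on the root filesystem.
  df --output=avail -k / | tail -n1 | tr -d ' '
}

maybe_rm() {
  # Remove a path only if it exists, using sudo when available.
  [ -e "$1" ] || return 0
  if command -v sudo >/dev/null 2>&1; then sudo rm -rf "$1"; else rm -rf "$1"; fi
}

before=$(free_kb)
maybe_rm /usr/local/share/boost              # preinstalled Boost sources
[ -n "${AGENT_TOOLSDIRECTORY:-}" ] && maybe_rm "$AGENT_TOOLSDIRECTORY"  # hosted tool cache
after=$(free_kb)
echo "freed roughly $(( (after - before) / 1024 )) MB"
```

On a hosted runner the reclaimed amount depends on which of these directories are present on the image at the time.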
jneira commented 4 years ago

@maxim-lobanov thanks for the clarification, I didn't know about that limit, but I have to admit it is reasonable. Haskell builds can take a lot of space, more than usual for other languages.

I can work around the issue using the preinstalled GHC compilers, but I was afraid that, if the disk space continues decreasing, that workaround will stop working.

Thanks again for the quick response.

rmetzger commented 4 years ago

@maxim-lobanov thanks a lot for posting this workaround. My tests have started to fail as well because of this issue. In my case, I'm running the tests in a Docker container. If anybody else is in the same situation, there is a target: host flag to run things on the host instead of the container.

EDIT: this is the change: https://github.com/apache/flink/commit/cb8ae2892c37dd37431b1b56f96805a3dee0335d This is my "clean up script": https://github.com/apache/flink/blob/master/tools/azure-pipelines/free_disk_space.sh

saurik commented 4 years ago

So I'm pretty sure this is an actual new bug. I started running into it on Ubuntu 18.04 just a few days ago. If you look at the output of df, you will notice that the build isn't using any disk space at all on /mnt, which is where most of the disk space is available. I am guessing that /home/runner/work used to be located on /mnt but is now being left on /, so we have gone from having 14GB for source + build outputs plus a few gigabytes for extra software we install, to having only a few gigabytes total for the entire build operation. I can manually move my builds to try to use space on /mnt, but it seems like this really did change this week, I guess first for Ubuntu 16.04 and soon thereafter for Ubuntu 18.04, and it would be better if the runner were automatically placed on /mnt. (I wish I could re-open this bug, but sadly I can't; if no one sees this comment I guess I'll open a new bug and reference this one? @maxim-lobanov)
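A quick way (sketch) to check this observation on any runner: ask which device backs the workspace. The path /home/runner/work is the default on GitHub-hosted runners; the fallback to the current directory is for running this anywhere else.

```shell
# Show which filesystem backs the runner work directory (or, failing that,
# the current directory).
df -h /home/runner/work 2>/dev/null || df -h "$PWD"
# findmnt -T resolves a path to its mount point and source device.
findmnt -T "${GITHUB_WORKSPACE:-$PWD}" || true
```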

al-cheb commented 4 years ago

@alepauly, Does it make sense to extend disk space on Ubuntu 16.04/18.04 from 84gb to at least 128gb?

Ubuntu 16.04: /dev/sda1 84G 75G 8.8G 90% /

Ubuntu 18.04: /dev/sda1 84G 75G 8.3G 91% /

jneira commented 4 years ago

I can manually move my builds to try to use space on /mnt

@saurik Maybe I did something wrong, but I tried to use /mnt for placing the files of the Haskell build tool stack, and it failed (stack executions) due to permission issues.

matthewfeickert commented 4 years ago

So I'm pretty sure this is an actual new bug. I started running into it on Ubuntu 18.04 just a few days ago.

We saw the same thing for pyhf. We did a hack fix in pyhf PR 819 by running apt-get clean on all of the Ubuntu jobs before trying to install anything (which worked), but this came up within the last week after never having been an issue in the past (and none of our dependencies changed size significantly).
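A sketch of this style of mitigation: empty the apt caches before installing anything. On Debian/Ubuntu, apt-get clean removes the downloaded package archives under /var/cache/apt/archives; the du line shows how much that cache was holding.

```shell
# How much space is the apt package cache using?
du -sh /var/cache/apt/archives 2>/dev/null || true

# Empty it (sudo when available; ignore failures on non-Debian systems).
if command -v sudo >/dev/null 2>&1; then
  sudo apt-get clean || true
else
  apt-get clean 2>/dev/null || true
fi

# Confirm the result on the root filesystem.
df -h /
```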

Chuxel commented 4 years ago

FWIW the same thing is happening on 18.04. I have a Docker build that has been working for months that stopped working last week.

1_Build and push (1, 4).txt

saurik commented 4 years ago

@jneira I'd totally be willing to believe /mnt isn't directly usable; I wasn't saying you could definitely solve the problem by using /mnt: what I was saying is that I purposefully didn't try, as this was previously working, and I expect that any temporary workaround that manually moves something to /mnt, assuming it worked, would break later. Maybe it would be less confusing if I just talked about /dev/sdb1 instead of /mnt, as that partition could potentially have been mounted to /home or something before: I have no clue what the previous configuration of this system, where we had lots and lots more space, was; but I've seen lots of people talk about these machines having 14GB of secondary disk available, and GitHub saying we have at least 10 GB of storage for code and build outputs, yet for all of that to be true we would have to be working on that secondary disk, not the primary disk we are currently defaulting to.

alepauly commented 4 years ago

@alepauly, Does it make sense to extend disk space on Ubuntu 16.04/18.04 from 84gb to at least 128gb?

@al-cheb - Unfortunately, we can't easily do that at this time.

Thanks everyone(@saurik, @Chuxel, @matthewfeickert, @jneira, @rmetzger) for the input and apologies for the pain this is causing! Seems to me the simplest workaround is to apt clean, which should be pretty fast and reclaim quite a bit of space.

We're looking for mitigations we can apply quickly and safely but in the meantime, adding the workaround manually is your best bet to unblock your workflows.

We'll post an update here as soon as we get something rolling.

obabec commented 4 years ago

I can agree with all of you :) Our build was running smoothly for a couple of weeks/months. We were using /mnt as the Docker root. We are running a really simple minikube on runners, and now we are not able to even initialize the first minikube log. Same disk usage, about 91%. The /mnt disk is for some reason not usable now.

alepauly commented 4 years ago

/mnt disk is for some reason not usable now.

@obabec, we'll look into this.

matthewfeickert commented 4 years ago

Seems to me the simplest workaround is to apt clean, which should be pretty fast and reclaim quite a bit of space.

Thanks very much for the reply and status update @alepauly. This is useful to know how we can proceed for the time being (for the pyhf dev team we're already doing this). :+1:

I'll add that, as a user naive to all the intricacies of what is actually happening, it was very strange that this happened only to the ubuntu-latest builds while all of our macos-latest builds were entirely unaffected. It was for this reason that the pyhf dev team viewed this as a bug on the GitHub side of things, not that we had somehow mysteriously started exceeding our allotted space.

alepauly commented 4 years ago

@matthewfeickert fwiw, I would also view it as a bug: you're not exceeding the space we make available, and things all of a sudden changed and broke you. We try hard for this not to happen, but it happens sometimes. We'll figure out a solution as soon as possible (it will probably take a day or two to replicate everywhere) so you can remove the workaround.

MitchK commented 4 years ago

We have the same issue with react-native / expo builds using https://github.com/expo/turtle. A decent amount of disk space is taken up by NPM dependencies, but a huge part is generated during the build (build tools, binaries, expo shell app, ...) that we don't have any control over: https://github.com/expo/turtle/issues/213

As app developers, we don't want to manage any VMs (i.e. self-hosted runners). Instead, we would prefer trading a few more build minutes for more disk space.💰💰💰

matthewfeickert commented 4 years ago

We'll figure out a solution as soon as possible (probably will take a day or two to replicate everywhere) so you can remove the workaround.

Sounds great. Thanks very much for responding quickly on this and for all the hard work!

alepauly commented 4 years ago

We've started a rollback to the previous version of the VM images used for the virtual environments, for both Ubuntu 16.04 and Ubuntu 18.04. This should take from a few hours to a day; please keep us posted if you don't see mitigation after that. We'll continue working on the fixes so we can roll out the updates soon after the mitigation.

congyiwu commented 4 years ago

I reopened an orphaned issue: https://github.com/microsoft/azure-pipelines-image-generation/issues/1242 as #751

congyiwu commented 4 years ago

Also filed https://github.com/actions/virtual-environments/issues/752 to catch this ahead of time in the future

saurik commented 4 years ago

From this past week (so where we didn't have space):

Filesystem     1K-blocks     Used Available Use% Mounted on
udev             3543000        0   3543000   0% /dev
tmpfs             711216      960    710256   1% /run
/dev/sda1       87218124 78537516   8664224  91% /
tmpfs            3556064        8   3556056   1% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs            3556064        0   3556064   0% /sys/fs/cgroup
/dev/loop0         96128    96128         0 100% /snap/core/8935
/dev/loop1         40320    40320         0 100% /snap/hub/43
/dev/sda15        106858     3668    103190   4% /boot/efi
/dev/sdb1       14383048    40988  13591728   1% /mnt

After this new rollback (so what this was before last week):

Filesystem     1K-blocks     Used Available Use% Mounted on
udev             3543004        0   3543004   0% /dev
tmpfs             711224      956    710268   1% /run
/dev/sda1       87218124 73920632  13281108  85% /
tmpfs            3556108        8   3556100   1% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs            3556108        0   3556108   0% /sys/fs/cgroup
/dev/loop0         40320    40320         0 100% /snap/hub/43
/dev/loop1         96128    96128         0 100% /snap/core/8935
/dev/sda15        106858     3668    103190   4% /boot/efi
/dev/sdb1       14383048    40988  13591728   1% /mnt

OK, so notably we suddenly had 4.6GB less disk space on /.

The GitHub people probably already know what is going on, but for the rest of us trying to come up with theories, here is mine: last week, as part of https://github.com/actions/virtual-environments/commit/8abd45c3a8c2f777498e0df36853faf163558b15 (merging PR #711, closing issue #643), GitHub added a side-by-side installation of the Android NDK r20. These alternative NDK versions go in folders such as $ANDROID_HOME/ndk/20.0.5594570 and complement the "default" NDK in $ANDROID_HOME/ndk-bundle that always tracks the latest release (which is currently r21). This alone used an additional 3.8GB (and then the rest was likely some other random stuff that got added or changed).

Given that there is supposedly 14GB of disk space sitting on /dev/sdb1 as part of the local disk, it would make the most sense to me (assuming this is possible, and given the person who said they were previously using /mnt successfully, it sounds like it should be) to put the runner home directory there, so that the default code checkout process--as well as many of the "user" package managers, including npm, dart, rust, and ghc--would put all of their files there: that disk feels more like "our space" than what is left over on /, and is thereby likely to be much less variable over time.

miketimofeev commented 4 years ago

@obabec could you please provide more details about /mnt being inaccessible? I've tried to create files and copy directories there and everything worked for me (with sudo).

chuckatkins commented 4 years ago

I'm hitting this issue consistently. My builds run inside a container with a ~3 GB image; df reports ~7.5 GB of free space before I start the build, and the build tree itself takes up ~1 GB when complete. But the build fails regularly with "no space left on device" when it should never be using anywhere close to the limit. If I were running directly on the VM, I could just run some apt cleanup steps to remove a bunch of packages I don't use and free up space, but since the actions run in a container, AFAICT there's no way to run a cleanup step directly on the VM prior to the in-container steps. Or am I wrong on that: is there a way to run some steps directly on the host and others in the container?

Perhaps it would be worth having a distinct container host image that's a bare-bones install with just enough to run Docker containers, instead of using the full-blown 80 GB Ubuntu image?
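One workflow-level answer to the question above (a hedged sketch, not an officially documented pattern): skip the `container:` key and drive the container manually with `docker run`, so earlier steps in the same job run on the host. The image name `my-build-image` and the `make` command are placeholders for whatever the real build uses.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Free host disk space       # runs on the VM, not in a container
        run: |
          sudo rm -rf /usr/local/share/boost "$AGENT_TOOLSDIRECTORY"
          sudo apt-get clean
      - uses: actions/checkout@v2
      - name: Build inside the container  # container is started explicitly
        run: docker run --rm -v "$PWD:/src" -w /src my-build-image make
```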

miketimofeev commented 4 years ago

@chuckatkins Does it still happen after the image rollback? There should be more than 10 GB available now.

alepauly commented 4 years ago

@saurik your analysis is pretty spot on 🎯 The work dir should be under /mnt, and we are planning to move it there. But because it's a breaking change, we'll have to announce it first and give time for people who might depend on a hard-coded /home/... path to adjust.

saurik commented 4 years ago

@alepauly Why not symlink /home to /mnt/home, or mount /home to space on sdb1 (maybe using btrfs/lvm or whatever it is that lets you "share" partition space among multiple mount points), so this becomes more an implementation detail of the container than some filesystem change? FWIW, I would personally prefer to see folders like $HOME/.pub-cache (Dart) and $HOME/.cargo (Rust) end up on sdb1, not just the work directory (though maybe you consider all of /home/runner to be the "work directory").
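An illustrative, sandboxed sketch of the symlink variant suggested above, played out in scratch directories (no sudo) so it is safe to run anywhere: relocate a home directory onto a bigger "disk" and leave a symlink behind, so hard-coded /home/... paths keep working. The two mktemp directories stand in for the real partitions; this is not how the images were actually configured.

```shell
big_disk=$(mktemp -d)   # stands in for /mnt (the /dev/sdb1 mount)
root_fs=$(mktemp -d)    # stands in for /

mkdir -p "$root_fs/home/runner/.cargo"
echo data > "$root_fs/home/runner/.cargo/cache"

# Move the home directory to the big disk and symlink it back.
mv "$root_fs/home/runner" "$big_disk/runner"
ln -s "$big_disk/runner" "$root_fs/home/runner"

# The old path still works, transparently backed by the bigger disk.
cat "$root_fs/home/runner/.cargo/cache"   # -> data
```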

saurik commented 4 years ago

(BTW, I just noticed that time is getting away from me--I am going to blame the lack of any real external time markers due to the pandemic, even though I honestly always had this problem ;P--and that the commit to add the NDK "last week" happened after this bug was opened, apparently now well over a week ago; so my guess that it was the r20 NDK that added the brunt of the loss can't be right? Worse, that would mean that after that NDK was added--I say "was" as I see that has now been reverted--we would be down to less than 5GB free on that partition... there really isn't that much disk space available to play with there for the base installed system software.)

alepauly commented 4 years ago

@alepauly Why not symlink /home to /mnt/home

Yeah, we were actually discussing doing bind mounts; the only minor risk is collision with anyone already using /mnt/<whatever>. Thanks for sharing your ideas!

alepauly commented 4 years ago

and that commit to add the NDK "last week" happened after this bug was opened, apparently now well over a week ago; so my guess that it was the r20 NDK that added the brunt of the loss can't be right?

Good point, @miketimofeev has been looking at it and he might have a better idea of other things that came in earlier and affected space usage. We'd rather not revert a lot of what we already shipped, since some workflows might have already taken a dependency, but since the NDK r20 was very recent and a major contributor, we backed it out.

domenkozar commented 4 years ago

I've also seen this, see https://github.com/cachix/cachix-action/issues/43

domenkozar commented 4 years ago

Is there some progress on fixing this? I'm currently deleting /opt, but I'm getting a lot of support requests from people trying to work around this issue.

maxim-lobanov commented 4 years ago

@domenkozar , how much space do you have right now at the start of the build? It should be 14 GB. Please let us know if you have less.

domenkozar commented 4 years ago

/dev/sda1 84G 71G 13G 86% /

maxim-lobanov commented 4 years ago

Thanks! Based on the documentation, runners should have 14 GB of free disk space, and we are currently deploying a new image where 14 GB should be available.

domenkozar commented 4 years ago

@maxim-lobanov What's the reason for / to have 14GB and /mnt to have 14GB? It would be preferable to have both combined on /.

miketimofeev commented 4 years ago

Hi everyone! We've switched the working directory to /mnt and started propagating the changes throughout the environments. It will take about 3-4 days.

miketimofeev commented 4 years ago

@domenkozar Unfortunately, it's not possible in the current virtualization scheme.

alepauly commented 4 years ago

@maxim-lobanov What's the reason for / to have 14GB and /mnt to have 14GB? It would be preferable to have both combined on /.

@domenkozar yeah, the reason is that the /mnt space is on a temporary disk added to the VM; it's not the same device.

miketimofeev commented 3 years ago

Unfortunately, we've faced unpredictable issues when /mnt is used as the working directory (https://github.com/actions/virtual-environments/issues/922). We're going to roll back the changes today.

domenkozar commented 3 years ago

I'd love it if you found a way to just increase the root disk space; from a user perspective that would be a win-win.

I do understand you have your own design restrictions; keeping my fingers crossed.

1orz commented 3 years ago

Unfortunately, we've faced unpredictable issues when /mnt is used as a working directory #922. We're going to rollback the changes today.

How long does the rollback operation take to complete? When will it take effect on GitHub Actions?

miketimofeev commented 3 years ago

@1orz it should take about 4-6 more hours

saurik commented 3 years ago

@miketimofeev From the linked GitHub Community Forum report, it seems like /home/runner/work got put on the new partition while /home/runner wasn't. Seeing this now (I failed to predict it), it does not surprise me that this broke people's assumptions :(... being able to easily move files around within one's own $HOME using the rename() syscall seems very reasonable.

downloading kind from https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64
chmod +x /home/runner/work/_temp/2eecb160-a46e-44cf-9217-e333cb604de4
##[error]EXDEV: cross-device link not permitted, rename '/home/runner/work/_temp/2eecb160-a46e-44cf-9217-e333cb604de4' -> '/home/runner/bin/kind'

My recommendation here would be to make /home/runner or /home the mountpoint. I would be absolutely shocked if that broke anyone: /home is often mounted as its own partition on normal machines (so /home should definitely work), and there is a reasonable assumption that you can't access other people's home directories (and so can't be trying to move a file between them).

(FWIW, I've had /home/saurik on my machines on a separate partition from /home for at least 15 years now without ever running into an issue like this, but seeing the error I'm like "ah yeah, OK: if I had a random subfolder of my home directory on a separate partition from the rest of it, that would certainly end up driving me crazy and would break all kinds of things I do on a regular basis".)
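A sketch of the portable fallback for the EXDEV failure above: rename(2) cannot cross filesystems, so the fix is copy + remove when a plain rename would fail. `mv` already does this internally; tools that call rename() directly (as the runner did here) do not. The safe_move helper is a name invented for this sketch, demonstrated on temp files.

```shell
safe_move() {
  # Try a plain move first; on failure (e.g. EXDEV), copy then remove.
  src=$1 dst=$2
  mv -- "$src" "$dst" 2>/dev/null && return 0
  cp -p -- "$src" "$dst" && rm -f -- "$src"
}

dstdir=$(mktemp -d)
tmp=$(mktemp)
echo kind-binary > "$tmp"
safe_move "$tmp" "$dstdir/kind"
cat "$dstdir/kind"   # -> kind-binary
```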

miketimofeev commented 3 years ago

@saurik thanks for your suggestions! The only reason we haven't done it with the home directory is that it's more complicated in terms of VM creation logic. We'll try to look at it one more time. Another thought is to move the swap file onto /mnt.

maxim-lobanov commented 3 years ago

I am closing this issue since we currently have more than 14 GB free on Ubuntu images, and the initial issue should be resolved. We will continue working on freeing more disk space on Ubuntu images; new feature requests are welcome. Please let us know if you have any concerns or suggestions.