Closed: jneira closed this issue 3 years ago
Hello @jneira, it is expected behavior. Based on the official documentation, hosted agents provide at least 10 GB of storage for your source and build outputs. At the beginning of your build, there are 12 GB of free space.
As a possible workaround, you can remove some of the pre-installed software at runtime. For example, these commands will release 5+ GB of free space:
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
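A slightly more defensive variant of the two commands above, assuming the paths may or may not exist on a given image (`AGENT_TOOLSDIRECTORY` is only set on hosted agents, so it is defaulted here):

```shell
# Report free space, remove the large pre-installed trees only when they
# actually exist, then report again.
df -h /
for d in /usr/local/share/boost "${AGENT_TOOLSDIRECTORY:-}"; do
  if [ -n "$d" ] && [ -d "$d" ]; then
    rm -rf "$d" 2>/dev/null || sudo rm -rf "$d"
  fi
done
df -h /
```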
@maxim-lobanov thanks for the clarification, I didn't know about that limit, but I have to admit it is reasonable. Haskell builds can take a lot of space, more than usual for other languages.
I can work around the issue using the preinstalled `ghc` compilers, but I was afraid that, if the disk space continues decreasing, that workaround will stop working.
Thanks again for the quick response.
@maxim-lobanov thanks a lot for posting this workaround.
my tests have started to fail as well because of this issue. In my case, I'm running the tests in a docker container. If anybody else is in the same situation, there is a `target: host` flag to run things on the host, instead of the container.
EDIT: this is the change: https://github.com/apache/flink/commit/cb8ae2892c37dd37431b1b56f96805a3dee0335d This is my "clean up script": https://github.com/apache/flink/blob/master/tools/azure-pipelines/free_disk_space.sh
So I'm pretty sure this is an actual new bug. I started running into it on Ubuntu 18.04 just a few days ago. If you look at the output of df, you will notice that the build isn't using any disk space at all on /mnt, which is where most of the disk space is available. I am guessing that /home/runner/work used to be located on /mnt, but now is being left on /, and so we have gone from having 14GB for source + build outputs and a few gigabytes for extra software we install to having only a few gigabytes total for the entire build operation. I can manually move my builds to try to use space on /mnt, but it seems like this really did change this week, I guess first for Ubuntu 16.04 and soon thereafter for Ubuntu 18.04, and that it would be better if the runner was automatically placed on /mnt. (I wish I could re-open this bug, but sadly I can't; if no one sees this comment I guess I'll open a new bug and reference this one? @maxim-lobanov)
@alepauly, Does it make sense to extend disk space on Ubuntu 16.04/18.04 from 84gb to at least 128gb?
Ubuntu 16.04:
/dev/sda1 84G 75G 8.8G 90% /
Ubuntu 18.04:
/dev/sda1 84G 75G 8.3G 91% /
I can manually move my builds to try to use space on /mnt
@saurik Maybe I did something wrong, but I tried to use /mnt for placing the files of the Haskell build tool `stack`, and it failed (`stack` executions) due to permission issues.
So I'm pretty sure this is an actual new bug. I started running into it on Ubuntu 18.04 just a few days ago.
We saw the same thing for pyhf. We did a hack fix in pyhf PR 819 by running `apt-get clean` on all of the Ubuntu jobs before trying to install anything (which worked), but this came up within the last week after never having been an issue in the past (and none of our dependencies changing size significantly).
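The fix described above can be sketched as a single step that also reports how much space the cache cleanup reclaimed (reconstructed from the description here, not copied from the pyhf PR):

```shell
# Drop the apt package cache and report the space reclaimed on /.
before=$(df --output=avail -k / | tail -1)
sudo apt-get clean 2>/dev/null || apt-get clean 2>/dev/null || true  # tolerate non-sudo / non-apt hosts
after=$(df --output=avail -k / | tail -1)
echo "reclaimed: $((after - before)) KB"
```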
FWIW the same thing is happening on 18.04. I have a Docker build that has been working for months that stopped working last week.
@jneira I'd totally be willing to believe /mnt isn't directly usable; like, I wasn't saying you could definitely solve the problem by using /mnt: what I was even saying is that I purposefully didn't try, as this was previously working and I expect that any temporary workaround to manually try to move something to /mnt, assuming it worked, would break later. Like, maybe it would be less confusing if I just talked about /dev/sdb1 instead of /mnt, as that partition could potentially have been mounted to /home or something before: I have no clue what the previous configuration of this system was where we had lots and lots more space; but I've seen lots of people talk about these machines having 14GB of secondary disk available and GitHub saying we have at least 10 GB of storage for code and build outputs, yet for that all to be true we would have to be working on that secondary disk, not the primary disk we are currently defaulting to.
@alepauly, Does it make sense to extend disk space on Ubuntu 16.04/18.04 from 84gb to at least 128gb?
@alcheb - Unfortunately we can't easily do that at this time.
Thanks everyone (@saurik, @Chuxel, @matthewfeickert, @jneira, @rmetzger) for the input and apologies for the pain this is causing! Seems to me the simplest workaround is to `apt clean`, which should be pretty fast and reclaim quite a bit of space.
We're looking for mitigations we can apply quickly and safely but in the meantime, adding the workaround manually is your best bet to unblock your workflows.
We'll post an update here as soon as we get something rolling.
I can agree with you all :) Our build was running smoothly for a couple of weeks / months. We were using /mnt as the docker root. We are running a really simple minikube on runners, and now we are not able to even initialize the first minikube log. Same disk usage, about 91%. The /mnt disk is for some reason not usable now.
/mnt disk is for some reason not usable now.
@obabec, we'll look into this.
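For anyone reproducing the /mnt-as-docker-root setup mentioned above, the usual mechanism is dockerd's `data-root` option. A harmless sketch that writes the config fragment to a temp file (on a real runner it belongs in /etc/docker/daemon.json, followed by a daemon restart):

```shell
# Point Docker's data directory at the large /mnt disk. A temp file is used
# here instead of /etc/docker/daemon.json so the sketch has no side effects.
cfg_file="$(mktemp)"
printf '{\n  "data-root": "/mnt/docker"\n}\n' > "$cfg_file"
cat "$cfg_file"
```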
Seems to me the simplest workaround is to `apt clean`, which should be pretty fast and reclaim quite a bit of space.
Thanks very much for the reply and status update @alepauly. This is useful to know how we can proceed for the time being (for the pyhf dev team we're already doing this). :+1:
I'll add that, as a user naive to all the intricacies of what is actually happening, it was very strange that this is happening only to the ubuntu-latest builds and that all of our macos-latest builds are entirely unaffected. It was for this reason that the pyhf dev team viewed this as a bug on the GitHub side of things and not that we had somehow started exceeding our allotted space mysteriously.
@matthewfeickert fwiw, I would also view it as a bug - because you're not exceeding the space we make available, and because things all of a sudden changed and broke you. We try hard for this not to happen but it happens sometimes. We'll figure out a solution as soon as possible (probably will take a day or two to replicate everywhere) so you can remove the workaround.
We have the same issue with react-native / expo builds using https://github.com/expo/turtle. A decent amount of disk space is taken up by NPM dependencies, but a huge part is generated during the build (build tools, binaries, expo shell app, ...) that we don't have any control over: https://github.com/expo/turtle/issues/213
As app developers, we don't want to manage any VMs (i.e. self-hosted runners). Instead, we would prefer trading a few more build minutes for more disk space.💰💰💰
We'll figure out a solution as soon as possible (probably will take a day or two to replicate everywhere) so you can remove the workaround.
Sounds great. Thanks very much for responding quickly on this and for all the hard work!
We've started a rollback to the previous version of the VM images used for the virtual-environments in both Ubuntu 16.04 and Ubuntu 18.04. This should take from a few hours to a day, please keep us posted if you don't see mitigation after that. We'll continue working on the fixes so we can roll out the updates soon after the mitigation.
I reopened an orphaned issue: https://github.com/microsoft/azure-pipelines-image-generation/issues/1242 as #751
Also filed https://github.com/actions/virtual-environments/issues/752 to catch this ahead of time in the future
From this past week (so where we didn't have space):
Filesystem 1K-blocks Used Available Use% Mounted on
udev 3543000 0 3543000 0% /dev
tmpfs 711216 960 710256 1% /run
/dev/sda1 87218124 78537516 8664224 91% /
tmpfs 3556064 8 3556056 1% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 3556064 0 3556064 0% /sys/fs/cgroup
/dev/loop0 96128 96128 0 100% /snap/core/8935
/dev/loop1 40320 40320 0 100% /snap/hub/43
/dev/sda15 106858 3668 103190 4% /boot/efi
/dev/sdb1 14383048 40988 13591728 1% /mnt
After this new rollback (so what this was before last week):
Filesystem 1K-blocks Used Available Use% Mounted on
udev 3543004 0 3543004 0% /dev
tmpfs 711224 956 710268 1% /run
/dev/sda1 87218124 73920632 13281108 85% /
tmpfs 3556108 8 3556100 1% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 3556108 0 3556108 0% /sys/fs/cgroup
/dev/loop0 40320 40320 0 100% /snap/hub/43
/dev/loop1 96128 96128 0 100% /snap/core/8935
/dev/sda15 106858 3668 103190 4% /boot/efi
/dev/sdb1 14383048 40988 13591728 1% /mnt
OK, so notably we suddenly had 4.6GB less disk space on /.
The GitHub people probably already know what is going on, but for the rest of us trying to come up with theories, here is mine: last week, as part of https://github.com/actions/virtual-environments/commit/8abd45c3a8c2f777498e0df36853faf163558b15 (merging PR #711 closing issue #643), GitHub added a side-by-side installation of the Android NDK r20. These alternative NDK versions go in folders such as $ANDROID_HOME/ndk/20.0.5594570 and complement the "default" NDK in $ANDROID_HOME/ndk-bundle that always tracks the latest release (which is currently r21). This alone used an additional 3.8GB (and then the rest was likely some other random stuff that got added or changed).
Given that there is supposedly 14GB of disk space sitting on /dev/sdb1 as part of the local disk, it would make the most sense to me (assuming this is possible, and given the person who said they were previously using /mnt successfully it sounds like it should be) to put the runner home directory there, so that the default code checkout process--as well as many of the "user" package managers, including from npm, dart, rust, and ghc--would put all of their files there: that disk feels more like "our space" than what is left over on / and is thereby likely going to be much less variable over time.
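A minimal sketch of that suggestion, assuming /mnt may not be writable without root (the permission issue reported earlier in this thread), with a fallback so the step never hard-fails:

```shell
# Try to claim a work area on the large /mnt disk; fall back to a temp dir
# when /mnt is root-owned and sudo is not an option.
work=/mnt/work
if ! mkdir -p "$work" 2>/dev/null; then
  work="$(mktemp -d)"
  echo "note: /mnt not writable; using $work instead"
fi
echo "work dir: $work"
```

A build tool's cache or output flag would then be pointed at `$work`; the exact flag differs per tool.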
@obabec could you please provide more details about /mnt being inaccessible? I've tried to create files and copy directories there and everything worked for me (with sudo).
I'm hitting this issue consistently. My builds are running inside a container with a ~3g image, df reports ~7.5g of free space before I start the build, and the build tree itself takes up ~1g when complete. But the build fails regularly with "no space left on device" when it should never be using anywhere close to the limit. If I were running directly on the VM then I could just run some apt cleanup steps to remove a bunch of packages I don't use and free up space, but since the actions run in a container then AFAICT there's no way to run a cleanup step directly on the VM prior to the in-container steps. Or am I wrong on that: is there a way to directly run some steps on the host and others in the container?
Perhaps it would be worth having a distinct container host image that's a bare-bones install with just enough to run docker containers instead of using the full-blown 80g ubuntu image?
@chuckatkins Does it still happen after the image rollback? There should be more than 10 GB available now.
@saurik your analysis is pretty spot on 🎯 The work dir should be under /mnt, and we are planning to move it there. But because it's a breaking change we'll have to announce it first and give time to people that might depend on a hard-coded /home/... path to adjust.
@alepauly Why not symlink /home to /mnt/home, or mount /home to space on sdb1 (maybe using btrfs/lvm or whatever it is that lets you "share" partition space among multiple mount points), so this becomes more an implementation detail of the container than some filesystem change? FWIW, I would personally prefer to see folders like $HOME/.pub-cache (Dart) and $HOME/.cargo (Rust) end up on sdb1, not just the work directory (though maybe you consider all of /home/runner to be the "work directory").
(BTW, I just noticed that time is getting away from me--I am going to blame the lack of any real external time markers due to the pandemic, even though I honestly always had this problem ;P--and that commit to add the NDK "last week" happened after this bug was opened, apparently now well over a week ago; so my guess that it was the r20 NDK that added the brunt of the loss can't be right? Worse, that would mean that after that NDK would have been added--I say "would have been as I see that was now reverted"--we would be down to less than 5GB free on that partition... there really isn't that much disk space available to play with there for the base installed system software.)
@alepauly Why not symlink /home to /mnt/home
yeah, we were actually discussing doing bind mounts; the only minor risk is collision with anyone already using /mnt/<whatever>. Thanks for sharing your ideas!
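The bind-mount idea can be demonstrated safely on scratch directories rather than /home or /mnt; it needs root, so this sketch degrades to a message otherwise:

```shell
# Bind one temp directory over another: after the mount, both paths reach the
# same files, which is how /mnt space could be exposed under /home.
src="$(mktemp -d)"; dst="$(mktemp -d)"
touch "$src/marker"
if mount --bind "$src" "$dst" 2>/dev/null; then
  ls "$dst"        # shows "marker" via the second path
  umount "$dst"
else
  echo "skipped (bind mount needs root)"
fi
```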
and that commit to add the NDK "last week" happened after this bug was opened, apparently now well over a week ago; so my guess that it was the r20 NDK that added the brunt of the loss can't be right?
Good point, @miketimofeev has been looking at it and he might have a better idea of other things that came in earlier and that affected space usage. We'd rather not revert a lot of what we already shipped since some workflows might have already taken a dependency, but since the NDK v20 was very recent and a major contributor, we backed it out.
I've also seen this, see https://github.com/cachix/cachix-action/issues/43
Is there any progress on fixing this? I'm currently deleting /opt, but I'm getting a lot of support requests from people needing to work around this issue.
@domenkozar , how much space do you have right now at the build start? It should be 14 GB. Please let us know if you have less
/dev/sda1 84G 71G 13G 86% /
Thanks! Based on documentation, runners should have 14 GB of free disk space and currently, we are deploying the new image where 14 GB should be available.
@maxim-lobanov What's the reason for / to have 14GB and /mnt to have 14GB? It would be preferable to have both combined on /.
Hi everyone! We've switched working directory to /mnt and started to propagate the changes throughout the environments. It will take about 3-4 days.
@domenkozar Unfortunately, it's not possible in the current virtualization scheme.
@maxim-lobanov What's the reason for / to have 14GB and /mnt to have 14GB? It would be preferable to have both combined on /.
@domenkozar yeah, the reason is that the /mnt space is on a temp disk added to the vm, it's not the same device.
Unfortunately, we've faced unpredictable issues when /mnt is used as a working directory: https://github.com/actions/virtual-environments/issues/922. We're going to roll back the changes today.
I'd love it if you found a way to just increase the root disk space; from the user's perspective that would be a win-win.
I do understand you have your own design restrictions, keeping my fingers crossed.
Unfortunately, we've faced unpredictable issues when /mnt is used as a working directory (#922). We're going to roll back the changes today.
How long does the rollback operation take to complete? When will it take effect on GitHub Actions?
@1orz it should take about 4-6 more hours
@miketimofeev From the linked GitHub Community Forum report, it seems like /home/runner/work got put on the new partition while /home/runner wasn't. Seeing this now (I failed to predict this), that does not surprise me as something that would break people's assumptions :(... being able to easily move files around within one's own $HOME using the rename() syscall seems very reasonable.
downloading kind from https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64
chmod +x /home/runner/work/_temp/2eecb160-a46e-44cf-9217-e333cb604de4
##[error]EXDEV: cross-device link not permitted, rename '/home/runner/work/_temp/2eecb160-a46e-44cf-9217-e333cb604de4' -> '/home/runner/bin/kind'
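The EXDEV failure above is the rename(2) syscall refusing to cross filesystems; `mv` works in the same situation because it detects EXDEV and falls back to copy+unlink. A small demonstration, assuming /dev/shm and /tmp are distinct filesystems (typical on Linux):

```shell
a="/dev/shm/exdev-demo.$$"; b="/tmp/exdev-demo.$$"
[ -d /dev/shm ] || a="/tmp/exdev-demo-a.$$"   # fallback: rename then simply succeeds
echo data > "$a"
# os.rename maps directly to rename(2), so it reproduces the runner's error.
python3 -c 'import os, sys, errno
try:
    os.rename(sys.argv[1], sys.argv[2])
    print("rename ok (same filesystem)")
except OSError as e:
    print("rename failed:", errno.errorcode[e.errno])' "$a" "$b"
mv "$a" "$b" 2>/dev/null || true   # mv handles the cross-device case
rm -f "$a" "$b"
```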
My recommendation here would be to make /home/runner or /home be the mount point. I would be absolutely shocked if that broke anyone: /home is often mounted as its own partition on normal machines (so /home should definitely work), and there is a reasonable assumption that you can't access other people's home directories (and so can't be trying to move a file between them).
(FWIW, I've had /home/saurik on my machines on a separate partition from /home at least 15 years now without ever running into an issue like this, but seeing the error I'm like "ah yeah, ok: if I were to have a random subfolder of my home directory in a separate partition from the rest of it that would certainly end up driving me crazy and would break all kinds of things I do on a regular basis".)
@saurik thanks for your suggestions! The only reason we haven't done it with the home directory is that it's more complicated in terms of VM creation logic. We'll try to look at it one more time. Another thought is to move /swap to /mnt.
I am closing this issue since we currently have more than 14 GB on the Ubuntu images and the initial issue should be resolved. We will continue to work on freeing more disk space on the Ubuntu images to accommodate new feature requests. Please let us know if you have any concerns / suggestions.
Describe the bug
Jobs running on Ubuntu-16.04 images are starting to throw errors related to low disk space.
Area for Triage: Servers
Question, Bug, or Feature?: Bug
Virtual environments affected
Expected behavior
The disk should have enough free space for the custom user software needed for building.
Actual behavior
The disk usage (du -h) is:
Just before the error due to not enough disk space:
The concrete error is:
But the specific command is not important; any command that needs disk space will fail.