filipesilva opened this issue 5 years ago
Sorry for the slow reply on this one, it looks like this one is caused by the lack of a .git
directory in the checkout directory 🤔 This is sometimes caused by a corrupted checkout, I thought we had checks specifically for that, but perhaps not. Let me investigate.
@lox is there anything that I can do to help?
We made some substantial changes to how we recover from checkout failures in 3.6.0; would you be able to update, @filipesilva? We'll collaborate on any issues you see and try to get fixes out ASAP!
@lox I re-provisioned our CI machine using the 3.6.1 agent and will be monitoring builds now to see if it happens again 👍
Heya @lox!
Just looked at our builds on master for the last 10 days. There were 24 builds, one of which failed to checkout: https://buildkite.com/angular/angular/builds/1190#17b7ee20-d02d-489b-a06f-30ae547bcd81
Seems to be a similar message as before:
(lots of these)

```
[2018-12-21T20:08:05Z] warning: failed to remove node_modules/xregexp: Directory not empty
[2018-12-21T20:08:05Z] warning: failed to remove node_modules/yargs/node_modules: Directory not empty
[2018-12-21T20:08:05Z] warning: failed to remove node_modules/zip-dir/node_modules/jszip: Directory not empty
[2018-12-21T20:08:05Z] warning: failed to remove node_modules/zip-dir/node_modules/pako/lib: Directory not empty
[2018-12-21T20:08:05Z] warning: failed to remove node_modules/zip-dir/test/fixtures/sampleZip/dir/deepDir: Directory not empty
[2018-12-21T20:08:05Z] ⚠️ Warning: Checkout failed! Error running `C:\git\bin\git.exe clean -fxdq`: exit status 1 (Attempt 1/3 Retrying in 2s)
[2018-12-21T20:08:05Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular
[2018-12-21T20:08:13Z] > git remote set-url origin https://github.com/angular/angular
[2018-12-21T20:08:13Z] fatal: not a git repository (or any of the parent directories): .git
[2018-12-21T20:08:13Z] ⚠️ Warning: Checkout failed! Error running `C:\git\bin\git.exe remote set-url origin https://github.com/angular/angular`: exit status 128 (Attempt 2/3 Retrying in 2s)
[2018-12-21T20:08:13Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular
[2018-12-21T20:08:15Z] > git clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .
[2018-12-21T20:08:15Z] fatal: destination path '.' already exists and is not an empty directory.
[2018-12-21T20:08:15Z] ⚠️ Warning: Checkout failed! Error running `C:\git\bin\git.exe clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .`: exit status 128 (Attempt 3/3 Retrying in 2s)
[2018-12-21T20:08:15Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular
[2018-12-21T20:08:17Z] 🚨 Error: Error running `C:\git\bin\git.exe clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .`: exit status 128
```
So I think we are still suffering from the failed checkout currently.
Thanks for that @filipesilva! Two issues going on in there, which I want to be careful to distinguish between.
`git clean -fxdq` is failing with lots of `warning: failed to remove node_modules/...: Directory not empty` errors. This will be the root cause. From some Google searches, it looks like this happens when another process is still using that directory. The question is: what could still be using it? Apparently tools like Visual Studio or TortoiseGit sometimes hang onto locks for longer than they should.
I have a few theories about what might be causing this. It could be that a bug in our cancellation process means there are processes hanging around that are still doing things with the `node_modules/` directory. I'm doing some work to make cancellation more robust in https://github.com/buildkite/agent/pull/879 that should significantly help if this theory is right.
The other theory is that something outside of the agent flow is retaining locks on those directories.
If you were inclined, adding a debugging script in a `pre-checkout.bat` hook that ran Handle would be super helpful. I think this would work:

```batch
@echo off
handle.exe /accepteula %BUILDKITE_BUILD_CHECKOUT_PATH%
```
In the output above, we can see the checkout failure, and then the message about removing the checkout:
```
# Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular
```
Then there are errors about the lack of a git repo:
```
fatal: not a git repository
```
This seems to be a bug with the "delete checkout on error before retry" logic we introduced in 3.6.0. I'll follow up on this.
There is definitely some weird stuff going on in there currently @filipesilva. Are you certain there aren't multiple buildkite-agent processes sharing the same checkout path? Can you tell us a bit more about your concurrency model?
Perhaps we should use our cross-process file locking to make double sure that agents aren't sharing a checkout dir. I'll investigate that too.
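For the curious, the shape of that lock could be as simple as an atomic `mkdir` around the checkout path. Here's a rough sketch (illustrative only, not the agent's actual locking implementation; the fallback path below is made up):

```shell
# Illustrative sketch: serialize access to a shared checkout directory with
# an atomic mkdir-based lock. mkdir either creates the directory or fails,
# so only one process at a time can hold the lock.
CHECKOUT_PATH="${BUILDKITE_BUILD_CHECKOUT_PATH:-/tmp/demo-checkout}"
LOCK_DIR="${CHECKOUT_PATH}.lock"

acquire_lock() {
  # Spin until we are the process that successfully creates the lock dir.
  until mkdir "$LOCK_DIR" 2>/dev/null; do
    echo "checkout dir locked by another agent, waiting..."
    sleep 1
  done
}

release_lock() {
  rmdir "$LOCK_DIR"
}

acquire_lock
echo "lock acquired on $CHECKOUT_PATH"
# ... checkout and clean would run here, exclusively ...
release_lock
```

The mkdir trick is portable and needs no extra tooling, though a real implementation would also want stale-lock detection for crashed agents.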
Thanks for getting back to me!
We use dedicated Windows hosts for the Buildkite agents, with nothing else running on them. This script is used for provisioning, and it pretty much follows https://buildkite.com/docs/agent/v3/windows, except that it starts a variable number of agents at the end.
I don't think there are other processes like Visual Studio hanging on, because it's a dedicated machine.
I think each agent is using a different folder mostly because different build folders exist:
```
C:\buildkite-agent\builds>dir
 Volume in drive C has no label.
 Volume Serial Number is 8248-D80A

 Directory of C:\buildkite-agent\builds

12/18/2018  01:06 PM    <DIR>          .
12/18/2018  01:06 PM    <DIR>          ..
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-1
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-2
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-3
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-4
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-5
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-6
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-7
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-8
               0 File(s)              0 bytes
              10 Dir(s)  110,628,233,216 bytes free
```
There are 8 of these directories, but we only have 4 agents running.
I remember there was a point at which I restarted the machine and the old agent processes had not yet timed out, so new directories were taken for the new processes.
Should I be using different agent installs for different agents?
I can add that `pre-checkout.bat`, but can you tell me how it's meant to be added? Is it in the pipeline config?
Might the failure have been just after you cancelled? I wonder if there were some processes hanging around.
I will do some more testing on my side and let you know about the `pre-checkout.bat`.
During this period no build was manually cancelled, but we do have the automatic cancel on new commits turned on (the "Skip pull request builds for existing commits" option).
Heya @lox, did another status check today: 5 checkout failures out of 48 builds on the master branch. There are other checkout failures on other branches, but I didn't check all of them.
4 of those were during pipeline update, 1 was during the subsequent docker plugin step. I looked into the surrounding builds on those agents, but could only see one, since the jobs list for a given machine did not allow me to go back beyond a certain point.
For 1465 and 1467 on gce-buildkite-windows-2-3:
- windows-test
- windows-test, preceded by `warning: failed to remove ...` after `git clean -fxdq`
- windows-test
- pipeline upload
- pipeline upload
In this particular case it looks like the checkout failures were immediately preceded by an automatically cancelled build. 4 builds in 2 minutes right after that one failed on checkout, but only the first of them listed `warning: failed to remove ...` errors. The next build was more than an hour away and checked out successfully.
The first build was automatically cancelled, then 20s later the following build shows warnings on `git clean` and fails on `git clean`. The next builds don't do `git clean`, they just fail on `git clone`.
If I had to bet, I'd say the automatically cancelled build still kept some files open on the docker volume, which caused the next builds to fail. Perhaps the kill command was sent to docker but the container did not immediately exit. Locally I've seen that happen on some occasions when I tried killing a Windows docker container during long-running operations.
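If that's what's happening, even a naive retry-with-delay around the clean would probably paper over it, since the container releases its handles a few seconds after the kill. A rough sketch of such a helper (hypothetical, not what the agent actually runs; the usage line at the bottom is just an example):

```shell
# Hypothetical helper: retry a command up to N times with a fixed delay,
# on the assumption that a cancelled container releases its file handles
# within a few seconds. Not the agent's actual retry implementation.
retry() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "command still failing after $max attempts" >&2
      return 1
    fi
    echo "attempt $attempt/$max failed, retrying in ${delay}s..."
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: retry the clean up to 3 times, 2 seconds apart.
# retry 3 2 git clean -fxdq
```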
I am also receiving this fwiw
@dbousamra are you also running Windows machines?
@filipesilva I think you are 100% on the money with:
> If I had to bet I'd say the automatically canceled build still kept some opened files on the docker volume, which caused the next builds to fail.
I've made some major changes to how we handle signals and cancellation under windows. They are in master and I'm going to release them soon. My hope is that we will be much more effective at canceling jobs and terminating processes which should solve this problem.
I'm also going to add some locking on the checkout directory so we know for sure!
@filipesilva No, I'm on Linux. I think I'm getting the `git clean` permissions error because Docker is writing files using a different user to the one buildkite is using.
Yup, completely unrelated issue I am afraid @dbousamra. Presumably yours are permission denied errors?
Yeah perms denied. Sorry for confusion. Presumably this issue: https://github.com/buildkite/agent/issues/407
Yup exactly, check out my comment in https://github.com/buildkite/agent/issues/407#issuecomment-370573431 @dbousamra for some tips; otherwise, drop us an email at support@buildkite.com.
Very interested to see if this crops up again with the new agent changes @filipesilva!
@lox updating to 3.8.2 now, thanks for the ping!
@lox heya just wanted to give you a quick update. Still seeing the same checkout errors, and they definitely seem related to a cancelled build. Using v3.8.2 agent at the moment.
I was able to capture a good timeline of this happening for our gce-buildkite-windows-4-1 agent (times in GMT AFAICT):
The important builds are:
- the first build shows `warning: failed to remove node_modules` warnings, then proceeds to fail checkout
- every build after that one fails checkout

Failed builds span about 10 minutes. No other builds are recorded on that agent as of now.
Thanks for the update @filipesilva, I'm sorry you are still having issues. I wonder if perhaps this is related to https://github.com/git-for-windows/git/issues/521?
I notice you have `--config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true` set for your clone flags; what if you added that to the `git-clean-flags` too?
@lox I don't think those flags would do anything when added to `git clean`, though. Setting the flags on clone will persist them into `.git/config`, where they are used for all subsequent operations on that repository.
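As an aside, if per-invocation config were ever wanted, git accepts `-c key=value` on the `git` binary itself rather than on the subcommand (`git clean` has no `--config` flag). A sketch:

```shell
# Per-invocation config goes on the git binary, not the subcommand.
# On non-Windows git, core.longpaths is simply ignored, so this is safe
# to run anywhere. Illustrative only.
git -c core.longpaths=true clean -fxdq
```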
I tested this by cloning the same git repository twice, once with flags and once without, and then checking the contents of `.git/config`:
With flags:

```
[core]
	repositoryformatversion = 0
	filemode = false
	bare = false
	logallrefupdates = true
	ignorecase = true
	autocrlf = input
	eol = lf
	longpaths = true
	symlinks = true
```

Without flags:

```
[core]
	repositoryformatversion = 0
	filemode = false
	bare = false
	logallrefupdates = true
	ignorecase = true
```
Can anyone help me with this, please?

```
🚨 Error: Failed to remove "/var/lib/buildkite-agent/builds/fandom-connect/test-dev-spring-service" (unlinkat /var/lib/buildkite-agent/builds/test-dev-spring-service/config/postgres: permission denied)
```
Leaving a comment here in case it helps others. For our builds, the underlying issue was the way Go sets folder permissions (unwritable by default) when fetching modules. See here for more. In short, the solution is to use the `-modcacherw` flag for `go get`, `go build`, etc.
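If you'd rather not pass the flag to every command, Go also picks it up from the `GOFLAGS` environment variable, so you can set it once, for example in a CI environment hook (the placement is just an illustration):

```shell
# Make the Go module cache writable so `git clean` and checkout removal can
# delete it. -modcacherw (Go 1.14+) leaves module cache files read-write
# instead of the default read-only.
export GOFLAGS="-modcacherw"

# Every subsequent go invocation inherits the flag, e.g.:
#   go get ./...
#   go build ./...
```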
Heya,
We've recently set up a Windows pipeline (https://buildkite.com/angular/angular) and have observed some odd errors (log [1](https://buildkite.com/angular/angular/builds/70#9fc8283b-7088-4aee-9582-53166c9a8d3d), 2, 3, 3). The pipeline is public so anyone should be able to see the builds.
All of these fail in the Preparing working directory step in a similar pattern: somehow the folder is no longer a git repository and the retries fail. It doesn't always happen, and subsequent builds of the same code can succeed (see 67 and then 86).
There isn't much for us to go on here since our build steps never really run. This all happens in the agent repository cloning code.
@lox @toolmantim this seems like a more severe version of what I saw a couple of weeks ago (https://github.com/buildkite-plugins/docker-buildkite-plugin/issues/85#issuecomment-441130155), but this time the retries did not work.