buildkite / agent

The Buildkite Agent is an open-source toolkit written in Go for securely running build jobs on any device or network
https://buildkite.com/
MIT License
806 stars 295 forks source link

Repeated `Warning: Checkout failed` errors while preparing working directory #866

Open filipesilva opened 5 years ago

filipesilva commented 5 years ago

Heya,

We've recently setup a windows pipeline (https://buildkite.com/angular/angular) and have observed some odd errors (log [1](https://buildkite.com/angular/angular/builds/70#9fc8283b-7088-4aee-9582-53166c9a8d3d], 2, 3, 3). The pipeline is public so anyone should be able to see the builds.

All of these fail in the Preparing working directory in a similar pattern: somehow the folder is no longer a git repository and the retries fail:

[2018-12-06T07:58:04Z] > cd C:\buildkite-agent\builds\gce-buildkite-windows-1-3\angular\angular
--
  | [2018-12-06T07:58:04Z] > git remote set-url origin https://github.com/angular/angular
  | [2018-12-06T07:58:04Z] fatal: not a git repository (or any of the parent directories): .git
  | [2018-12-06T07:58:04Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-1-3\angular\angular
  | [2018-12-06T07:58:04Z] ⚠️ Warning: Checkout failed! Error running `C:\git\cmd\git.exe remote set-url origin https://github.com/angular/angular`: exit status 128 (Attempt 1/3 Retrying in 2s)
  | [2018-12-06T07:58:06Z] > git remote set-url origin https://github.com/angular/angular
  | [2018-12-06T07:58:06Z] fatal: not a git repository (or any of the parent directories): .git
  | [2018-12-06T07:58:06Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-1-3\angular\angular
  | [2018-12-06T07:58:06Z] ⚠️ Warning: Checkout failed! Error running `C:\git\cmd\git.exe remote set-url origin https://github.com/angular/angular`: exit status 128 (Attempt 2/3 Retrying in 2s)
  | [2018-12-06T07:58:08Z] > git clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .
  | [2018-12-06T07:58:08Z] fatal: destination path '.' already exists and is not an empty directory.
  | [2018-12-06T07:58:08Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-1-3\angular\angular
  | [2018-12-06T07:58:08Z] ⚠️ Warning: Checkout failed! Error running `C:\git\cmd\git.exe clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .`: exit status 128 (Attempt 3/3 Retrying in 2s)
  | [2018-12-06T07:58:10Z] 🚨 Error: Error running `C:\git\cmd\git.exe clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .`: exit status 128

It doesn't always happen and subsequent builds of the same code can succeed (see 67 and then 86.

There isn't much for us to go on here since our build steps never really run. This all happens in the agent repository cloning code.

@lox @toolmantim this seems like a more severe version of what I saw a couple of weeks ago (https://github.com/buildkite-plugins/docker-buildkite-plugin/issues/85#issuecomment-441130155), but this time the retries did not work.

lox commented 5 years ago

Sorry for the slow reply on this one, it looks like this one is caused by the lack of a .git directory in the checkout directory 🤔 This is sometimes caused by a corrupted checkout, I thought we had checks specifically for that, but perhaps not. Let me investigate.

filipesilva commented 5 years ago

@lox is there anything that I can do to help?

lox commented 5 years ago

We made some substantial changes to how we recover from checkout failures in 3.6.0, would you be able to update @filipesilva? We'll collaborate on any issues you see and try and get fixes out ASAP!

filipesilva commented 5 years ago

@lox re-provisioned our CI machine using the 3.6.1 agent and will be monitoring builds now to see if it happens again 👍

filipesilva commented 5 years ago

Heya @lox!

Just looked at our builds on master for the last 10 days. There were 24 builds, one of which failed to checkout: https://buildkite.com/angular/angular/builds/1190#17b7ee20-d02d-489b-a06f-30ae547bcd81

Seems to be a similar message as before:

(lots of these)
[2018-12-21T20:08:05Z] warning: failed to remove node_modules/xregexp: Directory not empty
--
  | [2018-12-21T20:08:05Z] warning: failed to remove node_modules/yargs/node_modules: Directory not empty
  | [2018-12-21T20:08:05Z] warning: failed to remove node_modules/zip-dir/node_modules/jszip: Directory not empty
  | [2018-12-21T20:08:05Z] warning: failed to remove node_modules/zip-dir/node_modules/pako/lib: Directory not empty
  | [2018-12-21T20:08:05Z] warning: failed to remove node_modules/zip-dir/test/fixtures/sampleZip/dir/deepDir: Directory not empty
  | [2018-12-21T20:08:05Z] ⚠️ Warning: Checkout failed! Error running `C:\git\bin\git.exe clean -fxdq`: exit status 1 (Attempt 1/3 Retrying in 2s)
  | [2018-12-21T20:08:05Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular
  | [2018-12-21T20:08:13Z] > git remote set-url origin https://github.com/angular/angular
  | [2018-12-21T20:08:13Z] fatal: not a git repository (or any of the parent directories): .git
  | [2018-12-21T20:08:13Z] ⚠️ Warning: Checkout failed! Error running `C:\git\bin\git.exe remote set-url origin https://github.com/angular/angular`: exit status 128 (Attempt 2/3 Retrying in 2s)
  | [2018-12-21T20:08:13Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular
  | [2018-12-21T20:08:15Z] > git clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .
  | [2018-12-21T20:08:15Z] fatal: destination path '.' already exists and is not an empty directory.
  | [2018-12-21T20:08:15Z] ⚠️ Warning: Checkout failed! Error running `C:\git\bin\git.exe clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .`: exit status 128 (Attempt 3/3 Retrying in 2s)
  | [2018-12-21T20:08:15Z] # Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular
  | [2018-12-21T20:08:17Z] 🚨 Error: Error running `C:\git\bin\git.exe clone --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true -- https://github.com/angular/angular .`: exit status 128

So I think we are still suffering from the failed checkout currently.

lox commented 5 years ago

Thanks for that @filipesilva! Two issues going in in there, which I want to be careful to distinguish between.

1. git clean -fxdq is failing with lots of warning: failed to remove node_modules/...: Directory not empty errors.

This will be the root cause. From some google searches, it looks like this happens when another process is still using that directory. The question is, what could still be using that directory? Apparently things like Visual Studio or TortoiseGit sometimes hang onto locks for longer than they should.

I have a few theories about what might be causing this. It could be that a bug in our cancelation process means that there are some processes hanging around that are still doing things with the node_modules/ directory. I'm doing some work to make cancellation more robust in https://github.com/buildkite/agent/pull/879 that should significantly help if this theory is right.

The other theory is that something outside of the agent flow is retaining locks on those directories.

If you were inclined, adding a debugging script in a pre-checkout.bat script that ran Handle would be super helpful.

I think this would work:

@echo off
handle.exe /accepteula % BUILDKITE_BUILD_CHECKOUT_PATH%

2. Bug where bootstrap doesn't re-clone checkout dir after error 🤔

In the output above, we can see the checkout failure, and then the message about removing the checkout:

# Removing C:\buildkite-agent\builds\gce-buildkite-windows-2-2\angular\angular

Then there are errors about the lack of a git repo:

fatal: not a git repository

This seems to be a bug with the "delete checkout on error before retry" logic we introduced in 3.6.0. I'll follow up on this.

lox commented 5 years ago

There is definitely some weird stuff going on in there currently @filipesilva. Are you certain there aren't multiple buildkite-agent processes sharing the same checkout path? Can you tell us a bit more about your concurrency model?

lox commented 5 years ago

Perhaps we should use our cross-process file locking to make double sure that agents aren't sharing a checkout dir. I'll investigate that too.

filipesilva commented 5 years ago

Thanks for getting back to me!

We use dedicated windows hosts for the BuildKite agents, with nothing else running on them. This script is used for the provisioning and it's pretty much following https://buildkite.com/docs/agent/v3/windows, except it starts a variable number of agents at the end.

I don't think there are other processes like Visual Studio hanging on because it's a dedicated machine.

I think each agent is using a different folder mostly because different build folders exist:

C:\buildkite-agent\builds>dir
 Volume in drive C has no label.
 Volume Serial Number is 8248-D80A

 Directory of C:\buildkite-agent\builds

12/18/2018  01:06 PM    <DIR>          .
12/18/2018  01:06 PM    <DIR>          ..
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-1
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-2
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-3
12/18/2018  01:05 PM    <DIR>          gce-buildkite-windows-2-4
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-5
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-6
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-7
12/18/2018  01:06 PM    <DIR>          gce-buildkite-windows-2-8
               0 File(s)              0 bytes
              10 Dir(s)  110,628,233,216 bytes free

There's 8 of these directories, but we only have 4 agents running: image

I remember there was a point at which I restarted the machine and the old agent processes had not yet timed out, so new directories were taken for the new processes.

Should I be using different agent installs for different agents?

I can add that pre-checkout.bat, but can you tell me how it's meant to be added? Is it on the pipeline config?

lox commented 5 years ago

Might the failure have been just after you cancelled? I wonder if there were some processes hanging around.

I will do some more testing on my side and let you know about the pre-checkout.bat.

filipesilva commented 5 years ago

During this period no build was manually cancelled, but we do have the automatic cancel on new commits turned on (the Skip pull request builds for existing commits option).

filipesilva commented 5 years ago

Heya @lox, did another status check today: 5 checkout failures out of 48 builds on the master branch. There are other checkout failures on other branches but I didn't list check all of them.

4 of those were during pipeline update, 1 was during the subsequent docker plugin step. I looked into the surrounding builds in those agents. Could only see one since the jobs list for a given machine did not allow me to go beyond a certain point.

For 1465 and 1467 on gce-buildkite-windows-2-3:

In this particular case it looks like the checkout failures were immediately preceded by an automatically cancelled build. 4 builds in 2 minutes right after that one failed on checkout, but only the first of them listed warning: failed to remove ... errors. The next build was more than an hour away and checked out successfully.

The first build was automatically cancelled, then 20s later the following build shows warnings on git clean and fails on git clean. The next builds don't do git clean, just fail on git clone.

If I had to bet I'd say the automatically canceled build still kept some opened files on the docker volume, which caused the next builds to fail. Perhaps the kill command was sent to docker but the container did not immediately exit. Locally I've seen that happen on some occasions on when I tried killing a windows docker container during long-running operations.

dbousamra commented 5 years ago

I am also receiving this fwiw

filipesilva commented 5 years ago

@dbousamra are you also running Windows machines?

lox commented 5 years ago

@filipesilva I think you are 100% on the money with:

If I had to bet I'd say the automatically canceled build still kept some opened files on the docker volume, which caused the next builds to fail.

I've made some major changes to how we handle signals and cancellation under windows. They are in master and I'm going to release them soon. My hope is that we will be much more effective at canceling jobs and terminating processes which should solve this problem.

I'm also going to add some locking on the checkout directory so we know for sure!

dbousamra commented 5 years ago

@filipesilva No i'm on Linux. I think I'm getting there git clean permissions error because Docker is writing files using a diff user to the one buildkite is using.

lox commented 5 years ago

Yup, completely unrelated issue I am afraid @dbousamra. Presumably yours are permission denied errors?

dbousamra commented 5 years ago

Yeah perms denied. Sorry for confusion. Presumably this issue: https://github.com/buildkite/agent/issues/407

lox commented 5 years ago

Yup exactly, checkout my comment in https://github.com/buildkite/agent/issues/407#issuecomment-370573431 @dbousamra for some tips, otherwise, drop us an email at support@buildkite.com.

lox commented 5 years ago

Very interested to see if this crops up again with the new agent changes @filipesilva!

filipesilva commented 5 years ago

@lox updating to 3.8.2 now, thanks for the ping!

filipesilva commented 5 years ago

@lox heya just wanted to give you a quick update. Still seeing the same checkout errors, and they definitely seem related to a cancelled build. Using v3.8.2 agent at the moment.

I was able to capture a good timeline of this happening for our gce-buildkite-windows-4-1 agent (time in GMT AFAICT):

image

The important builds are:

Every build after that one fails checkout. Failed builds span about 10 minutes. No other builds recorded on that agent as of now.

lox commented 5 years ago

Thanks for the update @filipesilva, I'm sorry you are still having issues. I wonder if perhaps this is related to https://github.com/git-for-windows/git/issues/521?

I notice you have --config core.autocrlf=input --config core.eol=lf --config core.longpaths=true --config core.symlinks=true set for your clone flags, what if you added that to the git-clean-flags too?

filipesilva commented 5 years ago

@lox I don't think those flags would do anything when added to git clean though. Setting the flags on clone will persist them into .git/config and be used throughout usage of that repository.

I tested this by cloning the same git repository twice, once with flags and once without, and then checking the contents of .git/config:

// with flags
[core]
    repositoryformatversion = 0
    filemode = false
    bare = false
    logallrefupdates = true
    ignorecase = true
    autocrlf = input
    eol = lf
    longpaths = true
    symlinks = true
// without flags
[core]
    repositoryformatversion = 0
    filemode = false
    bare = false
    logallrefupdates = true
    ignorecase = true
Mohammadaraz commented 4 years ago

Can anyone help me in this please

Removing /var/lib/buildkite-agent/builds/test-dev-spring-service

--   | 🚨 Error: Failed to remove "/var/lib/buildkite-agent/builds/fandom-connect/test-dev-spring-service" (unlinkat /var/lib/buildkite-agent/builds/test-dev-spring-service/config/postgres: permission denied)

ivotron commented 3 years ago

leaving a comment here in case it helps others. For our builds, the underlying issue was the way Go sets folder permissions (unwritable by default) when fetching modules. See here for more. In short, the solution is to use the -modcacherw flag for go get, go build, etc...