buildkite-plugins / docker-buildkite-plugin

🐳📦 Run any build step in a Docker container
MIT License

Multiple builds on the same host modify the same files #85

Closed: filipesilva closed this issue 6 years ago

filipesilva commented 6 years ago

When using this plugin, all the builds for the same repository are checked out in the same directory:

    Preparing working directory (46s)
    > cd C:\buildkite-agent\builds\gce-buildkite-windows-1-1\angular\angular
    > git remote set-url origin https://github.com/angular/angular
    > git clean -fxdq

After checkout, the repository root will be mounted on a Docker container using the --volume flag, which shares file changes between the container and the host file systems.

If a build is already running and a second build is triggered, the new commit is checked out into the same directory. The build that is already running has its code changed mid-build.

At best it will test the wrong code; at worst it will crash the build or cause other unexpected behaviour.

Even with Cancel Intermediate Builds turned on, this can cause odd behaviour, because files can still be locked while the previous build is being cancelled. This is especially noticeable on Windows, where locked files and directories cannot be deleted.

Regardless, multiple running builds should not interfere with each other. It is common to have multiple builds running at any given time (e.g. two PRs opened close together).

toolmantim commented 6 years ago

Sorry you’re having trouble there! Do the two agents have different names? Because each agent should have its own checkout dir, but it’s based on having unique agent names (which is why %n is included in the default name). As far as I know, the volume mounts in this plugin shouldn’t mess it up, as long as those agent names are unique.

It looks like this one was gce-buildkite-windows-1-1. Do you know what the second agent’s checkout/build directory and name was?

filipesilva commented 6 years ago

Heya @toolmantim, thanks for getting back to me!

It's not different agents though, it's the same agent. I have 1 agent, running on 1 host, and am pushing builds to 1 branch on github.

When I trigger a build via a commit, it checks out into C:\buildkite-agent\builds\gce-buildkite-windows-1-1\angular\angular and then uses that folder as a volume in Docker.

Then, if I trigger another build while the first one is still running, the same happens. Since both builds are using the same folder, they are sharing the files. While the first build is running, its files will be updated with the contents of the second checkout.
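
The collision described here can be sketched in a few lines of Python. This is only an illustration inferred from the path in the log above (build root, then agent name, then organization and pipeline). `checkout_dir` is a hypothetical helper, not the agent's actual code, and the second agent name is made up for the comparison.

```python
from pathlib import PureWindowsPath


def checkout_dir(build_root: str, agent_name: str, org: str, pipeline: str) -> PureWindowsPath:
    """Hypothetical helper: compose a checkout path the way the log suggests."""
    return PureWindowsPath(build_root) / agent_name / org / pipeline


root = r"C:\buildkite-agent\builds"

# Two builds picked up by the same agent resolve to the same directory.
first = checkout_dir(root, "gce-buildkite-windows-1-1", "angular", "angular")
second = checkout_dir(root, "gce-buildkite-windows-1-1", "angular", "angular")
print(first == second)

# A second agent with its own name (assumed here) would get its own directory.
other = checkout_dir(root, "gce-buildkite-windows-1-2", "angular", "angular")
print(first == other)
```

Under this assumed layout, the directory is a function of the agent name and the pipeline only, so nothing in the path separates two builds of the same pipeline on the same agent.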

That this happens actually sounds a bit odd to me. I can't imagine I am the only person running concurrent builds on the same agent. I wonder if I'm doing something wrong here.

filipesilva commented 6 years ago

Now that I think about it... is the same agent ever supposed to run concurrent builds? I was looking at it from the perspective of docker, and of the isolation provided by it.

But if the agent is supposed to be the unit of isolation, then the only way to get concurrent builds is to run multiple agents on the same machine. Is that how they are supposed to be used?

toolmantim commented 6 years ago

Ah, thanks for the details!

Yep, an agent can only run 1 job at a time, so that situation you’re describing should never happen. Sorry I didn’t make that clearer.

But if you are getting an unexpected error, we can look into it! Did you want to email the details of the builds to support@buildkite.com?

toolmantim commented 6 years ago

I should mention that agents are cheap, resource wise, and are designed to be run alongside one another. So if you wanna spin up multiple per host to increase concurrency, no problems there.

filipesilva commented 6 years ago

I saw a bunch of errors like this on build start:


    > cd C:\buildkite-agent\builds\gce-buildkite-windows-1-1\angular\angular
    > git remote set-url origin https://github.com/angular/angular
    > git clean -fxdq
    warning: failed to remove node_modules/@angular-devkit/core/node: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/core/node_modules/expand-brackets/node_modules: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/core/node_modules/extglob/node_modules: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/core/node_modules/glob-parent/node_modules: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/core/node_modules/is-accessor-descriptor/node_modules: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/core/node_modules/is-data-descriptor/node_modules: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/core/src: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/schematics/src: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/schematics/tasks/tslint-fix: Directory not empty
    warning: failed to remove node_modules/@angular-devkit/schematics/tools: Directory not empty
    warning: failed to remove node_modules/@bazel/bazel/node_modules: Directory not empty
    warning: failed to remove node_modules/@bazel/bazel-win32_x64/bazel-0.18.0-windows-x86_64.exe: Invalid argument
    warning: failed to remove node_modules/@bazel/ibazel/bin: Directory not empty

At the time I had Cancel Intermediate Builds turned on, and I recognised this sort of error from local manual runs as something that happens when trying to delete folders that are still in use. The errors cleared on an automatic retry:

    # Removing C:\buildkite-agent\builds\gce-buildkite-windows-1-1\angular\angular
    ⚠️ Warning: Checkout failed! Error running `C:\git\cmd\git.exe clean -fxdq`: exit status 1 (Attempt 1/3 Retrying in 2s)
    # Creating "C:\buildkite-agent\builds\gce-buildkite-windows-1-1\angular\angular"

But now that I know that a single agent is meant to run only one build at a time, that makes more sense. Windows is finicky with file locks so it's not a huge surprise that it kept them longer than it should.

I think I have no problem now that I understand the model better. In my head the single agent was coordinating several docker image runs so the single folder would be a problem.

Sorry for the noise!

toolmantim commented 6 years ago

Ahhh, that makes sense. No problems at all! Docker problems (and Windows file locks, it turns out) are super tricky to debug.

I wonder if we can improve that git clean behaviour? Timing problems are the worst.

lox commented 6 years ago

@filipesilva did the git clean operation retry for you, or did it fail the build? Generally we expect retries to handle things like hanging locks. Visual Studio leaves lots of those too :(
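
The retry behaviour visible in the log above ("Attempt 1/3 Retrying in 2s") can be sketched roughly as follows. This is an illustrative Python loop, not the agent's implementation. `retry`, `flaky_clean`, and the default values are assumptions chosen to mirror that log line.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, delay_seconds: float = 2.0) -> T:
    """Run operation, retrying on OSError up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as error:
            if attempt == attempts:
                raise  # out of attempts: let the failure surface
            print(f"Attempt {attempt}/{attempts} failed ({error}); retrying in {delay_seconds}s")
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")


# Stand-in for a cleanup that fails once while a lock is still held.
calls = {"count": 0}


def flaky_clean() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise OSError("Directory not empty")
    return "cleaned"


result = retry(flaky_clean, delay_seconds=0.0)
print(result, calls["count"])
```

A short pause between attempts gives a lingering lock time to be released, which matches the experience reported here of the second attempt succeeding.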

filipesilva commented 6 years ago

@lox the retry sorted it out, yes!

cfriedt commented 4 years ago

I'm actually running into the same issue right now.

The "root" of the problem (pun intended) is that the buildkite agent is run on the host operating system with the user id and permissions of the buildkite-agent user when it checks out code.

E.g. on my machine, the buildkite-agent has user:group buildkite-agent:buildkite-agent, or more precisely, 997:996.

However, in the docker environment, the default user:group is root:root, or more precisely, 0:0.

Unfortunately, this means that any files created by the build process in the container will be created with permissions associated with the root account in the host environment. So if there are files left-over by root from a previous build, then they cannot be deleted.

I'm not quite sure how to fix it.
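
One direction that may help, sketched here under assumptions: have the container run as the agent's own uid:gid via Docker's `--user` flag, so that files written to the mounted checkout belong to the agent user rather than root. The Python below only builds the `docker run` argument list. `docker_run_args`, the image name, the checkout path, and the `/workdir` mount point are all hypothetical, and the image would need to tolerate running as a non-root user.

```python
from typing import List


def docker_run_args(image: str, checkout: str, uid: int, gid: int) -> List[str]:
    """Hypothetical helper: build a docker run command that runs as uid:gid."""
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{gid}",            # run as the agent user instead of 0:0
        "--volume", f"{checkout}:/workdir",  # mount the checkout into the container
        "--workdir", "/workdir",
        image,
    ]


# 997:996 is the buildkite-agent uid:gid quoted above; on a POSIX host the
# values could instead come from os.getuid() and os.getgid().
args = docker_run_args("example-image:latest", "/path/to/checkout", 997, 996)
print(" ".join(args))
```

This would not remove root-owned files already left behind by earlier builds; those would still need a one-off cleanup with sufficient permissions.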

My buildkite agent version is buildkite-agent version 3.22.1, build x