buildkite / agent

The Buildkite Agent is an open-source toolkit written in Go for securely running build jobs on any device or network
https://buildkite.com/
MIT License
810 stars 300 forks source link

Buildkite checkout fails, and deletes most of the repo, if any user-owned files in the checkout path are non-writable #1368

Open ccarpita opened 3 years ago

ccarpita commented 3 years ago

Agent User: buildkite-agent Example Checkout path: /var/lib/buildkite-agent/software Read-only File: /var/lib/buildkite-agent/software/tmp/file

ls -l /var/lib/buildkite-agent/software/tmp/file
-r-xr-xr-x 1 buildkite-agent buildkite-agent 186824 Jan 11 11:49 /var/lib/buildkite-agent/software/tmp/file

In our case, a previous build rsyncs to the checkout directory, and writes tmp/file. While the owner is buildkite-agent, the permissions are set as u+rx and u-w, aka the file is "read-only".

On the next build will fail to git clean, and then it will try to delete the whole repo (and fail), and then try to clone (and fail). However, the user actually owns those files, and buildkite-agent could simply chmod o+w any files that it tries to delete, or rm -f <file>. Assuming buildkite-agent is using a filesystem interface in golang, it's likely some argument would have to change for it to delete write-only files.

yob commented 3 years ago

Huh, that does sound awkward. We'll take a look at reproducing it and see how we go.

In the interim, a hook (maybe an in-repo one?) that fixes the permissions prior to the job starting could be a good workaround.

marwanhilmi commented 3 years ago

We are constantly encountering this issue on Windows machines where the initial git checkout process is overzealous and ends up nuking the entire repository after a permissions failure. Permissions on Windows are notoriously annoying and a dangling process (or even an open Explorer window) ends up holding a lock on the folder causing the agent to try deleting the entire repository. We are working with 500+ GB repository and the effect of this is devastating.

⚠️ Warning: Checkout failed! exit status 3221225786 (Attempt 1/3 Retrying in 2s)
# Removing C:\builds\builder-1\org\repo
chloeruka commented 3 years ago

@marwanhilmi oof, sorry about that. Handling huge GitHub repos is particularly challenging and something we've tried to make a pass at making better, and it's on our radar for improving soon.

I'd recommend trying to reduce the size of the repo by performing a shallow clone, there's some more context @sj26 wrote over in https://github.com/buildkite/agent/issues/437#issuecomment-708068425

It also would be worth checking out the git mirrors experiment flags to see if that will help. It can be used to set up a local cache of the git repo so that checkouts can be faster. https://buildkite.com/docs/agent/v3/configuration#git-mirrors-path

stephenlacy commented 3 years ago

Would a flag be accepted? BUILDKITE_EXIT_ON_CHECKOUT_FAILURE=true https://github.com/stevelacy/agent/commit/4db6f24376ec8f63191ac707747e93f9235e796c

We are using LFS - even a shallow clone will be huge.

chloeruka commented 3 years ago

🤔 we could discuss it. I think the preferred approach in that situation would be to just override the checkout hook as @yob suggested so that it doesn't purge the directory, is there a reason why that wouldn't work here?