buildkite / agent

The Buildkite Agent is an open-source toolkit written in Go for securely running build jobs on any device or network
https://buildkite.com/
MIT License
812 stars, 300 forks

Build steps hang on Windows if child processes still running #2202

Open brianseeders opened 1 year ago

brianseeders commented 1 year ago

I have the following script used in a Buildkite step (for reproduction purposes):

echo 1
sleep 60 &
echo 2

Note that sleep 60 & goes to the background, so this script exits immediately.

On Linux, when running this script in a Buildkite step, the step finishes immediately (not after waiting 60s), as I would expect.

On Windows, however, the Buildkite step hangs for 60 seconds, waiting for the child process to finish, even though the parent process has completed.

It doesn't matter which shell is specified: I've tried powershell, pwsh, bash, and no shell (which defaults to cmd), and the behavior is always the same. The script exits immediately if run on Windows outside of Buildkite. This all leads me to believe the cause is the Buildkite agent itself and how it manages processes.

Is this difference expected behavior? We have complex pipelines that spawn a lot of child processes (a simple case is Gradle daemons), and they hang indefinitely on Windows.

Is there a way around the behavior? I can't think of anything related to our environment that could cause this.

You can also run, for example, in batch: bash.exe -c 'echo 1; sleep 60 ^& echo 2;'

Tested on: Windows 2022, 2019, and 2016; buildkite-agent 3.48.0 and 3.49.0

Note that I e-mailed support, and they asked me to open an issue here.

brianseeders commented 1 year ago

Also, in order to rule out any issues related to our environment/images, I created a VM in GCP using their base Windows Server 2022 image. I installed chocolatey, git, bash, and buildkite-agent. I connected buildkite-agent to our org and ran my job, and it still hangs. I also tested buildkite-agent v3.0.0, and it still happens on this very old version.

moskyb commented 1 year ago

g'day @brianseeders! thanks for this - it definitely seems like something hinky is going on, we're gonna take a look into it.

thinking out loud, we don't run jobs in PTYs on windows, which seems like it could be the source of this - perhaps we could run in PTY mode on windows iff bash is the shell we're in? will think about this a bit more.

brianseeders commented 1 year ago

Thanks. Based on that, I'm guessing it won't be particularly easy for me to try PTY mode myself?

I've been trying to find a workaround for this. I started experimenting with this: https://github.com/elastic/elasticsearch/blob/cec2769216409fc143cb05048f9ecd0fedc4341a/.buildkite/scripts/windows-end-job.ps1

I'm basically ending every step by sending ctrl-c to the agent process twice. It only works because we're using ephemeral one-shot agents. It works, but the exit status isn't being reported correctly by Buildkite for certain steps and I haven't been able to figure out why.

I have a powershell script that looks like this at the end:

          echo "Exiting with $exitCode"
          exit $exitCode

and the Buildkite log shows "Exiting with 1", but the step is sometimes reported as successful.

I'll try the latest release as I see there has been a ton of work/refactoring happening, and report back.

brianseeders commented 1 year ago

Nope, the exit code issue is still present on the latest agent as well. Job uuid 01898929-e64a-4306-90c9-2cffe43c1b2b if you'd like to see. It reproduces reliably for this job, but the job takes 90 minutes. I'm trying to come up with a smaller example for it.

brianseeders commented 1 year ago

Ending my day here, and wanted to give one more update. It turns out that the real exit code is lost when the job has to be forcibly terminated (e.g. when you see this: Job 01898929-e64a-4306-90c9-2cffe43c1b2b hasn't stopped in time, terminating). In that case the step always shows as successful. Graceful exits capture the exit code correctly.

I'm guessing the hanging issue itself centers around the job object / process group stuff, but I'm still trying to understand why. Maybe Go's cmd.Wait() ends up waiting for the process job itself to complete? I'm not sure.

brianseeders commented 1 year ago

Another update here. I finally seem to have figured out a workaround, with an idea from @rjernst.

https://github.com/elastic/elasticsearch/blob/buildkite-migration/.buildkite/scripts/run-script.ps1

I'm creating my own nested job, and closing it when the main script finishes executing. This cleans up any lingering processes, which allows the buildkite-agent to move on.

triarius commented 1 year ago

Thanks for sharing the workaround here @brianseeders 💖
