Open brianseeders opened 1 year ago
Also, in order to rule out any issues related to our environment/images, I created a VM in GCP using their base Windows Server 2022 image. I installed chocolatey, git, bash, and buildkite-agent. I connected buildkite-agent to our org and ran my job, and it still hangs. I also tested buildkite-agent v3.0.0, and it still happens on this very old version.
g'day @brianseeders! thanks for this - it definitely seems like something hinky is going on, we're gonna take a look into it.
thinking out loud, we don't run jobs in PTYs on windows, which seems like it could be the source of this - perhaps we could run in PTY mode on windows iff bash is the shell we're in? will think about this a bit more.
Thanks. I'm guessing based on this that it won't be particularly easy to try this?
I've been trying to find a workaround for this. I started experimenting with this: https://github.com/elastic/elasticsearch/blob/cec2769216409fc143cb05048f9ecd0fedc4341a/.buildkite/scripts/windows-end-job.ps1
I'm basically ending every step by sending ctrl-c to the agent process twice. It only works because we're using ephemeral one-shot agents. It works, but the exit status isn't being reported correctly by Buildkite for certain steps and I haven't been able to figure out why.
I have a powershell script that looks like this at the end:
echo "Exiting with $exitCode"
exit $exitCode
and the Buildkite log shows "Exiting with 1", but the step is successful sometimes.
I'll try the latest release as I see there has been a ton of work/refactoring happening, and report back.
Nope, the exit code issue is still present on the latest agent as well. See job UUID 01898929-e64a-4306-90c9-2cffe43c1b2b if you'd like to take a look. It reproduces reliably for this job, but the job takes 90 minutes. I'm trying to come up with a smaller example for it.
Ending my day here, and wanted to give one more update. It turns out that the real exit code is lost when the job has to be forcibly terminated (e.g. when you see this: "Job 01898929-e64a-4306-90c9-2cffe43c1b2b hasn't stopped in time, terminating"). In that case the step always shows as successful. Graceful exits capture the exit code correctly.
I'm guessing the hanging issue itself centers around the job object / process group stuff, but I've been trying to understand why. Maybe golang's cmd.Wait() ends up waiting for the process job itself to complete? I'm not sure.
Another update here. I finally seem to have figured out a workaround, with an idea from @rjernst.
https://github.com/elastic/elasticsearch/blob/buildkite-migration/.buildkite/scripts/run-script.ps1
I'm creating my own nested job, and closing it when the main script finishes executing. This cleans up any lingering processes, which allows the buildkite-agent to move on.
Thanks for sharing the workaround here @brianseeders 💖
I have the following script used in a Buildkite step (for reproduction purposes):
echo 1
sleep 60 &
echo 2
Note that "sleep 60 &" goes to the background, so this script exits immediately.
On Linux, when running this script in a Buildkite step, the step finishes immediately (not after waiting 60s), as I would expect.
On Windows, however, the Buildkite step hangs for 60 seconds, waiting for the child process to finish, even though the parent process has completed.
It doesn't matter what shell is specified (I've tried powershell, pwsh, bash, and no shell (which defaults to cmd)), the behavior is always the same. The script exits immediately if run on Windows outside of Buildkite. This all leads me to believe it's the Buildkite agent itself and how it manages processes.
Is this difference expected behavior? We have complex pipelines that spawn a lot of child processes (for example, a simple case is gradle daemons) and they hang indefinitely on Windows.
Is there a way around the behavior? I can't think of anything related to our environment that could cause this.
You can also run, for example, in batch:
bash.exe -c 'echo 1; sleep 60 ^& echo 2;'
Tested on: Windows 2022, 2019, and 2016; buildkite-agent 3.48.0 and 3.49.0
Note that I e-mailed support, and they asked me to open an issue here.