Open DilumAluthge opened 1 year ago
Thanks to the debugging code added in, we can see in this example that the previous run was canceled by a SIGINT, presumably sent because the entire job was canceled due to a new commit being pushed before the old commit was done testing:
Old run: https://buildkite.com/julialang/julia-master/builds/20727#0186216b-d26a-4bde-b924-469346201a0b/4392-4408 New run: https://buildkite.com/julialang/julia-master/builds/20728#01862174-1b9b-4a4a-8aa9-a548b4eb7635/6-23
This shouldn't cause a problem, but clearly we do have a problem, so let's trace through and see if we can find any issues with our reboot functionality:
nssm
should be invoking shutdown
immediately after the buildkite-agent
process exits. It's unclear if this is happening or not.So the smoking gun points to, unsurprisingly, the NSSM script that should be invoking shutdown, but it's failing for some reason.
- Next,
nssm
should be invokingshutdown
immediately after thebuildkite-agent
process exits. It's unclear if this is happening or not.
I'm always nervous about shell quoting, particularly on Windows. Are we sure that this ends up getting parsed appropriately by cmd
and not appended as garbage to the end of the buildkite invocation or to some internal command line inside nssm?
I logged into a machine running a job, and used a powershell script to tail the windows event log for nssm
and for User32
, this is what I saw:
ProviderName: nssm
TimeCreated Id LevelDisplayName Message
----------- -- ---------------- -------
7/26/2024 5:52:02 PM 1040 Information Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:04 PM 1008 Information Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W...
7/26/2024 5:52:17 PM 1040 Information Service buildkite-agent received SHUTDOWN control, which will be handled.
7/26/2024 5:52:17 PM 1011 Information Killing process C:\Windows\System32\cmd.exe because service buildkite-agent is stopping.
7/26/2024 5:52:31 PM 1040 Information Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:32 PM 1008 Information Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W...
7/26/2024 5:52:02 PM 1040 Information Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:04 PM 1008 Information Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W...
7/26/2024 5:52:17 PM 1040 Information Service buildkite-agent received SHUTDOWN control, which will be handled.
7/26/2024 5:52:17 PM 1011 Information Killing process C:\Windows\System32\cmd.exe because service buildkite-agent is stopping.
7/26/2024 5:52:31 PM 1040 Information Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:32 PM 1008 Information Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W...
7/26/2024 6:21:31 PM 1013 Information Program C:\Windows\System32\cmd.exe for service buildkite-agent exited with return code 0.
7/26/2024 6:21:31 PM 1023 Information Killing process tree of process 4048 for service buildkite-agent with exit code 0
7/26/2024 6:21:31 PM 1027 Information Killing PID 4048 in process tree of PID 4048 because service buildkite-agent is stopping.
7/26/2024 6:21:31 PM 1016 Information Service buildkite-agent action for exit code 0 is Exit. Exiting.
7/26/2024 6:21:31 PM 1012 Information Requested stop of service buildkite-agent. No action is required as program C:\Windows\System32\cmd.exe is not running.
ProviderName: User32
TimeCreated Id LevelDisplayName Message
----------- -- ---------------- -------
6/26/2024 1:37:33 AM 1074 Information The process C:\Windows\system32\winlogon.exe (MINWINPC) has initiated the restart of computer WIN-HKTBISMGI26 on behalf of user NT AUTHORITY\SYSTEM for th...
6/25/2024 6:46:41 PM 1074 Information The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:04:29 PM 1074 Information The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:17:11 PM 1074 Information The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
6/25/2024 7:24:09 PM 1074 Information The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
7/26/2024 5:52:08 PM 1074 Information The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user NT AUTHORITY\SYSTEM...
6/26/2024 1:37:33 AM 1074 Information The process C:\Windows\system32\winlogon.exe (MINWINPC) has initiated the restart of computer WIN-HKTBISMGI26 on behalf of user NT AUTHORITY\SYSTEM for th...
6/25/2024 6:46:41 PM 1074 Information The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:04:29 PM 1074 Information The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:17:11 PM 1074 Information The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
6/25/2024 7:24:09 PM 1074 Information The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
7/26/2024 5:52:08 PM 1074 Information The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user NT AUTHORITY\SYSTEM...
7/26/2024 6:21:31 PM 1074 Information The process C:\Windows\System32\shutdown.exe (WIN2K22-AMDCI6-) has initiated the shutdown of computer WIN2K22-AMDCI6- on behalf of user WIN2K22-AMDCI6-\ju...
So it looks to me like it is correctly invoking the shutdown
command. In any case, if it weren't, our bots wouldn't work at all, rather than not working sporadically.
I wonder if shutdown
can ever fail? Is that possible?
Those are all the success case though. Do we have any logs for the failure case?
No, I have to login and instrument the VM beforehand, as all state is lost after each job finishes.
Can we dump the logs in one of the early scripts if C:\secrets/agent.key
is missing, which is one of the indicators of this failure?
This often manifests itself with the following error message: