JuliaCI / sandboxed-buildkite-agent

2 stars 0 forks source link

Windows Buildkite agents are taking new jobs before the VM has rebooted #42

Open DilumAluthge opened 1 year ago

DilumAluthge commented 1 year ago

This often manifests itself with the following error message:

"Secret private key path 'C:\secrets/agent.key' is not a valid private RSA key!"

staticfloat commented 1 year ago

Thanks to the debugging code added in, we can see in this example that the previous run was canceled by a SIGINT, presumably sent because the entire job was canceled due to a new commit being pushed before the old commit was done testing:

Old run: https://buildkite.com/julialang/julia-master/builds/20727#0186216b-d26a-4bde-b924-469346201a0b/4392-4408 New run: https://buildkite.com/julialang/julia-master/builds/20728#01862174-1b9b-4a4a-8aa9-a548b4eb7635/6-23

This shouldn't cause a problem, but clearly we do have a problem, so let's trace through and see if we can find any issues with our reboot functionality:

So the smoking gun points to, unsurprisingly, the NSSM script that should be invoking shutdown, but it's failing for some reason.

Keno commented 1 month ago
  • Next, nssm should be invoking shutdown immediately after the buildkite-agent process exits. It's unclear if this is happening or not.

I'm always nervous about shell quoting, particularly on Windows. Are we sure that this ends up getting parsed appropriately by cmd and not appended as garbage to the end of the buildkite invocation or to some internal command line inside nssm?

staticfloat commented 1 month ago

I logged into a machine running a job, and used a powershell script to tail the windows event log for nssm and for User32, this is what I saw:

   ProviderName: nssm

TimeCreated                      Id LevelDisplayName Message
-----------                      -- ---------------- -------
7/26/2024 5:52:02 PM           1040 Information      Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:04 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W... 
7/26/2024 5:52:17 PM           1040 Information      Service buildkite-agent received SHUTDOWN control, which will be handled.
7/26/2024 5:52:17 PM           1011 Information      Killing process C:\Windows\System32\cmd.exe because service buildkite-agent is stopping.
7/26/2024 5:52:31 PM           1040 Information      Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:32 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W... 
7/26/2024 5:52:02 PM           1040 Information      Service buildkite-agent received START control, which will be handled.                                                                                       
7/26/2024 5:52:04 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W... 
7/26/2024 5:52:17 PM           1040 Information      Service buildkite-agent received SHUTDOWN control, which will be handled.
7/26/2024 5:52:17 PM           1011 Information      Killing process C:\Windows\System32\cmd.exe because service buildkite-agent is stopping.
7/26/2024 5:52:31 PM           1040 Information      Service buildkite-agent received START control, which will be handled.
7/26/2024 5:52:32 PM           1008 Information      Started C:\Windows\System32\cmd.exe /C C:\buildkite-agent\bin\buildkite-agent.exe start & shutdown /s /t 0 /f /d p:4:1 for service buildkite-agent in C:\W...
7/26/2024 6:21:31 PM           1013 Information      Program C:\Windows\System32\cmd.exe for service buildkite-agent exited with return code 0.
7/26/2024 6:21:31 PM           1023 Information      Killing process tree of process 4048 for service buildkite-agent with exit code 0
7/26/2024 6:21:31 PM           1027 Information      Killing PID 4048 in process tree of PID 4048 because service buildkite-agent is stopping.
7/26/2024 6:21:31 PM           1016 Information      Service buildkite-agent action for exit code 0 is Exit. Exiting.
7/26/2024 6:21:31 PM           1012 Information      Requested stop of service buildkite-agent.  No action is required as program C:\Windows\System32\cmd.exe is not running.
   ProviderName: User32

TimeCreated                      Id LevelDisplayName Message                                                                                                                                                      
-----------                      -- ---------------- -------                                                                                                                                                      
6/26/2024 1:37:33 AM           1074 Information      The process C:\Windows\system32\winlogon.exe (MINWINPC) has initiated the restart of computer WIN-HKTBISMGI26 on behalf of user NT AUTHORITY\SYSTEM for th...
6/25/2024 6:46:41 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:04:29 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:17:11 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
6/25/2024 7:24:09 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
7/26/2024 5:52:08 PM           1074 Information      The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user NT AUTHORITY\SYSTEM...
6/26/2024 1:37:33 AM           1074 Information      The process C:\Windows\system32\winlogon.exe (MINWINPC) has initiated the restart of computer WIN-HKTBISMGI26 on behalf of user NT AUTHORITY\SYSTEM for th...
6/25/2024 6:46:41 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:04:29 PM           1074 Information      The process C:\Windows\system32\wbem\wmiprvse.exe (WINBASE) has initiated the restart of computer WINBASE on behalf of user WINBASE\julia for the followin...
6/25/2024 7:17:11 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
6/25/2024 7:24:09 PM           1074 Information      The process wininit.exe (WINBASE) has initiated the power off of computer WINBASE on behalf of user WINBASE\Administrator for the following reason: Applic...
7/26/2024 5:52:08 PM           1074 Information      The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user NT AUTHORITY\SYSTEM...
7/26/2024 6:21:31 PM           1074 Information      The process C:\Windows\System32\shutdown.exe (WIN2K22-AMDCI6-) has initiated the shutdown of computer WIN2K22-AMDCI6- on behalf of user WIN2K22-AMDCI6-\ju...

So it looks to me like it is correctly invoking the shutdown command. In any case, if it weren't, our bots wouldn't work at all, rather than not working sporadically.

staticfloat commented 1 month ago

I wonder if shutdown can ever fail? Is that possible?

Keno commented 1 month ago

Those are all the success case though. Do we have any logs for the failure case?

staticfloat commented 1 month ago

No, I have to login and instrument the VM beforehand, as all state is lost after each job finishes.

Keno commented 1 month ago

Can we dump the logs in one of the early scripts if C:\secrets/agent.key is missing, which is one of the indicators of this failure?