CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners
https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/
Apache License 2.0

Windows EC2 doesn't reach the stop state in EC2 userdata script #570

Open pharindoko opened 2 months ago

pharindoko commented 2 months ago

Hey @kichik,

I have one special case that I can reproduce: although the job has been successfully completed in GitHub, the EC2 instance and the Step Functions execution are still running.

runner.log

Current runner version: '2.316.1'
2024-05-15 09:41:16Z: Listening for Jobs
2024-05-15 09:41:19Z: Running job: test_config
2024-05-15 10:05:49Z: Job test_config completed with result: Canceled
./run.cmd : An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View
permissions to perform the action.
At C:\Windows\system32\config\systemprofile\AppData\Local\Temp\EC2Launch988827203\UserScript.ps1:48 char:3
+   ./run.cmd 2>&1 | Out-File -Encoding ASCII -Append /actions/runner.l ...
+   ~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (An error occurr...orm the action.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

"Runner listener exit with retryable error, re-launch runner in 5 seconds."
"Restarting runner..."
        1 file(s) copied.

√ Connected to GitHub

Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner
registrations are automatically deleted for runners that have not connected to the service recently.
"Runner listener exit with terminated error, stop the service, no retry needed."
"Exiting runner..."

What's the problem:

The machine keeps running and we waste money until we notice it. (Yes, additional alerting would make sense in this case too, but I don't have it in place yet.)

Proposal:

It would be great to have a try/catch block around the action statement in PowerShell https://github.com/CloudSnorkel/cdk-github-runners/blob/f08da20f3fe70ae8fc86f85db304b15e191601f3/src/providers/ec2.ts#L165

to ensure the machine gets terminated https://github.com/CloudSnorkel/cdk-github-runners/blob/f08da20f3fe70ae8fc86f85db304b15e191601f3/src/providers/ec2.ts#L172
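
A minimal sketch of what this proposal could look like, written in TypeScript because the linked ec2.ts generates the PowerShell user data. This is not the library's actual code: the helper name buildWindowsUserData, the example action body, and the use of Stop-Computer as the termination step are all assumptions for illustration. The idea is only that a finally block runs whether or not the action function throws.

```typescript
// Hypothetical helper (not part of cdk-github-runners): builds a PowerShell
// user-data script where the shutdown step sits in a `finally` block, so it
// runs even when the action function raises a terminating error.
function buildWindowsUserData(actionLines: string[]): string {
  // Indent the caller-supplied lines so they sit inside the function body.
  const body = actionLines.map((line) => '  ' + line);

  const lines = ['<powershell>', 'function action () {']
    .concat(body)
    .concat([
      '}',
      'try {',
      '  action',
      '} catch {',
      '  # Surface the error in the user-data output for later diagnosis.',
      '  Write-Host "action failed: $_"',
      '} finally {',
      '  # Assumed termination step; runs on success and on failure.',
      '  Stop-Computer -ComputerName localhost -Force',
      '}',
      '</powershell>',
    ]);

  return lines.join('\n');
}

// Usage: the action body here is a placeholder, not the real runner script.
const script = buildWindowsUserData(['./run.cmd']);
console.log(script);
```

With this shape, an error inside action is written to the user-data output and the instance still shuts down. Whether the Step Functions execution is also notified in that case is a separate question that this sketch does not address.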

kichik commented 2 months ago

I'm not a PowerShell expert, but I do believe we are already doing that. Are you sure these are the logs of the right instance? It looks like the log of a runner that the idle reaper terminated. In that case, the step function execution should have been aborted as well.

pharindoko commented 2 months ago

Yes, I'm very sure that it's the right instance. We were able to replicate the issue by running the same job again. It's clear that we should fix this job anyway, but it would still be nice to see the machine stopped even when an error occurs in the action function.

Try catch in powershell: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_try_catch_finally?view=powershell-7.4

kichik commented 2 months ago

Would you be able to pull up the user data log from that machine so I can better understand what exactly failed there? It should be in C:\ProgramData\Amazon\EC2-Windows\Launch\Log\UserdataExecution.log. As far as I understand PowerShell, executing a script (like run.cmd executed by action()) doesn't raise exceptions. Either way, I'd like to both fix the error and possibly add try/catch.

pharindoko commented 2 months ago

I couldn't find the UserdataExecution.log ...

AWS mentions it here:

You can't find the user data logs

The log files for EC2Launch, EC2Launch v2, and EC2Config contain the output from the standard output and standard error streams. You can access the log files at the following locations:

    EC2Launch v2: C:\ProgramData\Amazon\EC2Launch\log\agent.log
    EC2Launch: C:\ProgramData\Amazon\EC2-Windows\Launch\Log\UserdataExecution.log
    EC2Config: C:\Program Files\Amazon\Ec2ConfigService\Logs\Ec2ConfigLog.txt

I guess we use EC2Launch v2, and I found the agent.log. I will provide it to you...