CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners
https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/
Apache License 2.0

Runner not terminated after cancellation of job #537

Closed pharindoko closed 3 weeks ago

pharindoko commented 3 weeks ago

Hey,

Sometimes a job is cancelled but the runner is not terminated. I get a warning on the job in the UI:

build
Runner xxxxxx did not respond to a cancelation request with 00:05:00.

The Step Functions execution and the runner itself are still in progress and think there is an ongoing job. That's bad because the EC2 instances keep running without doing anything.

Does anybody else have this issue? Does anyone have a solution to fix this behaviour?

kichik commented 3 weeks ago

That's strange. Do you have more insight on what the instance is doing? Did a job actually start on it? The fact that cancellation is being attempted suggests a job did start.

Even if a job started, eventually the actions runner itself should time out as well (and then terminate the instance). Maybe it will have something useful in its logs once it does?

And if a job wasn't started, the runner should be deleted by the idle reaper. At that point, the runner will stop on the instance and terminate itself.

The only guess I have so far is that the instance ran out of memory, started thrashing swap space, and therefore wasn't able to respond to the GitHub server, causing the cancellation request to time out.
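One way to test the out-of-memory guess is to look at memory and swap usage on the instance while it is stuck. The following is only a rough sketch, assuming a Linux runner where /proc/meminfo is readable. The function names and both thresholds (5% available memory, 50% swap in use) are made up for illustration and are not part of cdk-github-runners or the actions runner.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_pressure(info, min_available_ratio=0.05, max_swap_used_ratio=0.5):
    """Return True if the host looks starved for memory.

    Both thresholds are arbitrary assumptions; tune them for your instances.
    """
    available_ratio = info["MemAvailable"] / info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    if swap_total:
        swap_used_ratio = (swap_total - info.get("SwapFree", 0)) / swap_total
    else:
        swap_used_ratio = 0.0  # no swap configured
    return (available_ratio < min_available_ratio
            or swap_used_ratio > max_swap_used_ratio)
```

On the instance, this could be called as memory_pressure(parse_meminfo(open("/proc/meminfo").read())). A True result while the runner is unresponsive would support the thrashing theory. A False result would point elsewhere.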

pharindoko commented 3 weeks ago

Good hint. Switched to an instance with more memory.

I will add an additional alert to see when instances idle for too long (e.g. for an hour). If that happens more often, I will check whether the related jobs were cancelled and the instance can be terminated.
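A rough sketch of what such a check could look like, for anyone wanting something similar. It assumes instance records shaped like the entries boto3's describe_instances returns (InstanceId, LaunchTime, State.Name). The function name and the one-hour default are hypothetical. Note that it measures uptime, not true idleness, so it only works as a proxy when jobs are expected to finish well within the limit.

```python
from datetime import datetime, timedelta, timezone


def find_long_running(instances, now, max_age=timedelta(hours=1)):
    """Return the IDs of running instances launched more than max_age ago.

    `instances` is a list of dicts with the InstanceId, LaunchTime and State
    keys, as found under Reservations[].Instances[] in a describe_instances
    response. Uptime is used as a stand-in for "idle too long" (assumption).
    """
    stale = []
    for inst in instances:
        if inst["State"]["Name"] != "running":
            continue  # stopped or terminated instances are not a concern
        if now - inst["LaunchTime"] > max_age:
            stale.append(inst["InstanceId"])
    return stale


# Hypothetical sample data; a real check would fetch this from the EC2 API.
sample_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample_instances = [
    {"InstanceId": "i-old", "State": {"Name": "running"},
     "LaunchTime": sample_now - timedelta(hours=3)},
    {"InstanceId": "i-new", "State": {"Name": "running"},
     "LaunchTime": sample_now - timedelta(minutes=10)},
]
stale_ids = find_long_running(sample_instances, sample_now)
```

The IDs returned could then feed whatever alerting is already in place. Cross-checking them against cancelled jobs before terminating anything, as described above, remains a manual step in this sketch.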

kichik commented 3 weeks ago

FYI #518 will cause SSM to terminate the instance instead of the instance terminating itself. That might help here too.