pharindoko closed 3 weeks ago
That's strange. Do you have more insight on what the instance is doing? Did a job actually start on it? The fact that cancellation is being attempted suggests a job did start.
Even if a job started, eventually the actions runner itself should time out as well (and then terminate the instance). Maybe it will have something useful in its logs once it does?
And if a job wasn't started, the runner should be deleted by the idle reaper. At that point, the runner will stop on the instance and terminate itself.
The only guess I have so far is that the instance ran out of memory, started thrashing swap space, and therefore wasn't able to respond to the GitHub server, causing the cancellation request to time out.
Good hint. Switched to an instance with more memory.
I will add an additional alert for instances that idle too long (e.g. for an hour). If that happens more often, I will check whether the related jobs are cancelled and the instance can be terminated.
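For anyone wanting something similar, here is a rough sketch of the check behind such an alert. It is only an illustration, not what the project does: `find_long_running` is a hypothetical helper, it approximates "idle" by plain uptime, and it assumes the input is a list of dicts with `InstanceId` and `LaunchTime` keys, as found in EC2 instance descriptions.

```python
from datetime import datetime, timedelta, timezone


def find_long_running(instances, now, max_age=timedelta(hours=1)):
    """Return the IDs of instances that have been up longer than max_age.

    Hypothetical helper: uptime is used as a stand-in for "idle too long".
    Each instance is a dict with "InstanceId" and a timezone-aware "LaunchTime".
    """
    return [
        inst["InstanceId"]
        for inst in instances
        if now - inst["LaunchTime"] > max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Made-up sample data for illustration only.
    sample = [
        {"InstanceId": "i-old", "LaunchTime": now - timedelta(hours=2)},
        {"InstanceId": "i-new", "LaunchTime": now - timedelta(minutes=10)},
    ]
    print(find_long_running(sample, now))
```

The returned IDs could then be cross-checked against the job status before deciding whether to terminate anything.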
FYI #518 will cause SSM to terminate the instance instead of the instance terminating itself. That might help here too.
Hey,
I sometimes have the case that a job is cancelled but the runner is not terminated. I get a warning sign on the job in the UI:
The step function and the runner itself are still "in progress" and think that there is an ongoing job. That's bad because the EC2 instances keep running without doing anything. Does somebody else have this issue? Does anyone have a solution to fix this behaviour?