actions / runner-container-hooks

Runner Container Hooks for GitHub Actions

Job Steps Incorrectly Marked as Successful #165

Open israel-morales opened 1 month ago

israel-morales commented 1 month ago

Hello,

I'll apologize in advance: the error is inconsistent and I cannot reproduce it on demand.

With the Kubernetes runner hooks, we have experienced some job steps incorrectly being marked as successful. This behavior is unexpected and has led to issues with our dev pipelines.

The two screenshots I have attached show the issue clearly. You can see that the output of the workflow pod is cut off and the step is immediately marked as successful.

(Screenshots attached: jobfailsuccess, jobfailedsucessfully)

Again, this only occurs occasionally and it's not clear what the underlying issue is. The problem doesn't seem limited to a specific job, nor does it appear to be load-based.

Any guidance into how we can further troubleshoot or prevent this issue would be appreciated, thank you!

chart version: gha-runner-scale-set-0.9.0
values: values-gha.txt

nikola-jokic commented 1 month ago

Hey @israel-morales,

This one is a tough one... I'll try my best to figure out what is happening, and I'll update you on the progress. Sorry for the delay.

nikola-jokic commented 4 weeks ago

Hey @israel-morales,

Can you please let me know if you are still seeing this issue on ARC 0.9.2? I'm wondering if the source of the issue was a controller bug that caused the runner container to shut down before it executed the job. If you are still seeing the issue, can you please provide the runner log?

Unfortunately, I failed to reproduce it. I tried killing the workflow container, killing the command within the workflow container, and killing the child command. I couldn't find a repro, but I'm wondering if ARC issues in the 0.9.0 release caused this behavior.

israel-morales commented 1 week ago

@nikola-jokic We have seen the issue occur on ARC 0.9.2. We noticed that killing the pods or processes, or even inducing OOM, elicits a proper response from ARC and the runners; the issue in question is caused by something else.

We did manage to capture logs during one of these events, which I'll attach for your review.

(Attachments: screenshot, runner.log)

The step ends with: "Finished process 100 with exit code 0"

Let me know if there is anything else I can do to help determine the cause.

genesis-jamin commented 6 days ago

Another example on 0.9.3 (copied from https://github.com/actions/actions-runner-controller/issues/3578):

(screenshot attached)

Runner logs: https://gist.github.com/genesis-jamin/774d115df441c3afdd755f73a3c499dc

Grep the logs for "Finished process 170 with exit code 0" to see where the sleep 6000 step ends.

genesis-jamin commented 5 days ago

@nikola-jokic which version of k8s have you been testing with? We've seen this error on 1.28 and 1.29.

EDIT: We see this on 1.30 as well.

genesis-jamin commented 4 days ago

Someone on the pytest-xdist repo mentioned that this could be related to the k8s exec logic: https://github.com/pytest-dev/pytest-xdist/issues/1106#issuecomment-2225875354
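
If the exec path is indeed the culprit, this is a plausible shape of the failure. Below is a minimal sketch (not the hook's actual code, and `runStep` plus its parameters are hypothetical) assuming a step is driven through the exec API of @kubernetes/client-node. The point it illustrates: the step's outcome comes from the V1Status frame delivered via the exec status callback, so if the underlying WebSocket closes before that frame arrives, a handler that never checks whether a status was received could fall through and report success.

```typescript
// Sketch of an exec-based step runner (illustrative only, not the hook's code).
import * as k8s from '@kubernetes/client-node'

async function runStep(
  namespace: string,
  podName: string,
  containerName: string,
  command: string[]
): Promise<number> {
  const kc = new k8s.KubeConfig()
  kc.loadFromDefault()
  const exec = new k8s.Exec(kc)

  return new Promise<number>((resolve, reject) => {
    let statusSeen = false

    exec
      .exec(
        namespace,
        podName,
        containerName,
        command,
        process.stdout, // step stdout
        process.stderr, // step stderr
        null,           // no stdin
        false,          // no tty
        status => {
          // The only reliable completion signal: a V1Status frame on the
          // exec status channel.
          statusSeen = true
          if (status.status === 'Success') {
            resolve(0)
          } else {
            reject(new Error(`step failed: ${status.message ?? 'unknown'}`))
          }
        }
      )
      .then(ws => {
        // If the connection drops (node/kubelet hiccup, proxy idle timeout),
        // the status callback never fires. Resolving with 0 here would
        // reproduce "step marked successful"; a correct implementation has
        // to treat a close-without-status as a failure instead.
        ws.on('close', () => {
          if (!statusSeen) {
            reject(new Error('exec connection closed without a status frame'))
          }
        })
      })
      .catch(reject)
  })
}
```

Under that hypothesis, a long-running step like sleep 6000 is exactly where an idle exec connection might get dropped, which would match the symptom of the output being cut off and the step still being reported as successful.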