actions / runner-images

GitHub Actions runner images
MIT License

Post Run actions/checkout@v4 failed randomly #10609

Closed korrem closed 3 weeks ago

korrem commented 1 month ago

Description

Hi,

For at least two months we have noticed a problem that occurs randomly in our nightly runs. Sometimes the last step, Post Run actions/checkout@v4, takes a very long time, up to 15 minutes, after which the workflow is either skipped or failed.

For skipped runs we get the error message Hosted runner encountered an error while running your job. (Error type: Disconnect). An example can be found here - https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10765405236

For failed runs we get the error message Hosted runner: GitHub Actions 94 has lost communication with the server. Anything in the workflow that terminates the runner's process, deprives it of CPU/memory or blocks network access can cause this error. Here you can see an example - https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10712726268.

We have added a step in which we monitor CPU and RAM consumption. However, so far the highest CPU consumption has been a maximum of 10% and the available RAM is around 6GB after the tests have been completed. Here you can see our workflow file -> https://github.com/IMGARENA/multisport-fastpath-scoring-app/blob/develop/.github/workflows/run-e2e-tests.yml and workflow for nightly https://github.com/IMGARENA/multisport-fastpath-scoring-app/blob/develop/.github/workflows/nightly-e2e-tests-without-comparator.yml.
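For readers who want to reproduce this kind of monitoring, a minimal sketch of such a step is shown here. It is not the step from the linked workflow; the step name is hypothetical, and it assumes a Linux hosted runner where `free`, `uptime` and `ps` are available.

```yaml
# Hypothetical monitoring step; assumes a Linux runner (e.g. ubuntu-latest).
- name: Report CPU and memory usage
  if: always()  # run even if an earlier step failed
  run: |
    echo "--- memory (MiB) ---"
    free -m
    echo "--- load average ---"
    uptime
    echo "--- top CPU consumers ---"
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10
```

Placing it as the last regular step gives a snapshot of the runner shortly before the post-run steps begin.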

Could you be so kind as to help us resolve this issue?

Platforms affected

Runner images affected

Image version and build link

Version: 20240908.1.0

Is it regression?

https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10821818651

Expected behavior

Post Run actions/checkout@v4 step shouldn't take so much time and should finish successfully

Actual behavior

The Post Run actions/checkout@v4 step at the end of the workflow sometimes takes up to 15 minutes, after which the whole workflow fails or is skipped.

Repro steps

  1. Go to https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/workflows/nightly-e2e-tests.yml
  2. Click on the Run workflow button
  3. Select the develop branch
  4. Click on Run workflow

hemanthmanga commented 1 month ago

Hi @korrem, thank you for bringing this issue to us. We are looking into it and will update you after investigating.

Prabhatkumar59 commented 1 month ago

Hi @korrem - I am unable to open the URLs you provided; they return a 404 error. However, from your description, the issue you are experiencing with the Post Run actions/checkout@v4 step, which randomly takes a long time or fails due to runner disconnection, could be related to various factors such as runner resource limitations, network instability, or GitHub service issues.

Here are some recommendations to help mitigate the problem:

A. You can add a retry mechanism around the actions/checkout@v4 step to handle random failures. GitHub Actions supports continue-on-error on a step, which prevents the job from failing outright; there is no built-in retry key, so a retry has to be a second, conditional step or a third-party action.

- name: Checkout Code
  uses: actions/checkout@v4
  with:
    fetch-depth: 0
  continue-on-error: true
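
A minimal sketch of a manual retry, assuming a second checkout attempt is acceptable: the first attempt is allowed to fail, and the second runs only if the first one's `outcome` was `failure`. The step names and the `checkout_first` id are hypothetical. Note that this covers the main checkout step only, not its Post Run phase.

```yaml
# Hypothetical two-attempt checkout; ids and names are illustrative.
- name: Checkout code (first attempt)
  id: checkout_first
  uses: actions/checkout@v4
  continue-on-error: true  # do not fail the job if this attempt fails

- name: Checkout code (second attempt)
  # outcome reflects the result before continue-on-error is applied
  if: steps.checkout_first.outcome == 'failure'
  uses: actions/checkout@v4
```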

B. You can also try to check if the runner timeout is set too aggressively. Increasing the runner timeout might prevent early termination.

timeout-minutes: 30  # Example to increase timeout if needed

C. Adding to this, if the issue persists and is critical, consider using a self-hosted runner with more control over resource allocation and network stability. This might avoid disconnection errors.
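
As a rough illustration of where `timeout-minutes` from recommendation B can be placed, the sketch below sets it at both job and step level. The job name, runner label and test command are assumptions, not taken from the linked workflow. The default job limit is 360 minutes, so this only raises the limit if a lower value was configured.

```yaml
jobs:
  e2e-tests:                  # hypothetical job name
    runs-on: ubuntu-latest    # assumption; use the runner label from your workflow
    timeout-minutes: 30       # job-level limit
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        timeout-minutes: 20   # step-level limit for this step only
        run: npm test         # hypothetical test command
```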

D. Git Shallow Clone: To reduce the time spent in the checkout step, ensure that you're not fetching unnecessary history.

- uses: actions/checkout@v4
  with:
    fetch-depth: 1  # Fetch only the latest commit

E. Also, since the error mentions loss of communication with the server, add network-related logging or monitoring to see whether spikes in network latency or drops might be affecting the workflow.
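
One possible shape for such network logging is a step that times a few requests with `curl`. This is only a sketch: the step name, the target URL (https://api.github.com), the number of attempts and the sleep interval are all assumptions, and it presumes a Linux runner with `curl` installed.

```yaml
# Hypothetical network probe; target URL and attempt count are illustrative.
- name: Log network latency to GitHub
  if: always()  # run even if an earlier step failed
  run: |
    for i in 1 2 3; do
      curl -s -o /dev/null \
        -w "attempt $i: http=%{http_code} connect=%{time_connect}s total=%{time_total}s\n" \
        https://api.github.com || echo "attempt $i: request failed"
      sleep 5
    done
```

Running it near the end of the job gives a latency reading close to the point where the disconnect is reported.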

Hopefully, these changes should help improve the stability of the actions/checkout@v4 step.

korrem commented 1 month ago

Prabhatkumar59 thanks for your message. I'll try options A and D first, and the rest if they don't help. I will let you know whether it helped.

Prabhatkumar59 commented 1 month ago

Hi @korrem - Sure, let me know. Hopefully the changes I suggested will help improve the stability.

Prabhatkumar59 commented 3 weeks ago

Hi @korrem - Since we haven't heard back, we'll assume your issue is resolved and will close this issue for now. Feel free to reach out to us for any other queries. Thanks.

korrem commented 1 week ago

Hi Prabhatkumar59, apologies for the long wait for the results. Unfortunately, none of your advice helped.

The strange thing is that all steps pass except the last one (Post Run actions/checkout@v4), but the logs look as if the hosted runner never started the job at all. Logs: [image]. Workflow screenshot: [image].

I would be grateful for any further help.