actions / runner-images

GitHub Actions runner images

An error occurred while provisioning resources (Error Type: Disconnect). #3517

Closed: alexlamsl closed this issue 1 year ago

alexlamsl commented 3 years ago

Description
Jobs on macOS would intermittently fail without logs; the error message in the title appears only some of the time.

Here is a list of failed jobs over the past three days:
https://github.com/mishoo/UglifyJS/runs/2730669944?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2719886305?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2718446501?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2716605609?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2712627146?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2711652206?check_suite_focus=true

Not sure if related, but at a lower frequency I have also encountered jobs being reported as cancelled: https://github.com/mishoo/UglifyJS/runs/2706621544?check_suite_focus=true

Area for Triage: Deployment/Release

Question, Bug, or Feature?: Bug

Virtual environments affected

Image version
Current runner version: '2.278.0'

Expected behavior
Jobs complete with viewable logs.

Actual behavior
Missing logs, even with View raw logs:

2021-05-31T14:43:02.9964115Z ##[section]Starting: Request a runner to run this job
2021-05-31T14:43:03.4321453Z Can't find any online and idle self-hosted runner in current repository that matches the required labels: 'macos-latest'
2021-05-31T14:43:03.4321551Z Can't find any online and idle self-hosted runner in current repository's account/organization that matches the required labels: 'macos-latest'
2021-05-31T14:43:03.4321605Z Can't find any online and idle hosted runner in current repository's account/organization that matches the required labels: 'macos-latest'
2021-05-31T14:43:03.4321706Z Found online and busy hosted runners in current repository's account/organization that matches the required labels: 'macos-latest'. Waiting for one of them to get assigned for this job.
2021-05-31T14:43:03.6599602Z ##[section]Finishing: Request a runner to run this job

Repro steps

  1. Watch the scheduled workflow spawn.
  2. Occasionally a macOS job fails with missing logs.
miketimofeev commented 3 years ago

thanks @alexlamsl for creating a separate issue. Am I right that it is enough to simply fork https://github.com/mishoo/UglifyJS/ and run the following workflow to reproduce the problem? https://github.com/mishoo/UglifyJS/blob/master/.github/workflows/ufuzz.yml

alexlamsl commented 3 years ago

Yes, forking the repository and letting the aforementioned workflow run should reproduce the (intermittent) issue.

You may want to edit out the Linux & Windows jobs to lighten the load, since they don't exhibit the same issues; see the sketch below.

Please be advised that the job may sometimes fail due to the fuzzer hitting a false positive, but those failures are distinctly different from this issue because their logs are present.
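
For illustration, a trimmed workflow keeping only the macOS job might look like this (a hypothetical sketch; the real ufuzz.yml defines its own schedule, job names, and fuzzing command):

on:
  schedule:
    - cron: '0 */6 * * *'    # hypothetical schedule; use whatever ufuzz.yml specifies
jobs:
  fuzz-macos:
    runs-on: macos-latest    # keep only the macOS job; drop the ubuntu/windows entries
    steps:
      - uses: actions/checkout@v2
      - run: ./run-fuzzer.sh # placeholder for the actual fuzzing command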

alexlamsl commented 3 years ago

Another bunch of recent samples:
https://github.com/mishoo/UglifyJS/runs/2763713274?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2760713866?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2760250084?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2757930927?check_suite_focus=true
https://github.com/mishoo/UglifyJS/runs/2757074905?check_suite_focus=true

Darleev commented 3 years ago

@alexlamsl thank you for the provided examples; we are investigating the issue on our side to determine the exact reason for this behavior. So far I have found only one thing: the tests consume a lot of CPU resources on the macOS machines, which possibly leads the pipeline to fail. We need more time to find the root cause of this particular situation. I'll keep you informed.

alexlamsl commented 3 years ago

Not sure if related, but just now I've hit an instance of this on Windows: https://github.com/mishoo/UglifyJS/runs/2821517386?check_suite_focus=true

miketimofeev commented 3 years ago

@alexlamsl thanks for the update! Windows is a pretty different story, so it's not related. Speaking about macOS: we've narrowed down the list of environments with issues, but unfortunately we are still searching for the root cause.

alexlamsl commented 3 years ago

Not sure if this helps, but this failed job contains some information under View raw logs: https://github.com/mishoo/UglifyJS/runs/3003410939?check_suite_focus=true

And at a glance, it seems to have got "cancelled".

code4break commented 3 years ago

Hi,

I have the same issue with macOS 11.

miketimofeev commented 3 years ago

@FireFighter80 do you have access to the macOS-11 pipeline? Just to make sure it's not an access issue.

code4break commented 3 years ago

@miketimofeev Thanks, you're right. That was the issue.

alexlamsl commented 2 years ago

@miketimofeev has this issue been resolved?

I am still getting a steady stream of these job failures, on a daily basis over the past week.

miketimofeev commented 2 years ago

@alexlamsl we haven't heard of any new cases so far, which is why we decided to close. Could you provide some new examples of such builds so I can escalate the issue?

alexlamsl commented 2 years ago

Ones that are immediately relevant:
https://github.com/mishoo/UglifyJS/actions/runs/2306997233
https://github.com/mishoo/UglifyJS/actions/runs/2296601772
https://github.com/mishoo/UglifyJS/actions/runs/2284012926

Others that fail unexpectedly, not sure if related:
https://github.com/mishoo/UglifyJS/actions/runs/2281443778
https://github.com/mishoo/UglifyJS/actions/runs/2327083565
https://github.com/mishoo/UglifyJS/actions/runs/2300065092

P.S. For the past few days I have been hitting the Angry Unicorn error page ~5% of the time when loading any Actions-related URLs.

alexlamsl commented 2 years ago

I can replicate this on my fork as well:
https://github.com/alexlamsl/UglifyJS/actions/runs/2310731271
https://github.com/alexlamsl/UglifyJS/actions/runs/2310324508
https://github.com/alexlamsl/UglifyJS/actions/runs/2309311098
https://github.com/alexlamsl/UglifyJS/actions/runs/2307865027
https://github.com/alexlamsl/UglifyJS/actions/runs/2264485393

Others:
https://github.com/alexlamsl/UglifyJS/actions/runs/2277862274
https://github.com/alexlamsl/UglifyJS/actions/runs/2263136893

(ran into 🦄🦄🦄 while looking for these)

miketimofeev commented 2 years ago

@alexlamsl thanks, will engage the engineering team

anthony-c-martin commented 2 years ago

We're seeing the same issue very frequently with our Windows builds. There are no detailed errors to indicate what went wrong: https://github.com/Azure/bicep/actions/runs/2633252423

Summary view:

[screenshot]

Trying to see logs for an individual job:

[screenshot]

@miketimofeev - any update on this?

niehusstaab commented 2 years ago

Just to chime in here: this is plaguing one of my repos as well, but on the ubuntu-latest image. Sadly it's not public, so I can't share any links, but the affected workflow always fails around the 30-minute mark with either the error:

An error occurred while provisioning resources (Error Type: Disconnect).
Received request to deprovision: The request was cancelled by the remote provider.

or

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
Process completed with exit code 143.

The workflow is just a simple action running npm quicktype. Occasionally I will get some logs (as opposed to the typical absence of logs on the failing step that ran for 30 minutes), but they only ever contain Killed\n. This has been happening for the past 4 months.

miketimofeev commented 2 years ago

@niehusstaab even links to private repos will help, as we don't need access to your repo to get the telemetry for the run and see if there was high CPU usage or something like that. Most likely that is the root cause.

anthony-c-martin commented 1 year ago

We're seeing the same issue very frequently with our Windows builds. There are no detailed errors to indicate what went wrong: https://github.com/Azure/bicep/actions/runs/2633252423 ... @miketimofeev - any update on this?

To circle back: by chance I discovered that one of our tests was eating up a LOT of system memory, and this issue stopped occurring once I fixed it. It would have been super helpful if this information could have been communicated somehow through the workflow logs; it would have saved a lot of time spent debugging.
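
For reference: exit code 143 is 128 + 15, i.e. the process received SIGTERM (matching the "runner has received a shutdown signal" message earlier in the thread), while a bare Killed line usually means the kernel OOM killer sent SIGKILL; both point at memory pressure. When only a step is killed and the runner itself survives, a background sampler can capture the climb. A hypothetical sketch (trigger, step names, sampling interval, and the test command are all illustrative, not an official feature):

on: [push]   # hypothetical trigger
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Start memory sampler
        run: |
          # Append a timestamped snapshot of free memory every 10 seconds;
          # keeps running in the background for the lifetime of the job.
          ( while true; do date; free -m; sleep 10; done ) >> memory.log 2>&1 &
      - name: Run tests
        run: npm test   # placeholder for the real workload
      - name: Upload memory log
        if: always()    # runs even when the previous step failed
        uses: actions/upload-artifact@v3
        with:
          name: memory-log
          path: memory.log

Note that if the whole VM is deprovisioned mid-job, the upload step never runs, so this only helps in the OOM-killed-step case.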

BryceStevenWilley commented 1 year ago

If links to repos are still useful, here's a public action that failed with this specific error: https://github.com/SuffolkLITLab/ALActions/actions/runs/3913807100, running on ubuntu-latest.

It's a really lightweight action that only runs ~20 lines of Beautiful Soup Python on a small webpage, and normally finishes in < 30 seconds, so I'm fairly confident it wouldn't be eating up memory or using a lot of CPU. The latest failing job took 26 minutes, but there aren't any logs to see what it was doing in that time.

enjoy-binbin commented 1 year ago

What is the status here? We have encountered similar problems; for more information, see #7004.

prein commented 1 year ago

+1 (ubuntu). One thing that would likely be useful is a way to retrieve the missing logs (see @anthony-c-martin's comment above).

erik-bershel commented 1 year ago

Since virtually all of the cases mentioned here are related to resource consumption beyond what the runners can provide, I am forced to close this request.

About logs: it is not possible at the moment to publish provisioner logs, for a number of reasons, including security.

For the curious and for new arrivals: I recommend the discussion at https://github.com/actions/runner-images/discussions/7188, which has a lot of information from users encountering the same problem for various reasons.