actions / runner-images

GitHub Actions runner images

The runner has received a shutdown signal. #6709

Closed JakubMosakowski closed 1 year ago

JakubMosakowski commented 1 year ago

Description

Since yesterday, our GitHub Actions builds have started to fail randomly (we didn't change anything in our configuration). The error is unfortunately not very precise.

The process is stopped at random stages of the build (but always after at least 15 minutes or so). Even when the build passes, it takes much longer than before (a clean build went from ~25 min to ~35 min).

2022-12-07T10:18:10.5771753Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2022-12-07T10:18:10.7098386Z ##[error]The operation was canceled.
2022-12-07T10:18:10.7710701Z Cleaning up orphan processes
2022-12-07T10:18:10.8338404Z Terminate orphan process: pid (1849) (java)

Sometimes, before the shutdown signal, there is also this log line: "Idle daemon unexpectedly exit. This should not happen."

The workflow passes normally on shorter builds (for example, those that hit the cache).

Platforms affected

Runner images affected

Image version and build link

Image: ubuntu-22.04
Version: 20221127.1
Current runner version: '2.299.1'

Unfortunately, it happens on a private repo.

Is it a regression?

No

Expected behavior

Job should pass

Actual behavior

Job fails

Repro steps

Looks similar to: https://github.com/actions/runner-images/issues/6680

erik-bershel commented 1 year ago

@JakubMosakowski we cannot do any investigation without additional info. I see that your machine got the shutdown signal. Most often, this means that the resources consumed by the process exceeded the limits. We can theoretically check whether this is the case, but we need to see an example of the pipeline that caused the outage and links to the failed runs, even if they belong to a private repository.
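
One rough way to test that theory without access to the repository is to log resource usage from inside the job and compare a green run with a red one. A minimal diagnostic sketch (the step names, the file under $RUNNER_TEMP and the 30-second interval are illustrative, not something prescribed in this thread):

    - name: Start background resource logging (diagnostic only)
      shell: bash
      run: |
        # Sample free memory, swap and load average every 30 seconds into a file,
        # so the numbers from just before a shutdown are still available afterwards.
        nohup bash -c 'while true; do { date -u; free -m; uptime; echo; } >> "$RUNNER_TEMP/resources.log"; sleep 30; done' >/dev/null 2>&1 &

    - name: Print resource log
      if: always()
      shell: bash
      run: cat "$RUNNER_TEMP/resources.log" || true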

JakubMosakowski commented 1 year ago

Sure.

Examples of failing builds:
https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638760141
https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638651891
https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638619301
https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638519339
https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638515786

JakubMosakowski commented 1 year ago

The interesting part is that it doesn't seem to be related to any of our changes. I created a branch that reverts the last X commits (back to a point in history where our builds were smooth), and those builds are still failing.

mvarrieur commented 1 year ago

We are also seeing this after upgrading our self-hosted runners from 20.04 to 22.04, with no other seemingly related changes. Do the 22.04 runner images have more conservative limits even when used self-hosted?

ihor-panasiuk95 commented 1 year ago

The same happens to us in a private repo. Builds started to fail randomly with the same shutdown-signal error.

We didn't make any significant changes to our workflows.

erik-bershel commented 1 year ago

Hi @ihor-panasiuk95, please send me links to workflow runs with both positive and negative results.

ihor-panasiuk95 commented 1 year ago

@erik-bershel will you be able to view them, given that they are in a private repo?

erik-bershel commented 1 year ago

@ihor-panasiuk95 it's not a problem. As a first step, there is no need to look at what is going on inside your private repository. I want to check the load on the agents and compare successful and failed jobs. If that information is not enough, then we will discuss repro steps. For example: https://github.com/owner/repo/actions/runs/runID or https://github.com/erik-bershel/erik-tests/actions/runs/3680567148.

ihor-panasiuk95 commented 1 year ago

@erik-bershel
Negative - https://github.com/anecdotes-ai/frontend/actions/runs/3742047670
Positive (I replaced ubuntu-latest with ubuntu-22.04 and it started to work) - https://github.com/anecdotes-ai/frontend/actions/runs/3748531101

qhy040404 commented 1 year ago

I find that this issue only occurs when using ubuntu-latest, which currently means ubuntu-22.04. It doesn't happen when using ubuntu-20.04.
Negative - https://github.com/qhy040404/LibChecker/actions/runs/3747087072
Positive - https://github.com/qhy040404/LibChecker/actions/runs/3748900876/jobs/6366765819
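
For anyone who wants to try the same workaround, pinning the image is a one-line change in the workflow. A minimal sketch (the job id is illustrative):

    jobs:
      build:
        # Pin the image instead of ubuntu-latest (which currently resolves to ubuntu-22.04)
        runs-on: ubuntu-20.04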

aronatkins commented 1 year ago

We are also seeing these errors regularly. Link to one of our most recent runs: https://github.com/rstudio/connect/actions/runs/3912304431/jobs/6687076068

The output from the job (a docker compose build) is also highly repetitive; I don't believe we had seen that phenomenon before these exit code 143 termination problems.

eygraber commented 1 year ago

I've been seeing similar issues where I either get "The runner has received a shutdown signal." or some of my processes just never start (using Gradle and Kotlin: the Gradle daemon starts, but the Kotlin daemon never starts).

isaactpetersen commented 1 year ago

I just recently began experiencing this issue. I have never experienced it before.

Here's the error I receive:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
Process completed with exit code 143.

Here's a link to one of our recent runs: https://github.com/DevPsyLab/DataAnalysis/actions/runs/3964736564/attempts/2

chrisui commented 1 year ago

We are seeing this issue consistently on PR/branch workflows, at the step that runs configure-aws-credentials on ubuntu-latest-4-cores. The weird thing is that we have an identical step in a different workflow (i.e. our CD) with the exact same runner, and it succeeds. Debug logs below. It looks like the job is killed consistently, immediately upon starting to execute this step.

##[debug]Evaluating condition for step: 'Run aws-actions/configure-aws-credentials@master'
##[debug]Evaluating: success()
##[debug]Evaluating success:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Run aws-actions/configure-aws-credentials@master
##[debug]Register post job cleanup for action: aws-actions/configure-aws-credentials@master
##[debug]Loading inputs
##[debug]Evaluating: secrets.DEPLOY_PREVIEW_ROLE
##[debug]Evaluating Index:
##[debug]..Evaluating secrets:
##[debug]..=> Object
##[debug]..Evaluating String:
##[debug]..=> 'DEPLOY_PREVIEW_ROLE'
##[debug]=> '***'
##[debug]Result: '***'
##[debug]Evaluating: env.AWS_DEFAULT_REGION
##[debug]Evaluating Index:
##[debug]..Evaluating env:
##[debug]..=> Object
##[debug]..Evaluating String:
##[debug]..=> 'AWS_DEFAULT_REGION'
##[debug]=> 'eu-west-[2](https://github.com/gaia-family/monorepo/actions/runs/4024185164/jobs/6943228995#step:5:2)'
##[debug]Result: '***'
##[debug]Loading env
Run aws-actions/configure-aws-credentials@master
##[debug]Re-evaluate condition on job cancellation for step: 'Run aws-actions/configure-aws-credentials@master'.
##[debug]Skip Re-evaluate condition on runner shutdown.
Error: The operation was canceled.
##[debug]System.OperationCanceledException: The operation was canceled.
##[debug]   at System.Threading.CancellationToken.ThrowOperationCanceledException()
##[debug]   at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Worker.Handlers.NodeScriptActionHandler.RunAsync(ActionRunStage stage)
##[debug]   at GitHub.Runner.Worker.ActionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Run aws-actions/configure-aws-credentials@master

erik-bershel commented 1 year ago

Hey @chrisui! Please provide links to one successful and one unsuccessful run.

chrisui commented 1 year ago

Failed: https://github.com/gaia-family/monorepo/actions/runs/4042706837
Success: https://github.com/gaia-family/monorepo/actions/runs/4045407852

@erik-bershel thanks

codyseibert commented 1 year ago

We've been seeing a similar issue on our builds. Usually it's limited to our Cypress workflows.

2023-01-30T21:49:04.3960196Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2023-01-30T21:49:04.5435392Z ##[error]The operation was canceled.

No one is manually stopping the action, and it seems to happen randomly. We are on ubuntu-latest for our runners.

mustafaozhan commented 1 year ago

I confirm this happens to me as well

wax911 commented 1 year ago

+1 This happens at random for me; re-running the failed job usually works. I've only had this problem with pull request runs, not push-event runs.

chrisui commented 1 year ago

Just want to follow up and say this is happening consistently on a pull-request usage of the runner. The pull request itself is bumping to ubuntu-latest-4-cores over the standard runner. We have been using the same runner with practically the same workflow steps on main/tag-push with no issues for a week now.

And for us it's always the configure-aws-credentials step that immediately fails.

erik-bershel commented 1 year ago

Hey @chrisui, did the issue happen only with Large Runners in your case? Just to be clear.

chrisui commented 1 year ago

Hey @chrisui, did the issue happen only with Large Runners in your case? Just to be clear.

Exactly. The only change on the PR that's consistently failing is the change of runner.

erik-bershel commented 1 year ago

@wax911 @mustafaozhan @codyseibert Hi friends! May I ask you to share links to a couple of failed runs and a couple of green passes too? It will help us get to the root of the issue. You may share links to private repos - all we need is the links, not access.

wax911 commented 1 year ago

@erik-bershel

This has been happening frequently: the push trigger usually builds without any errors, but the pull request trigger has been failing lately, and most times just re-running the job resolves the issue. At first I thought the jobs were taking too long, so I added Gradle caches.

erik-bershel commented 1 year ago

@wax911 as I see it, in your case it might be a resource limit being hit. The last message from the runner before it was lost:

"PreciseTimeStamp": 2023-01-28T11:39:55.3878501Z, "Message": [signal]{"usage":{"process":"java","stddev":7.55626377842431,"cpu":85.77531585041434}}

As you can see, the Java process consumes almost all of the CPU time. But it is not a final diagnosis - we continue investigating the reasons for these incidents.
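
If a Java/Gradle process saturating the CPU is indeed what triggers the shutdown, one possible mitigation is to cap Gradle's parallelism for the CI build. A sketch only; the task name and worker count are illustrative:

    - name: Build with capped parallelism (illustrative)
      run: ./gradlew assembleDebug --max-workers=2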

wax911 commented 1 year ago

Thank you @erik-bershel 😃

kNoAPP commented 1 year ago

Hello, I started experiencing this exact issue today on one of my private repositories that has otherwise run just fine for over a year. No changes to the runner, though I did add a terraform command to one of the scripts it executes; I don't think this would have an impact.

I am the only one with access to this private repository. I did not cancel the Action. This is happening repeatedly about 4 minutes into the run.

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

The operation was canceled.

Error: Process completed with exit code 143.

Thanks, adding a watch to this thread in the meantime.

erik-bershel commented 1 year ago

Hi @kNoAPP, please provide links to both positive and negative runs. I cannot say anything specific about the cause without links.

rlinstorres commented 1 year ago

Hi @erik-bershel, I've got the same error in our workflows; as the runner we are using ubuntu-latest-4-cores. Basically, the job has these steps: clone -> aws-actions/configure-aws-credentials -> setup node -> setup cache -> npm commands -> docker/build-push-action

Job: https://github.com/FindHotel/daedalus/actions/runs/4125469274/jobs/7126107593

Thank you!

mustafaozhan commented 1 year ago

It turns out that my issue was a Gradle issue under the hood; they even have an open issue for this: https://github.com/gradle/gradle/issues/19750

Summary: if you already have org.gradle.jvmargs in your gradle.properties, you need to specify all of the JVM arguments explicitly, since setting the property replaces some of the default values.

I got the solution by adding this line:

org.gradle.jvmargs=-Xmx8g -XX:MaxMetaspaceSize=2g -XX:MaxPermSize=2g -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParallelGC -Dfile.encoding=UTF-8

Note: the important thing is not the specific values I provided but the fact that I supply all of the parameters, so you can change the values. Of course, not having this line at all is also a solution.

kNoAPP commented 1 year ago

I also found a temporary workaround. For me, running golangci-lint run -v in one of my build steps seems to be the culprit. Still unsure why it started happening now, of all times.
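
golangci-lint can be fairly memory- and CPU-hungry on large modules, so if that step is what trips the limit, one hedged mitigation is to reduce its parallelism and make the Go garbage collector more aggressive. A sketch only; the values are illustrative:

    - name: Run golangci-lint with reduced resource pressure (illustrative)
      env:
        GOGC: "50"   # collect garbage more often to lower peak memory
      run: golangci-lint run -v --concurrency 2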

mohsin-motorway commented 1 year ago

Facing the same issue on the following Gradle execution: gradle(task: "clean bundleStage --parallel")

Locally the Gradle command succeeds. The pipeline is failing only with the recent push, whose main change is updating the Android Gradle Plugin to 7.1.3. I am still using the following initialisation code to set up the environment:

name: Deploy to Firebase App Distribution [Stage]
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - uses: actions/checkout@v3
      - name: Set up JDK
        uses: actions/setup-java@v3
        with:
          distribution: 'zulu'
          java-version: 11
          check-latest: true
          cache: 'gradle'

      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.0' # Not needed with a .ruby-version file
          bundler-cache: true

Do I need to update my Java version to 17?

The detailed stack trace on the failure is as follows:

[16:51:07]: *** finished with errors
Error: The operation was canceled.
[debug]System.OperationCanceledException: The operation was canceled.
[debug]   at System.Threading.CancellationToken.ThrowOperationCanceledException()
[debug]   at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
[debug]   at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
[debug]   at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
[debug]   at GitHub.Runner.Worker.Handlers.ScriptHandler.RunAsync(ActionRunStage ***)
[debug]   at GitHub.Runner.Worker.ActionRunner.RunAsync()
[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
[debug]Finishing: Build and Deploy Android Stage with versionName=RW-354-not-shot

Any idea what could be going wrong?

mohsin-motorway commented 1 year ago

Issue resolved by adding the Gradle properties flags suggested by @mustafaozhan.

rlinstorres commented 1 year ago

Update!

I've fixed my issue by changing the runner from ubuntu-latest-4-cores to ubuntu-latest-16-cores and also increasing the value used for --max-old-space-size. Ref.: https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-megabytes
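
For reference, the Node heap limit can also be raised via the NODE_OPTIONS environment variable instead of editing individual npm scripts. A minimal sketch (the step, the npm script name and the 6144 MB value are illustrative):

    - name: Build
      env:
        # Raise V8's old-space heap limit so large builds are less likely to run out of memory
        NODE_OPTIONS: "--max-old-space-size=6144"
      run: npm run build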

chrisui commented 1 year ago

Following up: we've tried changing from ubuntu-latest-4-cores to a specified ubuntu-18-4-cores (defined as Ubuntu 18.04) and still see the same issue on our on: pull_request: types: [opened, synchronize] jobs. Worth noting that the jobs which succeed with the same runner config and workflow steps are all triggered on: push.

https://github.com/gaia-family/monorepo/actions/runs/4165179342/jobs/7207816997

Edit:

Tried the same config on a branch push and it failed the same way as pull_request opened/synced.

on:
  push:
    branches-ignore:
      - main
      - preview
      - production

https://github.com/gaia-family/monorepo/actions/runs/4165419445/attempts/1

chrisui commented 1 year ago

Hey @erik-bershel I think I've identified the root cause of the issue here. In the jobs that were failing we had this as the last run step of an action:

    # tidy up any background node processes if they exist
    - run: |
        killall node || true
      shell: bash

It would appear that there's some environmental inconsistency (perhaps even documented, but not obvious) which means the assumed sandbox we're running our steps in is not actually sandboxed as we might expect on these larger runners (compared to the standard runners). Presumably we end up killing a Node-based process that manages the GitHub runner job itself, and so the job appears as terminated.
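
If that is what is happening, a safer pattern is to kill only the processes the job itself started rather than every node process on the machine. A minimal sketch, assuming a hypothetical background script server.js started earlier in the same job:

    - name: Start background server (illustrative)
      shell: bash
      run: |
        # Redirect output and remember the PID so a later step can clean up precisely.
        node server.js > server.log 2>&1 &
        echo "SERVER_PID=$!" >> "$GITHUB_ENV"

    - name: Tidy up only the process we started
      if: always()
      shell: bash
      run: |
        # Kill just our own background process instead of `killall node`,
        # which can also take down node processes that belong to the runner itself.
        kill "$SERVER_PID" 2>/dev/null || true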

sunnyoswalcro commented 1 year ago

Seems like an issue in ubuntu-latest; tried ubuntu-20.04 and it passes consistently.