JakubMosakowski closed this issue 1 year ago.
@JakubMosakowski we cannot do any investigation without additional info. I see that your machine got the shutdown signal. Most often, this means that the resources consumed by the process exceeded the limits. We can check whether this is the case, but we need to see an example of the pipeline that caused the outage and links to the failed runs, even if they belong to a private repository.
Sure.
Examples of failing builds: https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638760141 https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638651891 https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638619301 https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638519339 https://github.com/SpotOnInc/android-omnichannel/actions/runs/3638515786
The interesting part is that it doesn't seem to be related to any of our changes. I created a branch that reverts the last X commits (back to the point in history where our builds were smooth), and even those builds are not passing anymore.
We are also seeing this after upgrading our self-hosted runners from 20.04 to 22.04, with no other seemingly related changes. Do the 22.04 runners have more conservative limits even when self-hosted?
The same thing is happening to us in a private repo. Builds started to randomly fail with this error:
We didn't make any significant changes to our workflows.
Hi @ihor-panasiuk95, please send me links to workflow runs with both positive and negative results.
@erik-bershel will you be able to view them, taking into account that they are in a private repo?
@ihor-panasiuk95 it's not a problem. There is no need to check what is going on inside your private repository as a first step. I want to check the load on the agents and compare successful and failed jobs. If that information is not enough, then we will discuss repro steps. For example: https://github.com/owner/repo/actions/runs/runID or https://github.com/erik-bershel/erik-tests/actions/runs/3680567148.
@erik-bershel Negative - https://github.com/anecdotes-ai/frontend/actions/runs/3742047670 Positive (I replaced ubuntu-latest with ubuntu-22.04 and it started to work) - https://github.com/anecdotes-ai/frontend/actions/runs/3748531101
I find that this issue only occurs when using ubuntu-latest (which means ubuntu-22.04). However, it doesn't happen when using ubuntu-20.04.
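For anyone who wants to try that workaround, here is a minimal sketch that pins the image instead of relying on the floating ubuntu-latest label (the workflow name, job name, and build command are just placeholders):

```yaml
name: build
on: push

jobs:
  build:
    # Pin the image explicitly; ubuntu-latest currently resolves to ubuntu-22.04,
    # which is where the shutdown-signal (exit code 143) failures are being reported.
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      # Placeholder build step; replace with your project's real build command.
      - run: ./gradlew build
```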
Negative - https://github.com/qhy040404/LibChecker/actions/runs/3747087072
Positive - https://github.com/qhy040404/LibChecker/actions/runs/3748900876/jobs/6366765819
We are also seeing these errors regularly. Link to one of our most recent runs: https://github.com/rstudio/connect/actions/runs/3912304431/jobs/6687076068
The output from the job (a docker compose build) is also highly repetitive; I don't believe we had seen that phenomenon prior to these exit-143 termination problems.
I've been seeing similar issues where I either get "The runner has received a shutdown signal." or some of my processes just never start (using Gradle and Kotlin: the Gradle daemon starts, but the Kotlin daemon never starts).
I just recently began experiencing this issue. I have never experienced it before.
Here's the error I receive:
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
Process completed with exit code 143.
Here's a link to one of our recent runs: https://github.com/DevPsyLab/DataAnalysis/actions/runs/3964736564/attempts/2
We are seeing this issue consistently on PR/branch workflows at the step that runs aws-actions/configure-aws-credentials on ubuntu-latest-4-cores. The weird thing is that we have an identical step that runs in a different workflow (i.e. our CD) with the exact same runner, and it succeeds. Debug logs below. It looks like the job is killed consistently, immediately upon starting to execute this step.
##[debug]Evaluating condition for step: 'Run aws-actions/configure-aws-credentials@master'
##[debug]Evaluating: success()
##[debug]Evaluating success:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Run aws-actions/configure-aws-credentials@master
##[debug]Register post job cleanup for action: aws-actions/configure-aws-credentials@master
##[debug]Loading inputs
##[debug]Evaluating: secrets.DEPLOY_PREVIEW_ROLE
##[debug]Evaluating Index:
##[debug]..Evaluating secrets:
##[debug]..=> Object
##[debug]..Evaluating String:
##[debug]..=> 'DEPLOY_PREVIEW_ROLE'
##[debug]=> '***'
##[debug]Result: '***'
##[debug]Evaluating: env.AWS_DEFAULT_REGION
##[debug]Evaluating Index:
##[debug]..Evaluating env:
##[debug]..=> Object
##[debug]..Evaluating String:
##[debug]..=> 'AWS_DEFAULT_REGION'
##[debug]=> 'eu-west-2'
##[debug]Result: '***'
##[debug]Loading env
Run aws-actions/configure-aws-credentials@master
##[debug]Re-evaluate condition on job cancellation for step: 'Run aws-actions/configure-aws-credentials@master'.
##[debug]Skip Re-evaluate condition on runner shutdown.
Error: The operation was canceled.
##[debug]System.OperationCanceledException: The operation was canceled.
##[debug] at System.Threading.CancellationToken.ThrowOperationCanceledException()
##[debug] at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug] at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
##[debug] at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
##[debug] at GitHub.Runner.Worker.Handlers.NodeScriptActionHandler.RunAsync(ActionRunStage stage)
##[debug] at GitHub.Runner.Worker.ActionRunner.RunAsync()
##[debug] at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Run aws-actions/configure-aws-credentials@master
Hey @chrisui! Please provide links to one successful and one unsuccessful run.
Failed: https://github.com/gaia-family/monorepo/actions/runs/4042706837 Success: https://github.com/gaia-family/monorepo/actions/runs/4045407852
@erik-bershel thanks
We've been seeing this similar issue on our builds. Usually it's only related to our cypress workflows.
2023-01-30T21:49:04.3960196Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2023-01-30T21:49:04.5435392Z ##[error]The operation was canceled.
No one is manually stopping the action, and it seems to happen randomly. We are on ubuntu-latest for our runners.
I confirm this happens to me as well
+1. This happens at random for me; oftentimes re-running the failed job works. I've only had this problem with PR-triggered runs, not push-event runs.
Just want to follow up and say this is happening consistently on a pull-request usage of the runner. The pull request itself is bumping to ubuntu-latest-4-cores over the standard runner. We have been using the same runner with practically the same workflow steps on main/tag pushes with no issues for a week now.
And for us it's always the step with configure-aws-credentials that fails immediately.
Hey @chrisui, did the issue happen only with large runners in your case? Just to be clear.
Exactly. The only change on the PR that's consistently failing is the change of runner.
@wax911 @mustafaozhan @codyseibert Hi friends! May I ask you to share links to a couple of failed runs and a couple of green passes too? It will help us get to the root of the issue. You may share links to private repos; all we need is the links, not access.
@erik-bershel
This has been happening frequently: the push trigger often builds without any errors, but the pull request trigger has been failing lately, and most times just re-running the job resolves the issue. At first I thought the jobs were taking too long, so I added gradle-caches.
@wax911 As far as I can see, in your case it might be a resource limit being hit. The last message from the runner before it was lost: "PreciseTimeStamp": 2023-01-28T11:39:55.3878501Z, "Message": [signal]{"usage":{"process":"java","stddev":7.55626377842431,"cpu":85.77531585041434}}. As you can see, the Java process consumes almost all of the CPU time. But this is not a final diagnosis; we are continuing to investigate the reasons for these incidents.
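For anyone who wants to check resource pressure on their own jobs, here is a rough sketch of a background monitor that can be dropped into a job's steps (not an official diagnostic; the interval, file name, and artifact name are arbitrary):

```yaml
# Sample memory and the top CPU consumers every 30 seconds in the background,
# writing to a log file in the workspace.
- name: Start resource monitor
  shell: bash
  run: |
    (while true; do
       date
       free -m
       ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 5
       echo "----"
       sleep 30
     done) >> resource-usage.log 2>&1 &

# ... the existing build steps go here ...

# Upload the log so passing and failing runs can be compared. Note that if the
# runner itself is shut down, this step may not get a chance to execute.
- name: Upload resource log
  if: always()
  uses: actions/upload-artifact@v3
  with:
    name: resource-usage
    path: resource-usage.log
```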
Thank you @erik-bershel 😃
Hello, I started experiencing this exact issue today on one of my private repositories that has otherwise run just fine for over a year. There were no changes to the runner, though I did add a terraform command to one of the scripts it executes; I don't think that would have an impact.
I am the only one with access to this private repository. I did not cancel the Action. This is happening repeatedly about 4 minutes into the run.
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
The operation was canceled.
Error: Process completed with exit code 143.
Thanks, adding a watch to this thread in the meantime.
Hi @kNoAPP, Please provide links to both positive and negative runs. I cannot say anything specific about reasons without links.
Hi @erik-bershel, I've got the same error in our workflows; as a runner we are using ubuntu-latest-4-cores.
Basically, during the job, we have these steps: clone -> aws-actions/configure-aws-credentials -> setup node -> setup cache -> npm commands -> docker/build-push-action
Job: https://github.com/FindHotel/daedalus/actions/runs/4125469274/jobs/7126107593 ❌
Thank you!
It turns out that my issue was a Gradle issue under the hood; they even have an open issue for it: https://github.com/gradle/gradle/issues/19750
Summary: if you already have org.gradle.jvmargs in your gradle.properties, you need to specify all the arguments, since you lose some of the default values.
I got the solution by adding this line:
org.gradle.jvmargs=-Xmx8g -XX:MaxMetaspaceSize=2g -XX:MaxPermSize=2g -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParallelGC -Dfile.encoding=UTF-8
Note: the important thing is not the values I provided, but the fact that I supply all the parameters, so you can change the values. Or, of course, not having this line at all is also a solution.
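A quick way to double-check this in CI is a step that prints the effective override before the build runs. A sketch only, assuming a Linux runner and a gradle.properties at the repository root:

```yaml
# If org.gradle.jvmargs is overridden at all, every default you still want
# (heap size, metaspace, GC, file encoding) has to be restated explicitly.
- name: Show Gradle daemon JVM args
  shell: bash
  run: grep -H 'org.gradle.jvmargs' gradle.properties || echo "No override found; Gradle daemon defaults apply."
```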
I also found a temporary workaround. For me, running golangci-lint run -v in one of my build steps seems to be the culprit. Still unsure why this started happening now, of all times.
Facing the same issue on the following Gradle execution:
gradle(task: "clean bundleStage --parallel")
Locally the Gradle command succeeds. The pipeline is failing only with the recent push, whose main change is updating the Android Gradle Plugin to 7.1.3. I am still using the following initialisation code to set up the environment:
name: Deploy to Firebase App Distribution [Stage]
runs-on: ubuntu-latest
steps:
  - name: Checkout
    uses: actions/checkout@v3
  - uses: actions/checkout@v3
  - name: Set up JDK
    uses: actions/setup-java@v3
    with:
      distribution: 'zulu'
      java-version: 11
      check-latest: true
      cache: 'gradle'
  - uses: ruby/setup-ruby@v1
    with:
      ruby-version: '3.0' # Not needed with a .ruby-version file
      bundler-cache: true
Do I need to update my Java version to 17?
The detailed stack trace on failure is as follows; it is the same ProcessInvoker.ExecuteAsync cancellation trace shown earlier in this thread:
[16:51:07]: *** finished with errors
Error: The operation was canceled.
Any idea what could be going wrong?
The issue was resolved by adding the Gradle properties flags suggested by @mustafaozhan.
Hi @erik-bershel, I've got the same error in our workflows, as a runner we are using ubuntu-latest-4-cores. Basically, during the job, we have these steps: clone -> aws-actions/configure-aws-credentials -> setup node -> setup cache -> npm commands -> docker/build-push-action
Job: https://github.com/FindHotel/daedalus/actions/runs/4125469274/jobs/7126107593 ❌
Thank you!
Update!
I've fixed my issue by changing the runner from ubuntu-latest-4-cores to ubuntu-latest-16-cores and also increasing the value used for --max-old-space-size.
Ref.: https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-megabytes
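For reference, one way to apply the heap increase to every Node process in a job is the NODE_OPTIONS environment variable rather than per-command flags. A sketch only; the 6144 MB value, Node version, and build commands are placeholders and should be sized to the runner:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest-16-cores
    env:
      # Raise the V8 old-space limit for all Node processes started in this job.
      NODE_OPTIONS: --max-old-space-size=6144
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18
      # Placeholder build; replace with the project's real commands.
      - run: npm ci && npm run build
```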
Following up: we've tried changing from ubuntu-latest-4-cores to a specified ubuntu-18-4-cores (defined as Ubuntu 18.04) and still see the same issue on our pull_request (types: [opened, synchronize]) jobs. Worth noting that the jobs that succeed with the same runner config and workflow job steps are all triggered on push.
https://github.com/gaia-family/monorepo/actions/runs/4165179342/jobs/7207816997
Edit: tried the same config on a branch push and it failed the same way as the pull_request opened/synced runs.
on:
  push:
    branches-ignore:
      - main
      - preview
      - production
https://github.com/gaia-family/monorepo/actions/runs/4165419445/attempts/1
Hey @erik-bershel I think I've identified the root cause of the issue here. In the jobs that were failing we had this as the last run step of an action:
# tidy up any background node processes if they exist
- run: |
    killall node || true
  shell: bash
It would appear that there's some environmental inconsistency (perhaps even documented, but not obvious) which means the assumed sandbox we're running our steps in is not actually as isolated as we might expect on these larger runners (compared to the standard runners). Presumably we end up killing a Node-based process managing the GitHub runner job, so the job appears as terminated.
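If that kind of cleanup is needed, a safer pattern is to remember the PID of the process you started and kill only that one, rather than every node process on the machine. A sketch (the server.js file, server.pid path, and step names are purely illustrative):

```yaml
- name: Start background server
  shell: bash
  run: |
    # Start the process and record its PID instead of relying on killall later.
    node server.js > server.log 2>&1 &
    echo $! > server.pid

# ... test steps that use the server go here ...

- name: Stop background server
  if: always()
  shell: bash
  run: |
    # Kill only the process we started; killall node can also take down
    # Node-based runner/action processes on these machines.
    if [ -f server.pid ]; then
      kill "$(cat server.pid)" || true
    fi
```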
Seems like the issue is in ubuntu-latest; I tried ubuntu-20.04 and it's passing consistently.
Description
Since yesterday, our GitHub Actions builds started to randomly fail (we didn't change anything in our configuration). The error is not very precise, unfortunately.
The process is stopped at random stages of the build (but always after at least 15 minutes or so). Even when the build passes, it takes much longer than before (from a ~25 min clean build to ~35 min now).
Sometimes before the shutdown signal there is also a log line like this:
Idle daemon unexpectedly exit. This should not happen.
The workflow passes normally on builds that are shorter (for example, those that hit the cache).
Platforms affected
Runner images affected
Image version and build link
Image: ubuntu-22.04 Version: 20221127.1 Current runner version: '2.299.1'
Unfortunately, it happens on a private repo.
Is it regression?
No
Expected behavior
Job should pass
Actual behavior
Job fails
Repro steps
Looks similar to: https://github.com/actions/runner-images/issues/6680