actions/runner

Workflow failure due to runner shutdown/stoppage #2040

Open hamidgg opened 2 years ago

hamidgg commented 2 years ago

Description

Since 30 July 2022, our workflow has been failing with the following message:

"The self-hosted runner: ***** lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."

We run our workflow on an AWS EC2 instance which is always connected and has enough resources (CPU/memory). The above failure happens even for the parts of the workflow that don't require high utilization of CPU/memory.

It seems that the runner loses communication with GitHub and does not continue running the job.

Log

[2022-08-05 09:05:47Z INFO JobServer] Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 48, failed calls: 2, websocket state: Open).
[2022-08-05 09:05:47Z ERR JobServer] System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake. ---> System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.. ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.

Runner Version and Platform

Version 2.294.0, running on Windows Server 2016.

AvaStancu commented 2 years ago

@hamidgg would it be possible to share the RunnerListener and RunnerWorker logs too?
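
For reference, the RunnerListener and RunnerWorker logs are written to the runner's _diag directory. A minimal sketch for grabbing the relevant files, assuming the default install layout (adjust the path to your setup):

# Collect recent runner diagnostic logs (path is an assumption; use your install dir)
cd /path/to/actions-runner/_diag
ls -lt Runner_*.log Worker_*.log | head
# Tail the most recent worker log around the failure time
tail -n 200 "$(ls -t Worker_*.log | head -n 1)"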

liu-shaojun commented 2 years ago

I got the same error when running jobs on our self-hosted runners.

The self-hosted runner: xxxx lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Other people have also reported the same error message: https://github.com/actions/runner/issues/1546#issuecomment-1204016156 and https://github.com/actions/runner/issues/1546#issuecomment-1214570074 (cc @pinggao187)

Hi @AvaStancu, could you help check on this? This issue is blocking our progress...

seantleonard commented 2 years ago

I am getting this as well. Unfortunately, the other issue referenced is closed, but it has many reports (even after closure) of the same behavior.

hamidgg commented 2 years ago

@AvaStancu Sorry for my delayed reply. I've been waiting for another failure to get the RunnerListener and RunnerWorker logs as previous logs were cleaned up. I'll get back to you with the logs once a similar failure happens (hopefully not :D).

zaknafein83 commented 2 years ago

Same situation here. I use AWS EC2 and, after the first run, I get this error:

[2022-09-16 10:42:53Z INFO JobServerQueue] All queue process tasks have been stopped, and all queues are drained.
[2022-09-16 10:42:53Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/_work/_temp
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/actions-runner/bin'
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/actions-runner'
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Diag': '/home/ubuntu/actions-runner/_diag'
[2022-09-16 10:42:53Z INFO HostContext] Well known config file 'Telemetry': '/home/ubuntu/actions-runner/_diag/.telemetry'
[2022-09-16 10:42:53Z INFO JobRunner] Raising job completed event
[2022-09-16 10:42:53Z ERR GitHubActionsService] POST request to https://pipelines.actions.githubusercontent.com/HCYdTxD8O2BMG4LvM5MKcb35EY0sH1wNedn0yWzJce2QlAajYJ/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/3c6e1db1-0d9c-4bbe-b2fe-376050e30856/events failed. HTTP Status: BadRequest, AFD Ref: Ref A: 0FFF862B7A304ECE8208266780675163 Ref B: MIL30EDGE1321 Ref C: 2022-09-16T10:42:53Z
[2022-09-16 10:42:53Z ERR JobRunner] TaskOrchestrationPlanTerminatedException received, while attempting to raise JobCompletedEvent for job ca395085-040a-526b-2ce8-bdc85f692774.
[2022-09-16 10:42:53Z ERR JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationPlanTerminatedException: Orchestration plan 3c6e1db1-0d9c-4bbe-b2fe-376050e30856 is not in a runnable state.
   at GitHub.Services.WebApi.VssHttpClientBase.HandleResponseAsync(HttpResponseMessage response, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpMethod method, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
   at GitHub.DistributedTask.WebApi.TaskHttpClient.RaisePlanEventAsync[T](Guid scopeIdentifier, String planType, Guid planId, T eventData, CancellationToken cancellationToken, Object userState)
   at GitHub.Runner.Worker.JobRunner.CompleteJobAsync(IJobServer jobServer, IExecutionContext jobContext, AgentJobRequestMessage message, Nullable`1 taskResult)

still waiting for a solution :(

lankmiler commented 2 years ago

@zaknafein83 did you resolve it? @AvaStancu I have the same issue now. We're using an EC2 instance that we stop at the end of the workflow and start at the beginning of it.
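
For reference, a minimal sketch of that start/stop pattern using the AWS CLI, with a hypothetical instance ID and credentials already configured (our actual workflow steps may differ slightly):

# Start the runner instance at the beginning of the workflow (instance ID is hypothetical)
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0

# ... jobs run on the self-hosted runner ...

# Stop it again at the end of the workflow
aws ec2 stop-instances --instance-ids i-0123456789abcdef0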

lankmiler commented 2 years ago
[2022-09-22 11:19:56Z INFO HostContext] Well known directory 'Work': '/home/ec2-user/actions-runner/_work'
[2022-09-22 11:19:57Z INFO JobServer] Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 21, failed calls: 1, websocket state: Open).
[2022-09-22 11:19:57Z ERR  JobServer] System.Net.WebSockets.WebSocketException (2): The remote party closed the WebSocket connection without completing the close handshake. ---> System.IO.IOException: Unable to write data to the transport connection: Broken pipe.
 ---> System.Net.Sockets.SocketException (32): Broken pipe
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
   at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.WriteWithoutBufferingAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
--- End of stack trace from previous location ---
zaknafein83 commented 2 years ago

> @zaknafein83 did you resolve it? @AvaStancu I have the same issue now. We're using an EC2 instance that we stop at the end of the workflow and start at the beginning of it.

Not yet, I have to restart my instance every time.

chantra commented 2 years ago

[taking my comment out, I think I got some logs mixed up]

Benikz commented 1 year ago

Hello all,

I would like to point out how big this issue is. The lost-communication problem is being seen all over the place (see above). We are now hosting our own runners (GitHub Enterprise) and we see this very often, but we cannot pinpoint the root cause.

We have enough RAM and a good internet connection. We think the runners do not receive enough CPU time while we build applications, although we would expect the connection to stabilize sooner or later. For testing purposes, we tried adjusting the niceness of the run.sh script to give it more CPU priority, but this does not seem to have any positive effect. The runners are ephemeral, although that should not be an issue.
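
For context, a sketch of the kind of priority tweak meant here, assuming the default actions-runner directory layout (path and nice value are assumptions; negative nice values require root):

# Give the runner (and its child job processes) a higher CPU priority
cd /opt/actions-runner
sudo nice -n -10 ./run.sh

# Or, for an already-running runner process:
sudo renice -n -10 -p "$(pgrep -f Runner.Listener | head -n 1)"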

The run.sh script does not fail or succeed. It just stops without any clear error.

Whatever the reason: we expect runners to be as stable as Jenkins nodes. It should not matter if the system is overloaded while building. The connection dies randomly. Sometimes our builds succeed as expected, but sometimes they just die.

Can this issue get more attention, please? Connection stability is a core feature. If it cannot be guaranteed, then the runners are simply unusable.

Sorry for being harsh, but this has literally been an issue for months now.

Stay healthy!

BR


Edit: We use:

Benikz commented 1 year ago

I did some more investigation and apparently it was a problem on our side while instantiating the runner via a VM through systemd management. The problem was a mixture of how our VM solution works with systemd.

I am not sure about the others now... It seems to be stable now, after fixing our services.


For those interested: we use Vagrant, which uses VirtualBox under the hood. Systemd instantiation (the @ symbol in the unit name) works fine, but VirtualBox uses a service process that is bound to only one systemd service (cgroup). When the systemd instance that owned the VirtualBox service finished, it stopped all of the other boxes, which then led to the communication problem. This makes sense, as the communication to the provider was simply cut off. It has nothing to do with the runner issue mentioned in this post, but I wanted to point out that in our case the problem was caused by our own backend. I would suggest people here check how they create their runners; maybe they are being stopped prematurely by something else. Sorry for being harsh again.
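
For anyone wanting to rule out something similar, a quick way to check which systemd unit/cgroup actually owns the runner and VM processes (the process name is an assumption based on a default runner install):

# Show the cgroup (and therefore the owning systemd unit) of the runner process
cat /proc/"$(pgrep -f Runner.Listener | head -n 1)"/cgroup

# Browse the whole cgroup tree to see which services own which processes
systemd-cgls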

Best regards!

ololobus commented 1 year ago

We started having the same issues about a month ago; we use custom-sized GitHub-hosted runners, though. Not sure, but it feels like it happens more often with 16-core runners.

The message is:

The hosted runner: XXX lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

If I try to get raw logs, they are almost empty, although a few job steps succeeded:

2022-12-29T13:14:48.9912020Z Requested labels: XXX
2022-12-29T13:14:48.9912099Z Job defined at: yyy/xxx/.github/workflows/testing.yml@refs/pull/1111/merge
2022-12-29T13:14:48.9912131Z Waiting for a runner to pick up this job...
2022-12-29T13:17:11.0365729Z Job is about to start running on the runner: XXX (organization)

Not sure if this is related, but initially they were defined as Ubuntu 20.04 runners; after December 15 they started using 22.04. The warning was:

Runner 'XXX' will start to use Ubuntu 22.04 starting from 15 December

UPD: I 'fixed' this by re-creating the runner group in the GitHub UI. That is, we had gha-ubuntu-20.04-16cores (automatically upgraded to 22.04 by GitHub), so I created and used gha-ubuntu-22.04-8cores instead. It magically helped; all runs now pass without any problems. Leaving this here as it may help someone.

And this makes me wonder: why? I thought a runner group was just a stateless abstraction to limit usage, but it appears to be stateful, i.e. it binds to some infrastructure (?), so if it has problems, you will too, and re-creating the group may help.

jbkc85 commented 1 year ago

I am now having this experience with self-hosted runners in AWS with no apparent cause. Disk is fine, memory is fine, CPU is fine, but a GitHub runner just randomly decides it can no longer talk to the GitHub WebSocket endpoint and fails to reconnect.

That being said, I do see that a number of the WebSocket connections (5-10%) during a workflow run error out, causing the WebSocket process to reconnect. Not sure if this is related or a red herring.

chtompki commented 1 year ago

+1 to this thread, and I'm even using --ephemeral runners, which should limit each runner to one job. I'm thinking about deliberately stacking jobs onto a single runner with metadata and then deleting that runner when everything is done, but that defeats the purpose.
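
For reference, the kind of ephemeral registration meant here — a minimal sketch with placeholder URL and token; the runner deregisters itself after a single job:

# Register a runner that takes exactly one job and then removes itself
./config.sh --url https://github.com/ORG/REPO --token <REGISTRATION_TOKEN> --ephemeral --unattended
./run.sh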

parker-vv commented 1 year ago

+1 same issue.

kirillmorozov commented 1 year ago

Faced the same issue when using AWS EC2 instances as self-hosted runners.

The self-hosted runner: i-020cc48127fe3f0bc lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

densto88 commented 1 year ago

We're seeing this on hosted runners also; here's the stack trace from our worker logs...

[2023-04-09 18:18:04Z ERR  JobServer] #####################################################
[2023-04-09 18:18:04Z ERR  JobServer] System.IO.IOException: Unable to write data to the transport connection: Broken pipe.
 ---> System.Net.Sockets.SocketException (32): Broken pipe
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
   at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---

   --- End of inner exception stack trace ---
   at System.Net.Security.SslStream.<WriteSingleChunk>g__CompleteWriteAsync|182_1[TIOAdapter](ValueTask writeTask, Byte[] bufferToReturn)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
[2023-04-09 18:18:04Z ERR  JobServer] #####################################################
[2023-04-09 18:18:04Z ERR  JobServer] System.Net.Sockets.SocketException (32): Broken pipe
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
   at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
   at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
   at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
   at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
   at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---

[2023-04-09 18:18:04Z INFO JobServer] Websocket is not open, let's attempt to connect back again with random backoff 00:00:00.2370000 ms (total calls: 159, failed calls: 12).
kirillmorozov commented 1 year ago

Update to my case:

I was able to resolve the issue by using larger EC2 instances, so yeah, CPU/Memory starvation was the cause of this problem.
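
If anyone wants to confirm starvation before resizing, these are the kinds of checks to run on the runner host around the failure time (a sketch; commands assume a standard Linux image):

# Did the kernel OOM-killer fire? (would explain a killed runner/worker process)
sudo dmesg -T | grep -iE 'out of memory|oom-killer'

# Rough CPU/memory pressure while a job is running
vmstat 5
free -h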

Togtja commented 9 months ago

We are also experiencing similar problems. We are also running with --ephemeral, and we have 4 builds in parallel compiling C++ code. The communication loss seems to happen mostly during an artifact upload to GitHub using the upload-artifact action (https://github.com/actions/upload-artifact). However, it has also lost connection during a post-checkout stage. We have tried some EC2 machines that are insanely overkill for the task; however, the problem still seems to persist.

Since we are running 4 builds at a time, it nearly always fails in one of them. 3 are Ubuntu-based and 1 is Windows, but it does not seem to affect just one type of OS. The GitHub runner version is always the newest, as the runners are created via a script that fetches the latest release.

Our current "solution" is just to "rebuild failed jobs" until it works. However, long term this is unacceptable.

Similar to https://github.com/actions/runner/issues/2624#issuecomment-1592427664
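
For the "rebuild failed jobs" workaround, the gh CLI can at least automate the re-run — a sketch, assuming gh is authenticated and using a hypothetical run ID:

# Re-run only the failed jobs of a given workflow run (run ID is hypothetical)
gh run rerun 1234567890 --failed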

machulav commented 8 months ago

We have the same issue; it happens from time to time with our runners.

jbkc85 commented 8 months ago

For my particular scenario, the web socket errors are a red herring and aren't necessarily associated with the random loss of a runner.

If you are in AWS and running on Spot instances, then depending on the Spot instance settings and autoscaling, it's very possible that the Spot instances are heavily associated with the random loss of a GitHub runner instance. In our particular case, we went from SPOT to ON_DEMAND node groups and went from 30-40% failure rates to 0.01%.
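
To check whether Spot interruptions are the culprit, the EC2 instance metadata service exposes an interruption notice; a minimal sketch to poll it from the runner host (IMDSv2 token flow shown; the endpoint returns nothing/404 unless an interruption is scheduled):

# Check for a pending Spot interruption via IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action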

signor-mike commented 4 months ago

I downgraded Ubuntu from 22.04 LTS to 20.04 LTS and the workflow is no longer exhausting anything.