hamidgg opened this issue 2 years ago (status: Open)
@hamidgg would it be possible to share the RunnerListener and RunnerWorker logs too?
I got the same error when running jobs on self-hosted runners:
The self-hosted runner: xxxx lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
Other people also report the same error message: https://github.com/actions/runner/issues/1546#issuecomment-1204016156, https://github.com/actions/runner/issues/1546#issuecomment-1214570074. @pinggao187
Hi @AvaStancu, could you help check on this? This issue is kind of blocking our progress...
I am getting this as well. Unfortunately the other issue referenced is closed, but it has many, many reports (even after closure) of the same behavior.
@AvaStancu Sorry for my delayed reply. I've been waiting for another failure to get the RunnerListener and RunnerWorker logs as previous logs were cleaned up. I'll get back to you with the logs once a similar failure happens (hopefully not :D).
Same situation. I use AWS EC2 and, after the first run, get this error:
[2022-09-16 10:42:53Z INFO JobServerQueue] All queue process tasks have been stopped, and all queues are drained.
[2022-09-16 10:42:53Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/_work/_temp
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/actions-runner/bin'
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/actions-runner'
[2022-09-16 10:42:53Z INFO HostContext] Well known directory 'Diag': '/home/ubuntu/actions-runner/_diag'
[2022-09-16 10:42:53Z INFO HostContext] Well known config file 'Telemetry': '/home/ubuntu/actions-runner/_diag/.telemetry'
[2022-09-16 10:42:53Z INFO JobRunner] Raising job completed event
[2022-09-16 10:42:53Z ERR GitHubActionsService] POST request to https://pipelines.actions.githubusercontent.com/HCYdTxD8O2BMG4LvM5MKcb35EY0sH1wNedn0yWzJce2QlAajYJ/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/3c6e1db1-0d9c-4bbe-b2fe-376050e30856/events failed. HTTP Status: BadRequest, AFD Ref: Ref A: 0FFF862B7A304ECE8208266780675163 Ref B: MIL30EDGE1321 Ref C: 2022-09-16T10:42:53Z
[2022-09-16 10:42:53Z ERR JobRunner] TaskOrchestrationPlanTerminatedException received, while attempting to raise JobCompletedEvent for job ca395085-040a-526b-2ce8-bdc85f692774.
[2022-09-16 10:42:53Z ERR JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationPlanTerminatedException: Orchestration plan 3c6e1db1-0d9c-4bbe-b2fe-376050e30856 is not in a runnable state.
at GitHub.Services.WebApi.VssHttpClientBase.HandleResponseAsync(HttpResponseMessage response, CancellationToken cancellationToken)
at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpMethod method, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
at GitHub.DistributedTask.WebApi.TaskHttpClient.RaisePlanEventAsync[T](Guid scopeIdentifier, String planType, Guid planId, T eventData, CancellationToken cancellationToken, Object userState)
at GitHub.Runner.Worker.JobRunner.CompleteJobAsync(IJobServer jobServer, IExecutionContext jobContext, AgentJobRequestMessage message, Nullable`1 taskResult)
still waiting for a solution :(
@zaknafein83 did you resolve it? @AvaStancu I have the same issue now. We're using an EC2 instance; we stop it at the end of the workflow and start it at the beginning.
[2022-09-22 11:19:56Z INFO HostContext] Well known directory 'Work': '/home/ec2-user/actions-runner/_work'
[2022-09-22 11:19:57Z INFO JobServer] Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 21, failed calls: 1, websocket state: Open).
[2022-09-22 11:19:57Z ERR JobServer] System.Net.WebSockets.WebSocketException (2): The remote party closed the WebSocket connection without completing the close handshake. ---> System.IO.IOException: Unable to write data to the transport connection: Broken pipe.
---> System.Net.Sockets.SocketException (32): Broken pipe
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
at System.Net.Http.HttpConnection.WriteWithoutBufferingAsync(ReadOnlyMemory`1 source, Boolean async)
at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendFrameAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
at System.Threading.TimerQueue.FireNextTimers()
at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()
--- End of stack trace from previous location ---
@zaknafein83 did you resolve it? @AvaStancu I have the same issue now. We're using an EC2 instance; we stop it at the end of the workflow and start it at the beginning.
not yet, I must restart my instance every time
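For context, the start/stop pattern described above (start the EC2 instance at the beginning of the workflow, stop it at the end) is usually implemented as a wrapper around the real job. A minimal sketch of such a workflow; the workflow name, region, role, instance ID, and secret names are all placeholder assumptions, not taken from this thread:

```yaml
name: build-on-ec2-runner
on: [push]
jobs:
  start-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-region: us-east-1
          role-to-assume: ${{ secrets.RUNNER_ROLE_ARN }}
      - name: Start the self-hosted runner instance
        run: |
          aws ec2 start-instances --instance-ids "${{ secrets.RUNNER_INSTANCE_ID }}"
          aws ec2 wait instance-running --instance-ids "${{ secrets.RUNNER_INSTANCE_ID }}"
  build:
    needs: start-runner
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - run: make build
  stop-runner:
    needs: build
    if: always()   # stop the instance even if the build failed
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-region: us-east-1
          role-to-assume: ${{ secrets.RUNNER_ROLE_ARN }}
      - name: Stop the self-hosted runner instance
        run: aws ec2 stop-instances --instance-ids "${{ secrets.RUNNER_INSTANCE_ID }}"
```

One failure mode worth checking with this pattern: if the stop step (or an external scheduler) stops the instance while the runner process is still reporting job completion, a "lost communication" error is exactly what you would expect to see.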
[taking my comment out, I think I got some logs mixed up]
Hello all,
I would like to point out the big issue here. The lost-communication problem seems to be seen all over the place (see above). We are hosting our own runners now (GitHub Enterprise) and we see this very often, but we cannot pinpoint the root cause.
We have enough RAM and a good internet connection. We suspect the runners do not receive enough CPU time while we build applications, although we would expect the connection to stabilize sooner or later. For testing purposes, we tried increasing the niceness of the run.sh script; this did not seem to have any positive effect. The runners are ephemeral, though that should not be an issue.
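For reference, a niceness experiment like the one mentioned can be sketched as below. The runner path is a placeholder, and the priority values are arbitrary; note that raising priority (a negative nice value) requires root, which is why the demonstration uses a harmless positive value:

```shell
#!/bin/sh
# Placeholder path to wherever the runner is installed:
RUNNER=${RUNNER:-/home/runner/actions-runner/run.sh}

# Check what niceness a child process would inherit before committing to it
# (`nice` with no arguments prints the current niceness):
nice -n 10 sh -c 'nice'

# Then launch the runner under that niceness (commented out here):
# nice -n 10 "$RUNNER"

# Or adjust an already-running listener in place (requires root for negative values):
# renice -n -5 -p "$(pgrep -f Runner.Listener)"
```

Note that niceness only helps against CPU starvation; it does nothing for memory pressure or network loss, which the error message also names as causes.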
The run.sh script does not fail or succeed; it just stops without any clear error.
Whatever the reason: we expect runners to be as stable as Jenkins nodes. It should not matter if the system is overloaded while building. The connection dies randomly. Sometimes our builds even succeed, as expected; but sometimes they just die.
Can this issue be given more attention, please? This is a core feature: connection stability. If this cannot be guaranteed, then runners are simply unusable.
Sorry for being harsh, but this has literally been an issue for months now.
Stay healthy!
BR
Edit: We use:
I did some more investigation, and apparently it was a problem on our side while instantiating the runner in a VM managed through systemd. The problem was a mixture of how our VM solution interacts with systemd.
I am not sure about the others now... It seems to be stable now, after fixing our services.
For the interested people: we use Vagrant, which uses VirtualBox under the hood. Systemd instantiation works pretty well (the @ symbol in the unit name), but VirtualBox uses a service process that is bound to only one systemd service (cgroup). When the systemd instance holding the VBox service process stopped, it took down all of the other boxes as well, which then led to the communication problem. This makes sense, as the communication to the provider was simply cut off. It has nothing to do with the runner issue mentioned in this post, but I wanted to point out that in our case the problem was caused by our own backend. I would suggest that people here check how they create their runners; maybe they are being stopped prematurely by something else. Sorry for being harsh again.
Best regards!
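For anyone hitting the same systemd/VirtualBox interaction: the shared VirtualBox service process (VBoxSVC) ends up in the cgroup of whichever templated unit spawned it, so stopping that unit kills it for every other instance. A hedged sketch of a templated unit that avoids killing shared helper processes on stop; the unit name, user, and paths are placeholders, not from this thread:

```ini
# /etc/systemd/system/gha-runner-vm@.service  (placeholder name)
[Unit]
Description=GitHub Actions runner VM %i
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
User=runner
WorkingDirectory=/home/runner/vms/%i
ExecStart=/usr/bin/vagrant up
ExecStop=/usr/bin/vagrant halt
# By default systemd kills every process left in the unit's cgroup on stop,
# including a shared VBoxSVC that other instances still depend on.
# KillMode=process only kills the main process; it is a blunt instrument
# (stray processes survive a failed ExecStop), but it avoids the cross-instance
# shutdown described above.
KillMode=process

[Install]
WantedBy=multi-user.target
```

With a unit like this, `systemctl stop gha-runner-vm@box1` halts only that box via `vagrant halt`, instead of tearing down the whole VirtualBox cgroup.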
We started having the same issues ~1 month ago; we use custom-sized GitHub-hosted runners, though. Not sure, but it feels like it happens more often with 16-core runners.
The message is:
The hosted runner: XXX lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
If I try to get raw logs, they are almost empty, although a few job steps succeeded:
2022-12-29T13:14:48.9912020Z Requested labels: XXX
2022-12-29T13:14:48.9912099Z Job defined at: yyy/xxx/.github/workflows/testing.yml@refs/pull/1111/merge
2022-12-29T13:14:48.9912131Z Waiting for a runner to pick up this job...
2022-12-29T13:17:11.0365729Z Job is about to start running on the runner: XXX (organization)
Not sure this is related, but initially they were defined as Ubuntu 20.04 runners; after December 15 they started using 22.04. The warning was:
Runner 'XXX' will start to use Ubuntu 22.04 starting from 15 December
UPD: I 'fixed' this by just re-creating the runner group in the GitHub UI. I.e. we had gha-ubuntu-20.04-16cores (automatically upgraded to 22.04 by GitHub), so I created and used gha-ubuntu-22.04-8cores instead. And it magically helped; all runs are passing now without any problems. Leaving it here as it may help someone.
And this makes me wonder: why? I thought that a runner group is just some stateless abstraction to limit usage, but it appears to be something stateful, i.e. it binds to some infrastructure (?), so if it has problems, you will have them too, and re-creating the group may help.
I am now having this experience with self-hosted runners in AWS with no apparent cause. Disk is fine, memory is fine, CPU is fine; but just randomly a GitHub runner decides it can no longer talk to the GitHub web sockets and fails to reconnect.
That being said, I see that a fair number (5-10%) of the web socket connections during a workflow run error out and cause the web socket process to reconnect. Not sure if this is related, or a red herring.
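To tell whether the reconnects correlate with lost runners, it can help to count the websocket fallback/reconnect lines in the worker diagnostics (the `_diag` directory that appears in the logs above). A small sketch, assuming the log messages look like the ones quoted in this thread:

```shell
#!/bin/sh
# Count websocket trouble per runner diagnostic log file.
# Usage: count_ws_errors _diag/Worker_*.log
count_ws_errors() {
    # Messages quoted elsewhere in this thread:
    #   "Caught exception during append web console line to websocket ..."
    #   "Websocket is not open, let's attempt to connect back again ..."
    grep -hc -e 'Caught exception during append web console line' \
             -e 'Websocket is not open' "$@"
}
```

If the counts are high even on runs that succeed, the websocket errors are probably the red herring described above rather than the cause of the lost runner.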
+1 to this thread, and I'm even using --ephemeral runners, which should mean one job per runner. I'm thinking about deliberately stacking jobs onto a single runner with metadata and then deleting that runner when everything is done, but that defeats the purpose.
+1 same issue.
Faced the same issue when using AWS EC2 instances as self-hosted runners.
The self-hosted runner: i-020cc48127fe3f0bc lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
We're seeing this in hosted runners also, here's the stack trace from our worker logs...
[2023-04-09 18:18:04Z ERR JobServer] #####################################################
[2023-04-09 18:18:04Z ERR JobServer] System.IO.IOException: Unable to write data to the transport connection: Broken pipe.
---> System.Net.Sockets.SocketException (32): Broken pipe
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
at System.Threading.TimerQueue.FireNextTimers()
at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---
--- End of inner exception stack trace ---
at System.Net.Security.SslStream.<WriteSingleChunk>g__CompleteWriteAsync|182_1[TIOAdapter](ValueTask writeTask, Byte[] bufferToReturn)
at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
[2023-04-09 18:18:04Z ERR JobServer] #####################################################
[2023-04-09 18:18:04Z ERR JobServer] System.Net.Sockets.SocketException (32): Broken pipe
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.SendAsyncForNetworkStream(Socket socket, CancellationToken cancellationToken)
at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](TIOAdapter writeAdapter, ReadOnlyMemory`1 buffer)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.Security.SslStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source, Boolean async)
at System.Net.Http.HttpConnection.RawConnectionStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, Task lockTask, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendAsync(ReadOnlyMemory`1 buffer, WebSocketMessageType messageType, WebSocketMessageFlags messageFlags, CancellationToken cancellationToken)
at System.Net.WebSockets.ManagedWebSocket.SendAsync(ArraySegment`1 buffer, WebSocketMessageType messageType, Boolean endOfMessage, CancellationToken cancellationToken)
at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at GitHub.Runner.Common.JobServer.AppendTimelineRecordFeedAsync(Guid scopeIdentifier, String hubName, Guid planId, Guid timelineId, Guid timelineRecordId, Guid stepId, IList`1 lines, Nullable`1 startLine, CancellationToken cancellationToken)
at GitHub.Runner.Common.JobServerQueue.ProcessWebConsoleLinesQueueAsync(Boolean runOnce)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
at System.Threading.TimerQueue.FireNextTimers()
at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
at System.Threading.UnmanagedThreadPoolWorkItem.ExecuteUnmanagedThreadPoolWorkItem(IntPtr callback, IntPtr state)
at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---
[2023-04-09 18:18:04Z INFO JobServer] Websocket is not open, let's attempt to connect back again with random backoff 00:00:00.2370000 ms (total calls: 159, failed calls: 12).
Update to my case:
I was able to resolve the issue by using larger EC2 instances, so yeah, CPU/Memory starvation was the cause of this problem.
We are also experiencing similar problems.
We are also running with --ephemeral, and we have 4 builds in parallel compiling C++ code.
The communication loss seems to mostly happen during an artifact upload to GitHub using the Upload Action (https://github.com/actions/upload-artifact). However, it has also lost the connection during a post-checkout stage.
We have attempted to use EC2 machines that are insanely overkill for the task; however, the problem still seems to persist.
Since we are running 4 builds at a time, it nearly always fails in one of them. 3 are Ubuntu based and 1 is Windows, but it does not seem to affect just one type of OS. The GitHub runner version is always the newest, as the runners are created via a script that fetches the newest runner.
Our current "solution" is just to "re-run failed jobs" until it works. However, long term this is unacceptable.
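The "re-run until it works" workaround can at least be automated with a generic retry wrapper; a minimal sketch (the gh CLI invocation at the end is illustrative and assumes a `RUN_ID` variable that is not from this thread):

```shell
#!/bin/sh
# Retry a command up to N times with a fixed delay between attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
    n=$1; delay=$2; shift 2
    i=1
    while ! "$@"; do
        # Give up once the attempt budget is spent.
        [ "$i" -ge "$n" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Hypothetical example: re-run only the failed jobs of a workflow run
# retry 3 30 gh run rerun "$RUN_ID" --failed
```

This obviously treats the symptom, not the cause; it just replaces clicking "Re-run failed jobs" by hand.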
Similar to https://github.com/actions/runner/issues/2624#issuecomment-1592427664
We have the same issue, which happens from time to time with our runners
For my particular scenario, the web socket errors are a red herring and aren't necessarily associated with the random loss of a runner.
If you are in AWS and running on Spot instances, then depending on the Spot instance settings and autoscaling, it's very possible that the Spot instances are heavily associated with the random loss of a GitHub runner instance. In our particular case, we went from SPOT to ON_DEMAND node groups and the failure rate dropped from 30-40% to 0.01%.
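One way to confirm Spot reclamation as the cause is to poll the EC2 instance metadata endpoint for a pending interruption from inside the runner. A minimal sketch; the metadata URL is the standard EC2 one, and the function takes an optional URL override purely so it can be exercised outside EC2:

```shell
#!/bin/sh
# Print the pending Spot interruption notice, if any.
# On EC2 this endpoint returns 404 until roughly two minutes before
# reclamation; on instances enforcing IMDSv2 a session token header
# is additionally required (not shown here).
spot_action() {
    url=${1:-http://169.254.169.254/latest/meta-data/spot/instance-action}
    curl -fs "$url"
}

# Example: warn from a cron job or a pre-job step
# spot_action && echo "WARNING: this runner is about to be reclaimed"
```

If a workflow's "lost communication" failures line up with these notices, the fix is capacity policy (on-demand or more diverse Spot pools), not the runner itself.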
I downgraded Ubuntu from 22.04 LTS to 20.04 LTS and the workflow is no longer exhausting anything.
Description
Since 30 July 2022, our workflow fails with the following message:
"The self-hosted runner: ***** lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."
We run our workflow on an AWS EC2 instance which is always connected and has enough resources (CPU/memory). The above failure happens even for the parts of the workflow that don't require high utilization of CPU/memory.
It seems that the runner loses communication with GitHub and does not continue running the job.
Log
[2022-08-05 09:05:47Z INFO JobServer] Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 48, failed calls: 2, websocket state: Open).
[2022-08-05 09:05:47Z ERR JobServer] System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake. ---> System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.. ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
Runner Version and Platform
Version 2.294.0, running on Windows Server 2016.