runners from some organisations do not pick jobs

igagis commented 3 months ago

One of my github organization's runners are not picking up jobs anymore. It started recently. Runners from my other github organization do pick up jobs, those other runners are running on the same hardware as the runners from the first organization.

The runner which does not pick up the job is shown as Active in the github web UI. And the job it is supposed to pick always show

Job is about to start running on the runner: <runner-name> (organization)

until it times out and fails.

The runner itself doesn't print any logs. For example, freshly restarted runner service logs look like:

Jul 22 07:56:02 <runner-name> systemd[652]: Stopping github-act-runner '<runner-name>'...
Jul 22 07:56:02 <runner-name> systemd[652]: Stopped github-act-runner '<runner-name>'.
Jul 22 07:58:11 <runner-name> systemd[652]: Started github-act-runner '<runner-name>'.
Jul 22 07:58:12 <runner-name> runner[2009977]: Listening for Jobs: <runner-name> ( https://github.com/cppfw )

This problem could be related to #186 , but as it doesn't print any logs it is hard to say.

This could also be a problem on github side.

edit: I have enabled --trace for the runner and here are the logs it prints when I try to start the job:

Jul 22 08:11:23 stahl runner[2010079]: Http GET Request finished 202 https://pipelinesghubeus2.actions.githubusercontent.com/eMc1GyGdYp1Pn3AIiLBt1AQy39VBhA0ak6WnGv2vWqR163Rx43/_apis/distributedtask/pools/1/messages?api-version=5.1-preview&sessionId=ccdedcca-3a1f-4b20-9bcd-9b5735dcf1aa
Jul 22 08:11:23 stahl runner[2010079]: Headers:
Jul 22 08:11:23 stahl runner[2010079]: Activityid: 5d4c7dab-af15-4080-b761-d700b16a6b3f
Jul 22 08:11:23 stahl runner[2010079]: Cache-Control: no-store,no-cache
Jul 22 08:11:23 stahl runner[2010079]: Content-Length: 0
Jul 22 08:11:23 stahl runner[2010079]: Date: Mon, 22 Jul 2024 08:11:23 GMT
Jul 22 08:11:23 stahl runner[2010079]: Pragma: no-cache
Jul 22 08:11:23 stahl runner[2010079]: Server: Kestrel
Jul 22 08:11:23 stahl runner[2010079]: Strict-Transport-Security: max-age=2592000
Jul 22 08:11:23 stahl runner[2010079]: X-Frame-Options: SAMEORIGIN
Jul 22 08:11:23 stahl runner[2010079]: X-Tfs-Processid: e345953b-a8f5-481b-97b0-3d69274f0908
Jul 22 08:11:23 stahl runner[2010079]: X-Tfs-Session: 0f36162f-2a88-41f9-a349-2eafe42c05a8
Jul 22 08:11:23 stahl runner[2010079]: X-Vss-E2eid: f0f54673-25fb-4b53-b0ac-4b7d0aef3558
Jul 22 08:11:23 stahl runner[2010079]: X-Vss-Senderdeploymentid: a07ab14e-025a-39c3-8d53-788cd7ce207f
Jul 22 08:11:23 stahl runner[2010079]: Body: ``
Jul 22 08:11:23 stahl runner[2010079]: Http GET Request started https://pipelinesghubeus2.actions.githubusercontent.com/eMc1GyGdYp1Pn3AIiLBt1AQy39VBhA0ak6WnGv2vWqR163Rx43/_apis/distributedtask/pools/1/messages?api-version=5.1-preview&sessionId=ccdedcca-3a1f-4b20-9bcd-9b5735dcf1aa
Jul 22 08:11:23 stahl runner[2010079]: Headers:
Jul 22 08:11:23 stahl runner[2010079]: Accept: application/json; api-version=5.1-preview
Jul 22 08:11:23 stahl runner[2010079]: Authorization: bearer eyJ0eXAiOiJKV1<REDACTED-BY-ME>6L89puxacwyeVy3-XjY9cltQ
Jul 22 08:11:23 stahl runner[2010079]: X-TFS-FedAuthRedirect: Suppress
Jul 22 08:11:23 stahl runner[2010079]: X-TFS-Session: 3ccfaec8-71eb-4b25-86a2-2336e3eee847
Jul 22 08:11:23 stahl runner[2010079]: X-VSS-E2EID: 6bd58ec0-c217-4fc6-8449-57c17d9abd2e
Jul 22 08:11:23 stahl runner[2010079]: Body: ``
Jul 22 08:12:14 stahl runner[2010079]: Http GET Request finished 202 https://pipelinesghubeus2.actions.githubusercontent.com/eMc1GyGdYp1Pn3AIiLBt1AQy39VBhA0ak6WnGv2vWqR163Rx43/_apis/distributedtask/pools/1/messages?api-version=5.1-preview&sessionId=ccdedcca-3a1f-4b20-9bcd-9b5735dcf1aa
Jul 22 08:12:14 stahl runner[2010079]: Headers:
Jul 22 08:12:14 stahl runner[2010079]: Activityid: 5d4c2436-af15-4080-b761-d700b16a6b3f
Jul 22 08:12:14 stahl runner[2010079]: Cache-Control: no-store,no-cache
Jul 22 08:12:14 stahl runner[2010079]: Content-Length: 0
Jul 22 08:12:14 stahl runner[2010079]: Date: Mon, 22 Jul 2024 08:12:13 GMT
Jul 22 08:12:14 stahl runner[2010079]: Pragma: no-cache
Jul 22 08:12:14 stahl runner[2010079]: Server: Kestrel
Jul 22 08:12:14 stahl runner[2010079]: Strict-Transport-Security: max-age=2592000
Jul 22 08:12:14 stahl runner[2010079]: X-Frame-Options: SAMEORIGIN
Jul 22 08:12:14 stahl runner[2010079]: X-Tfs-Processid: e345953b-a8f5-481b-97b0-3d69274f0908
Jul 22 08:12:14 stahl runner[2010079]: X-Tfs-Session: 3ccfaec8-71eb-4b25-86a2-2336e3eee847
Jul 22 08:12:14 stahl runner[2010079]: X-Vss-E2eid: 6bd58ec0-c217-4fc6-8449-57c17d9abd2e
Jul 22 08:12:14 stahl runner[2010079]: X-Vss-Senderdeploymentid: a07ab14e-025a-39c3-8d53-788cd7ce207f
Jul 22 08:12:14 stahl runner[2010079]: Body: ``
Jul 22 08:12:14 stahl runner[2010079]: Http GET Request started https://pipelinesghubeus2.actions.githubusercontent.com/eMc1GyGdYp1Pn3AIiLBt1AQy39VBhA0ak6WnGv2vWqR163Rx43/_apis/distributedtask/pools/1/messages?api-version=5.1-preview&sessionId=ccdedcca-3a1f-4b20-9bcd-9b5735dcf1aa
Jul 22 08:12:14 stahl runner[2010079]: Headers:
Jul 22 08:12:14 stahl runner[2010079]: Accept: application/json; api-version=5.1-preview
Jul 22 08:12:14 stahl runner[2010079]: Authorization: bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1N<REDACTED-BY-ME>Y4s_axzt5R6L89puxacwyeVy3-XjY9cltQ
Jul 22 08:12:14 stahl runner[2010079]: X-TFS-FedAuthRedirect: Suppress
Jul 22 08:12:14 stahl runner[2010079]: X-TFS-Session: aa4e9035-35db-43f8-ae62-3590a0cccd4f
Jul 22 08:12:14 stahl runner[2010079]: X-VSS-E2EID: e023eccf-7f73-4029-ae9e-6f0d253193e9
Jul 22 08:12:14 stahl runner[2010079]: Body: ``

It keeps printing this request-started/request-finished with some interval

ChristopherHX commented 3 months ago

The runner which does not pick up the job is shown as Active in the github web UI

This is the problem, somehow your runner has been desynchronized with GitHub.

Usually I do one of the following to get out of the desynced state

Take the runner offline for some hours and Hope GitHub unset the active flag (slow)
- While the runner is online, it usually takes more time. (not tested)
- Usually I didn't wanted to wait and keep the runner offline
Force (e.g. runners with active jobs usually can't be deleted) remove the runner and reconfigure it (fast)

Such a state happend often during initial development of this runner, before it finished the job request correctly

Now the question is how did the runner of your organization end up in such a state mismatch

did it crash? My recovery logic might doesn't work anymore as of 2024 changes to GitHub backend
Service changes?
- This could be indeed related to the linked issue, but only past logs you might haven't poszed can tell

EDIT

Or is the active state as long as the job should start on that runner? and then inactive again if the service fails it after timeout

igagis commented 3 months ago

Yes, after the job fails (times out) the runner goes to idle state in web ui.

I had similar problems before, like last year, and then I just removed all the runners and registered them again, and it helped back then. But today even this does not help. I have just tried removing one of the runners and registering it again, still same stuff.

igagis commented 3 months ago

did it crash?

I don't think so, because I have about 14 runners registered in the same organization and all of them don't work

ChristopherHX commented 3 months ago

What happens if you register actions/runner? Then share your _diag folder after it has got a job

My org doesn't seem to have that backend change, no idea to be honest

igagis commented 3 months ago

actions/runner seems working fine.

Here is the _diag folder after successfully running one job.

diag.zip

ChristopherHX commented 3 months ago

Thanks for the log, yes you have been migrated to a new backend. So it is #186 in an unexpected behavior

[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Starting process:
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   File name: '/home/cppfw/actions-runner/bin/Runner.Worker'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Arguments: 'spawnclient 107 110'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Working directory: '/home/cppfw/actions-runner/bin'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Require exit code zero: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'True'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper]   High priority process: 'True'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Process started with process id 3210, waiting for process exit.
[2024-07-22 11:32:37Z INFO JobDispatcher] Send job request message to worker for job 5b715e69-dabf-5b04-21eb-12d35d1e3362.
[2024-07-22 11:32:37Z INFO ProcessChannel] Sending message of length 30778, with hash '3ab14776445536971a5bfcebf829eb1cd0d71d0ff024da67339e82b52497c3ce'
[2024-07-22 11:32:37Z INFO JobNotification] Entering JobStarted Notification
[2024-07-22 11:32:37Z INFO JobNotification] Entering StartMonitor
[2024-07-22 11:32:38Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...
[2024-07-22 11:33:28Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...
[2024-07-22 11:33:37Z INFO JobDispatcher] Successfully renew job request 181753, job is valid till 07/22/2024 11:43:37
[2024-07-22 11:34:18Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...
[2024-07-22 11:34:37Z INFO JobDispatcher] Successfully renew job request 181753, job is valid till 07/22/2024 11:44:37

Without a broken runner to debug and bisect this, I need to wait or check more orgs I have access to.

Some known differences

this runner claims to have version 3.0.0
this runner doesn't send it's version and some other info in the GetMessage Query

Those differences might let their backend decide something weird and not sending the migration message that could have been seen as unknown message in the log

igagis commented 3 months ago

@ChristopherHX would it help in analyzing this if I provide you some access to my organization? What kind of access would you need? Like token for adding new runners and a repo in the org to trigger jobs?

ChristopherHX commented 3 months ago

would it help in analyzing this if I provide you some access to my organization?

Yes this would help me a lot if your are still enrolled by GH.

I need to be able to trigger builds e.g. collabator of a test repo with push rights
At least a jit runner config e.g. created by github-act-runner --unattended --url <url> --token <token> --labels test-runner --print-jitconfig that can be triggered by the test repo (by email christopher.homberger@web.de or via issues if the test repo is private)
If repository runner are broken too admin access to a single test repo without anything else is enough

Like token for adding new runners

The problem with that is the expire time of ca. 1h

igagis commented 3 months ago

I have created a test repo in my org and invited you there as maintainer. Also, I have added a test runner. Is it safe if I share the jitconfig string right here publicly?

Also, I tested if the runner is added to one repo only, it works. So, I added the test runner to the org to reproduce the problem.

igagis commented 3 months ago

@ChristopherHX sent the jitconfig string to you by email.

igagis commented 3 months ago

I'm not running the test_runner anymore, so you can start it yourself using jitconfig string I sent you

ChristopherHX commented 3 months ago

Thank you, I will now check if the credentials are working on my end.

Trying this first by feeding jitconfig to actions/runner then debug my runner

ChristopherHX commented 3 months ago

I'm one step closer

Body: `{"messageType":"BrokerMigration","body":"{\"brokerBaseUrl\":\"https://broker.actions.githubusercontent.com\"}"}`

Now need to implement polling that url

ChristopherHX commented 3 months ago

GH will fool me, we have now redirect after redirect

{"messageId":8422780923603100092,"messageType":"RunnerJobRequest","body":"{\"runner_request_id\":\"O_kgDOBDHdtA-181879\",\"run_service_url\":\"\"}"}

BrokerMigration
RunnerJobRequest
tbd

ChristopherHX commented 3 months ago

It worked now somehow:

https://github.com/cppfw/testrepo/actions/runs/10079029467/job/27868742920

ChristopherHX commented 3 months ago

Http POST Request finished 404 https://broker.actions.githubusercontent.com/renewjob
Headers: 
Content-Length: 20
Content-Type: text/plain; charset=utf-8
Date: Wed, 24 Jul 2024 16:02:32 GMT
Server: github.com
X-Github-Backend: Kubernetes
X-Github-Request-Id: E0FA:6E1D7:3E8E95A:3F1DB02:66A12555

Body: `Not found: /renewjob`

Some code in my experiment are not working...., probably used the wrong domain of the new url chaos

ChristopherHX commented 3 months ago

Experiment here: https://github.com/ChristopherHX/github-act-runner/tree/experimental-broker-phase-1

Also a panic has been seen, stability uncertain

igagis commented 3 months ago

reverse engineering live :) no hurry ;)

igagis commented 3 months ago

I have just updated my runners, now it seems working :)! Thanks for quick turnover on this :)

igagis commented 3 months ago

@ChristopherHX do you still need the test_runner for any further experiments? Or can I remove it?

ChristopherHX commented 3 months ago

Yes you can remove the test_runner, this change is currently disabled again server side

They are much faster at Rollback than in a consistent Rollout

Trivia

Yesterday GitHub has released runner 2.318, two hours later they have rolled back this change again until they allow runner 2.317.0 at least for 30days from yesterday to receive jobs.

Your previous issue were also this breaking change, that was rolled back within 2days because 2.317 become 30days old xd

Now they need to fix their runner deprecation code and start the rollout again

igagis commented 3 months ago

Oh, looks like they are having a mess there :). Ok, I will remove the test_runner. Thanks again!

ChristopherHX / github-act-runner

runners from some organisations do not pick jobs #187