Closed igagis closed 3 months ago
The runner which does not pick up the job is shown as Active in the github web UI
This is the problem, somehow your runner has been desynchronized with GitHub.
Usually I do one of the following to get out of the desynced state
Such a state happened often during the initial development of this runner, before it finished the job request correctly
Now the question is how the runner of your organization ended up in such a state mismatch
EDIT
Or is the runner shown as active only while the job is supposed to start on it, and then inactive again once the service fails the job after the timeout?
Yes, after the job fails (times out) the runner goes to idle state in web ui.
I had similar problems before, like last year, and then I just removed all the runners and registered them again, and it helped back then. But today even this does not help. I have just tried removing one of the runners and registering it again, still same stuff.
did it crash?
I don't think so, because I have about 14 runners registered in the same organization and none of them work
What happens if you register actions/runner? Then share your _diag folder after it has got a job
My org doesn't seem to have that backend change, no idea to be honest
actions/runner seems to be working fine.
Here is the _diag folder after successfully running one job.
Thanks for the log. Yes, you have been migrated to a new backend, so this is #186 showing up in an unexpected way
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Starting process:
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] File name: '/home/cppfw/actions-runner/bin/Runner.Worker'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Arguments: 'spawnclient 107 110'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Working directory: '/home/cppfw/actions-runner/bin'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Require exit code zero: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: ''
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'True'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Persist current code page: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] High priority process: 'True'
[2024-07-22 11:32:37Z INFO ProcessInvokerWrapper] Process started with process id 3210, waiting for process exit.
[2024-07-22 11:32:37Z INFO JobDispatcher] Send job request message to worker for job 5b715e69-dabf-5b04-21eb-12d35d1e3362.
[2024-07-22 11:32:37Z INFO ProcessChannel] Sending message of length 30778, with hash '3ab14776445536971a5bfcebf829eb1cd0d71d0ff024da67339e82b52497c3ce'
[2024-07-22 11:32:37Z INFO JobNotification] Entering JobStarted Notification
[2024-07-22 11:32:37Z INFO JobNotification] Entering StartMonitor
[2024-07-22 11:32:38Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...
[2024-07-22 11:33:28Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...
[2024-07-22 11:33:37Z INFO JobDispatcher] Successfully renew job request 181753, job is valid till 07/22/2024 11:43:37
[2024-07-22 11:34:18Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...
[2024-07-22 11:34:37Z INFO JobDispatcher] Successfully renew job request 181753, job is valid till 07/22/2024 11:44:37
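The check done by eye on the log above can be sketched as a small script. This is only an illustration, assuming the `_diag` lines keep the `[timestamp LEVEL Component] message` shape shown here; the function name `broker_migration_timestamps` is hypothetical and not part of actions/runner or github-act-runner.

```python
import re

# Assumed line shape, taken from the excerpt above:
# [2024-07-22 11:32:38Z INFO MessageListener] BrokerMigration message received. ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[\d-]+ [\d:]+Z) (?P<level>\w+) (?P<component>\w+)\] (?P<msg>.*)$"
)

def broker_migration_timestamps(lines):
    """Return the timestamps of MessageListener lines that mention BrokerMigration."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m["component"] == "MessageListener" and "BrokerMigration" in m["msg"]:
            hits.append(m["ts"])
    return hits

# Three lines copied from the log above.
sample = [
    "[2024-07-22 11:32:37Z INFO JobNotification] Entering StartMonitor",
    "[2024-07-22 11:32:38Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...",
    "[2024-07-22 11:33:37Z INFO JobDispatcher] Successfully renew job request 181753, job is valid till 07/22/2024 11:43:37",
]
print(broker_migration_timestamps(sample))
```

A non-empty result would suggest that the runner registration has been moved to the broker backend; an empty one does not prove the opposite.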
Without a broken runner to debug and bisect this, I need to wait or check more orgs I have access to.
Some known differences
Those differences might make their backend decide something weird and not send the migration message, which would otherwise have shown up as an unknown message in the log
@ChristopherHX would it help in analyzing this if I provide you some access to my organization? What kind of access would you need? Like token for adding new runners and a repo in the org to trigger jobs?
would it help in analyzing this if I provide you some access to my organization?
Yes, this would help me a lot if you are still enrolled by GH.
github-act-runner --unattended --url <url> --token <token> --labels test-runner --print-jitconfig
that can be triggered by the test repo (by email christopher.homberger@web.de or via issues if the test repo is private)
Like token for adding new runners
The problem with that is the expiry time of about 1 h
I have created a test repo in my org and invited you there as maintainer. Also, I have added a test runner. Is it safe if I share the jitconfig string right here publicly?
Also, I tested that if the runner is added to one repo only, it works. So, I added the test runner to the org to reproduce the problem.
@ChristopherHX sent the jitconfig string to you by email.
I'm not running the test_runner anymore, so you can start it yourself using the jitconfig string I sent you
Thank you, I will now check if the credentials are working on my end.
Trying this first by feeding the jitconfig to actions/runner, then I will debug my runner
I'm one step closer
Body: `{"messageType":"BrokerMigration","body":"{\"brokerBaseUrl\":\"https://broker.actions.githubusercontent.com\"}"}`
Now I need to implement polling that URL
GH is fooling me, we now have redirect after redirect
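Note that the `body` field in the message above is itself a JSON document encoded as a string, so it has to be decoded twice. A minimal sketch of extracting the broker URL, using only the message shown in this thread; `broker_base_url` is a hypothetical helper name, not the runner's actual code:

```python
import json

def broker_base_url(raw_message):
    """Return brokerBaseUrl from a BrokerMigration message, or None for other types."""
    outer = json.loads(raw_message)
    if outer.get("messageType") != "BrokerMigration":
        return None
    # The body is a JSON string inside JSON, so decode it a second time.
    inner = json.loads(outer["body"])
    return inner.get("brokerBaseUrl")

# The message body exactly as captured in the trace above.
raw = r'{"messageType":"BrokerMigration","body":"{\"brokerBaseUrl\":\"https://broker.actions.githubusercontent.com\"}"}'
print(broker_base_url(raw))
```

How the returned URL is then polled (paths, query parameters, authentication) is not covered here, because the thread does not show it.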
{"messageId":8422780923603100092,"messageType":"RunnerJobRequest","body":"{\"runner_request_id\":\"O_kgDOBDHdtA-181879\",\"run_service_url\":\"\"}"}
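The two message shapes seen so far can be told apart by `messageType`. The following is a rough sketch based only on the two samples in this thread; the tuple layout and the name `classify_message` are made up for illustration, and anything else is reported as unknown rather than guessed at:

```python
import json

def classify_message(raw_message):
    """Classify a listener message by its messageType (only two known shapes)."""
    msg = json.loads(raw_message)
    kind = msg.get("messageType")
    if kind == "RunnerJobRequest":
        body = json.loads(msg["body"])  # body is JSON encoded as a string
        return ("job", body.get("runner_request_id"), body.get("run_service_url"))
    if kind == "BrokerMigration":
        body = json.loads(msg["body"])
        return ("migrate", body.get("brokerBaseUrl"), None)
    # Do not try to decode bodies of message types not seen in this thread.
    return ("unknown", kind, None)

# The job request exactly as captured above; note the empty run_service_url.
job = r'{"messageId":8422780923603100092,"messageType":"RunnerJobRequest","body":"{\"runner_request_id\":\"O_kgDOBDHdtA-181879\",\"run_service_url\":\"\"}"}'
print(classify_message(job))
```

Reporting unknown types explicitly would at least make an unhandled message visible in the log instead of it being dropped silently.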
It worked now somehow:
https://github.com/cppfw/testrepo/actions/runs/10079029467/job/27868742920
Http POST Request finished 404 https://broker.actions.githubusercontent.com/renewjob
Headers:
Content-Length: 20
Content-Type: text/plain; charset=utf-8
Date: Wed, 24 Jul 2024 16:02:32 GMT
Server: github.com
X-Github-Backend: Kubernetes
X-Github-Request-Id: E0FA:6E1D7:3E8E95A:3F1DB02:66A12555
Body: `Not found: /renewjob`
Some of the code in my experiment is not working, I probably used the wrong domain in the new URL chaos
Experiment here: https://github.com/ChristopherHX/github-act-runner/tree/experimental-broker-phase-1
Also a panic has been seen, stability uncertain
reverse engineering live :) no hurry ;)
I have just updated my runners, now it seems to be working :)! Thanks for the quick turnaround on this :)
@ChristopherHX do you still need the test_runner for any further experiments? Or can I remove it?
Yes, you can remove the test_runner, this change is currently disabled again server side
They are much faster at rollback than at a consistent rollout
Trivia
Yesterday GitHub released runner 2.318; two hours later they rolled this change back again, until runner 2.317.0 has been allowed to receive jobs for at least 30 days from yesterday.
Your previous issue was also caused by this breaking change, which was rolled back within 2 days because 2.317 became 30 days old xd
Now they need to fix their runner deprecation code and start the rollout again
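The 30-day window mentioned above boils down to simple date arithmetic. This is only a sketch of that arithmetic, assuming the window starts on the day the newer runner version is released; it is not GitHub's deprecation code, and the function name and the dates are hypothetical:

```python
from datetime import date, timedelta

def may_receive_jobs(superseded_on, today, grace_days=30):
    """True while an older runner version is still inside its grace window.

    superseded_on is the assumed release date of the newer runner version.
    """
    return today <= superseded_on + timedelta(days=grace_days)

# Hypothetical dates, for illustration only.
newer_release = date(2024, 7, 1)
print(may_receive_jobs(newer_release, date(2024, 7, 31)))  # last day of the window
print(may_receive_jobs(newer_release, date(2024, 8, 1)))   # one day too late
```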
Oh, looks like they are having a mess there :). Ok, I will remove the test_runner. Thanks again!
The runners of one of my GitHub organizations are not picking up jobs anymore. It started recently. Runners from my other GitHub organization do pick up jobs, and those runners are running on the same hardware as the runners from the first organization.
The runner which does not pick up the job is shown as Active in the github web UI. And the job it is supposed to pick always show
until it times out and fails.
The runner itself doesn't print any logs. For example, freshly restarted runner service logs look like:
This problem could be related to #186, but as the runner doesn't print any logs it is hard to say.
This could also be a problem on github side.
edit: I have enabled --trace for the runner and here are the logs it prints when I try to start the job:
It keeps printing this request-started/request-finished with some interval