Self-hosted runners disappeared

BrightRan commented 4 years ago

Associated GitHub Community topic: https://github.community/t/disappearing-self-hosted-runners/137669

The customer has added some self-hosted runners for his repository, but the runners would completely disappear as if he never added any. When he refreshes, the runners would come back. Some would be Offline but would go back to being Idle after another refresh. Other times when he refreshes the runners disappear again. When the customer logs into the runner machines to check their status, he can see a lot of connection retries.

2020-10-13 21:12:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2020-10-13 21:14:42Z: Runner reconnected.
2020-10-13 21:15:42Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.

chingc commented 4 years ago

@BrightRan Thank you for creating this ticket. I'd like to add that I'm using v2.273.5 of the runner on a plain Amazon Linux 2 EC2 instance. I didn't experience any issues yesterday, so perhaps it was an intermittent issue on GitHub's or Amazon's end.

brandan-schmitz commented 4 years ago

I have seen this issue as well, using the same version as @chingc. Today I received an error from my build actions for a project; it seems the runner vanished from the project and left me with no runners registered. When I opened the runner on my server, it showed that it was still registered but was getting an unknown disconnect error from GitHub. All it would do was loop: start the runner service, report an unknown disconnect error, then start the runner service again.

I have since wiped the old runner, re-downloaded it, and registered it again because I needed the build pipeline running, but I'm not sure when the problem first occurred on my system.

jonnikim commented 2 years ago

This still seems to be an issue.

We had 5 runners, 4 of them were offline and 1 was idle. I disabled Actions to fix some syntax. Left it alone for ~2+ weeks and came back to see that only the idle one was remaining. The other 4 look to have been deleted. Re-enabling Actions didn't bring them back either.

I still have the directories for the other 4 runners, but trying to start them throws this error: Failed to create a session. The runner registration has been deleted from the server, please re-configure.

nikola-jokic commented 2 years ago

Hi @jonnikim,

If the runner does not get any tasks for 30 days, it is cleaned up on the service side. That might be the reason why you needed to re-configure your runner.

@brandan-schmitz, @chingc, does this help?

mhl-itm-bhg commented 2 years ago

I am experiencing a similar issue. When I attempt to run the actions-runner (run.cmd) on my machine, I get the following error: Failed to create a session. The runner registration has been deleted from the server, please re-configure. When I attempt to reconfigure the runner (config.cmd), I get the following error: Cannot configure the runner because it is already configured. To reconfigure the runner, run 'config.cmd remove' or './config.sh remove' first. When I run config.cmd remove, I'm asked to enter a runner removal token.

I have no idea where to get this token. Is there any way to reconfigure without being dependent on tokens that disappeared from the repo?

nikola-jokic commented 2 years ago

Hi @mhl-itm-bhg,

You can just remove the file named .runner in the runner's root directory, the one from which you run config.sh.
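A minimal sketch of that step, for anyone who wants to script it. It only deletes the `.runner` file named above; the function name and the `$HOME/actions-runner` path in the comments are hypothetical, and any other leftover files in the directory are not covered here.

```shell
#!/usr/bin/env bash
# Sketch: clear the local registration marker so config.sh / config.cmd can
# run again. The function name is an illustration, not part of the runner.

reset_local_registration() {
    local runner_dir="$1"
    if [ -f "$runner_dir/.runner" ]; then
        rm -f "$runner_dir/.runner"
        echo "removed $runner_dir/.runner"
    else
        echo "no .runner file in $runner_dir"
    fi
}

# Example (hypothetical path, not executed here):
#   reset_local_registration "$HOME/actions-runner"
#   cd "$HOME/actions-runner" && ./config.sh --url <repo-url> --token <registration-token>
```

After the file is gone, the runner still needs a fresh registration token to be configured again.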

nikola-jokic commented 2 years ago

Hi everyone,

Since this seems to be resolved, I am going to close this issue. If you experience this issue again, you can create a new issue or write a comment here, and we will re-open it :smile:

shishodiyas commented 2 years ago

How can we make it so that the runner doesn't get deleted?

whutchinson98 commented 2 years ago

I just experienced this issue. Is there any update on how to prevent this?

nikola-jokic commented 2 years ago

The docs now state:

A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 14 days.

@shishodiyas, @whutchinson98 you can't. One way to automate this is to use the API to fetch a registration token and register your runner again from a shell script.

sxtyxmm commented 2 years ago

I have a shell script for the same, but can you elaborate on the API use?

nikola-jokic commented 2 years ago

Of course. These docs describe how to use the API to fetch a registration token, for example: https://docs.github.com/en/rest/actions/self-hosted-runners#create-a-registration-token-for-a-repository. You can create a small script that fetches the registration token; then, when you configure your runner, you may want to add flags like --unattended and --replace.
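A rough sketch of such a script, assuming `curl` and `jq` are installed and that a personal access token with sufficient rights on the repository is exported as `GITHUB_PAT` (that variable name and both function names are assumptions for illustration). The endpoint is the repository-level registration-token endpoint from the linked docs.

```shell
#!/usr/bin/env bash
# Sketch: fetch a registration token via the REST API and re-register a runner.
# GITHUB_PAT and the function names are hypothetical; adapt them to your setup.

# Build the REST endpoint for a repository-level registration token.
registration_token_endpoint() {
    local owner="$1" repo="$2"
    printf 'https://api.github.com/repos/%s/%s/actions/runners/registration-token\n' \
        "$owner" "$repo"
}

# Request a token, then configure the runner non-interactively.
reregister_runner() {
    local owner="$1" repo="$2" runner_dir="$3"
    local token
    token="$(curl -fsSL -X POST \
        -H "Accept: application/vnd.github+json" \
        -H "Authorization: Bearer ${GITHUB_PAT}" \
        "$(registration_token_endpoint "$owner" "$repo")" | jq -r '.token')" || return 1
    [ -n "$token" ] && [ "$token" != "null" ] || return 1
    (cd "$runner_dir" && ./config.sh \
        --url "https://github.com/${owner}/${repo}" \
        --token "$token" --unattended --replace)
}

# Example (not executed here):
#   reregister_runner my-org my-repo "$HOME/actions-runner"
```

If the directory was configured before, config.sh may refuse with the "already configured" error quoted earlier in this thread, so the old local configuration has to be cleared first.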

shukriadams commented 1 year ago

@nikola-jokic From an automation point of view, this is some pretty anti-user design. Why would you auto-terminate an integration that has been down for 14 days? It's not costing GitHub anything that a runner one of us is hosting has gone idle. Some of us do projects as hobbies, we take breaks from them, we have lives. Is it really that much to ask that an automated build works again after a Raspberry Pi got accidentally unplugged for two weeks? I actually spend more time maintaining self-hosted runners than I build with them.

nikola-jokic commented 1 year ago

Hi @shukriadams,

For most enterprises, this is expected and wanted. We understand you’re not most enterprises. If you want to discuss it more:

Bring it to forums to start a product feedback.
Once again, this is not something you can change from the runner code, so bring it to the feedback page :relaxed:

newfunda commented 9 months ago

A self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 14 days. An ephemeral self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 1 day.

sxtyxmm commented 9 months ago

😂😂😂😂😂😂😂😂

On Thu, 1 Feb 2024 at 19:24, Shukri Adams @.***> wrote:

I found a good workaround.

Delete the Github self-hosted agent from your local system.

Disable Github self-hosted runner integration from the settings page on your project.

Install Jenkins on your own infrastructure.

Create a Jenkins job that builds your project.

Hope that helps.

pitoniak32 commented 9 months ago

It seems the original error was not fully addressed in this issue 😓

2020-10-13 21:12:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2020-10-13 21:14:42Z: Runner reconnected.
2020-10-13 21:15:42Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.

I am seeing this same thing on our self-hosted runners, and the only fixes I have found involve disabling IPv6. Is there another solution, or at least a workaround?

ttolbol commented 6 months ago

Just give us a setting to disable the automatic removal! It's completely ridiculous that I have to manually add a self hosted runner whenever I have to deploy an update (usually once per month). It takes me more time to go through the whole process of adding the runner again, than the time it takes to actually run the process. It didn't use to be this way. An automation tool that requires manual labour to use is not much of an automation tool.

dgiambo commented 6 months ago

This just bit me too. Can we please have a setting for this, or at least a warning of some kind. This is not a good user experience. Why is the deletion not recorded in the audit logs?

sxtyxmm commented 6 months ago

The best I could come up with was to write an automation that adds the runner again every 14 days.
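One way such an automation could decide when to act, sketched under assumptions: it reads the JSON returned by the "list self-hosted runners for a repository" REST endpoint and checks whether a runner name is still present. `jq` is required; the function name, the `GITHUB_PAT` variable, and the runner name `build-01` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a runner with the given name appears in the JSON
# produced by GET /repos/{owner}/{repo}/actions/runners (read from stdin).
# The function name is hypothetical.

runner_is_registered() {
    local name="$1"
    # jq -e exits non-zero when the final result is false or null.
    jq -e --arg name "$name" 'any(.runners[]; .name == $name)' >/dev/null
}

# Example (not executed here; GITHUB_PAT is an assumed environment variable):
#   curl -fsSL -H "Accept: application/vnd.github+json" \
#        -H "Authorization: Bearer ${GITHUB_PAT}" \
#        "https://api.github.com/repos/my-org/my-repo/actions/runners" \
#     | runner_is_registered build-01 || echo "build-01 is gone, re-register it"
```

Run from cron or a scheduled job, a check like this avoids re-registering on a fixed 14-day timer and reacts only when the runner has actually disappeared.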

tedgarb commented 6 months ago

Adding my voice to the dissatisfaction here. There are absolutely no docs on how to reset a runner once GitHub has unilaterally purged it. If GitHub insists on this design paradigm for what are supposed to be persistent self-hosted runners, I would like to request that:

The documentation at https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners?learn=hosting_your_own_runners&learnProduct=actions#adding-a-self-hosted-runner-to-an-organization actually explain that runners will be unilaterally purged.
Documentation be added on how to reset a runner once it has been registered, set up as a service, and then broken by GitHub.

oschwartz10612 commented 2 months ago

Adding support for this issue here as well! We need a setting; runners can't just be deleted because they are turned off. We don't pay for our EC2 runners to be on all the time if we are only using them once a month, and manually adding them back every time is ridiculous!

BrendenWalker commented 1 month ago

Thought I'd add that we just had multiple self-hosted runners disappear from GitHub organization configuration. Nobody else has access to GH config or the runners and I know for sure that I didn't remove them.

2 of the missing runners were running jobs 3 days ago. Strangely.. one runner is still present. No clue why just this one.

I submitted a ticket, hopefully they can pull from a backup. Access to some of these runners can be difficult, so just adding again would be a hassle.

cullenwren-volair commented 1 month ago

Very confused why this happened to my self-hosted runners. Ours are used multiple times a day, yet I've had it happen twice now that they were removed for seemingly no reason. Our runners are set up as services, and checking sudo ./svc.sh status shows they are still connected to GitHub despite having been removed? It would be nice if restarting the service, or uninstalling and reinstalling the service, allowed the runners to be re-added instead of having to reconfigure them.

BrendenWalker commented 1 month ago

I'm not sure if you can get to this or comment on it, but my ticket: https://support.github.com/ticket/enterprise/122857/2997363

Seems like it's not just me. I have created scripts to automate installation of runners and I'm now keeping the configuration stored in version control (except secrets of course) to make it easy to reinstall.

nextjsdude commented 1 month ago

It is 2024 and I face the same issue.

Can you share your automation configuration?

BrendenWalker commented 1 month ago

I'm using a workflow I call 'doorstop'. I have to manually update it with new runners but that's so far not been an issue.

Example:

name: doorstop

on:
  schedule:
    # times in UTC, standard cron format
    - cron:  '0 05 01,10,20 * *' # 5am 1st/10th/20th day of month

  workflow_dispatch:

jobs: 
  this_runner:
    runs-on: [self-hosted,thisrunner]
    steps:
      - name: Hello
        shell: powershell
        run: Write-Host "Hello World"

  that_runner:
    runs-on: [self-hosted,thatrunner]
    steps:
      - name: Hello
        shell: powershell
        run: Write-Host "Hello World"

nextjsdude commented 1 month ago

It would be good if the GitHub owner could receive an email alert when a runner is approaching 14 days, rather than having it deleted outright. How I fixed it in my case:

I tried to remove the runner configuration with config.sh, but I was asked to enter the runner token. Of course I didn't have or know it, and besides, it had already been deleted from GitHub's servers, so it just returned a 404 error no matter what token I entered.
I finally uninstalled svc and removed that runner. I created a new runner and installed svc in that runner. Long, but it worked.
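The service part of those steps might look roughly like this on Linux or macOS, where the runner ships `svc.sh`. This is a sketch, not an official procedure: the function names and the `DRY_RUN` switch are made up for illustration, and the new directory is assumed to have been configured with config.sh already. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: swap the service from a purged runner directory to a freshly
# configured one. With DRY_RUN=1 (the default) the commands are only printed.

run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

replace_runner_service() {
    local old_dir="$1" new_dir="$2"
    # Stop and uninstall the service that belongs to the purged runner.
    (cd "$old_dir" && run_step sudo ./svc.sh stop && run_step sudo ./svc.sh uninstall) || return 1
    # Install and start the service for the newly configured runner.
    (cd "$new_dir" && run_step sudo ./svc.sh install && run_step sudo ./svc.sh start) || return 1
}

# Example (hypothetical paths, not executed here):
#   replace_runner_service "$HOME/actions-runner-old" "$HOME/actions-runner"
#   DRY_RUN=0 replace_runner_service "$HOME/actions-runner-old" "$HOME/actions-runner"
```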

nextjsdude commented 1 month ago

Absolutely brilliant. Thanks man

gojimmypi commented 1 month ago

I am also encountering a problem where my self-hosted runner gets removed, and long before a 2 week unused expiration.

I have test scripts on WSL in a Windows 11 VM. I had another odd WSL error that caused my runner to crash. Only a few days later, when I noticed the action was not working and restarted the runner, I found that the runner object no longer existed on GitHub.

As this issue is closed; is there an open one on this topic, or any known reliable workarounds?

BrendenWalker commented 1 month ago

I just recently had a runner disappear after being offline for 6 days.... maybe if enough people chime in here it'll be reopened.

gojimmypi commented 1 month ago

@BrendenWalker would you happen to be on a VPN? I was asked that question & yes: I was.

I'm trying again on a non-VPN segment to see if that helps.

alvieridev commented 1 month ago

I realised the runner gets removed even before the two-week mark if the runner is not idle or active. Since your runner crashed, I'd say it was removed before 14 days because it wasn't active. A workaround that worked for me (as suggested by @BrendenWalker) is to set up a workflow that runs every two days or once a week, depending on your needs. The workflow should perform a very minimal task like echo or whoami. This gives GitHub the impression that the runner is still active. This has worked for me.

BrendenWalker commented 1 month ago

Sadly, my workaround didn't save me from the last one. It was only 6 days, so my workflow didn't have a chance to start up the VM and run an action.

I've also taken to semi-automating installation on Windows via PowerShell. If this keeps happening I'll probably deploy Ansible or some other fully automated means.

BrendenWalker commented 1 month ago

This latest failure is on an Azure VM. No VPN that I know of; however, it does not have a public IP address and ICMP traffic to the internet doesn't work, so no ping.

The config.cmd --check functionality reports failure when it can't ping some servers, even though it has already verified HTTPS access to the same servers. AFAIK HTTPS access is all that runners require, which would explain why my action runners work fine, as long as they're not booted out of the GitHub configuration.

gojimmypi commented 1 month ago

@BrendenWalker and @alvieridev there's certainly a possibility that I have an unstable network, even without the VPN.

With your experience with self-hosted runners, what do you think of this (admittedly hacky) idea:

while true; do
    timeout 6h ./run.sh  # Run for 6 hours
    echo "Restarting run.sh after 6 hours..."
done

BrendenWalker commented 1 month ago

A bit blunt, but sometimes that's necessary. I haven't had that particular issue (yet). In my case I'm running as a Windows service (so far; we have *nix runners in GitLab but haven't migrated those projects yet), and services can be set up to restart automatically. That is, IF they stop cleanly and notify the Windows SCM that they stopped ;-)

gojimmypi commented 1 month ago

fwiw, on the list of "possible solutions, but won't work for me".... is this scheduled keep-alive task.

TIL scheduled tasks only work on the main branch, which is undesired when contributing upstream via a fork. :/

name: Keep Alive

on:
  schedule:
    # Runs every hour
    - cron: "0 * * * *"

jobs:
  keep-alive:
    runs-on: self-hosted  # Ensure this runs on your self-hosted runner

    steps:
      - name: Run keep-alive task
        run: |
          echo "Running periodic keep-alive task."

Perhaps this might help someone that's ok with main branch workflow edits.

BrendenWalker commented 1 month ago

I just had GH support give me this gem:

GitHub does not remove runners.

Had to refer them to the GitHub documentation which contradicts that:

A self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 14 days. An ephemeral self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 1 day.

cullenwren-volair commented 1 month ago

A self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 14 days

I wonder whether, if the runner is registered with a particular IP address that is never reused on later connections (as can happen with VMs), GitHub will remove the runner after the 14 days despite the runner being used within that window.

BrendenWalker commented 1 month ago

Interesting theory. However, I would expect it to show offline whenever the IP address changed. That's not been the case so far in my experience. Last one was removed 6 days after running a job successfully.

I think there is a bug in the 'cleanup' logic and it's simply not functioning like it should. They just need to open source all of GitHub and I'll fix the dang thing ;-)

gojimmypi commented 1 month ago

I just had GH support give me this gem:

GitHub does not remove runners.

Well, that's false. I've had self-hosted runners go missing long before 14 days of inactivity. See screen snip, above; the last job was processed based on a commit action on 10/2, then when I tried to restart the runner on 10/8, it was gone from my GitHub account and I had to set up a new one.

I wonder whether, if the runner is registered with a particular IP address that is never reused on later connections (as can happen with VMs), GitHub will remove the runner after the 14 days despite the runner being used within that window.

Now that's an interesting hypothesis.

... GitHub Enterprise Cloud ...

I'm not an enterprise customer.

BrendenWalker commented 3 weeks ago

Hey everyone! This just in on my ticket:

I continued working on this, and opened an internal issue to track down this unexpected behaviour. Update from engineers on this unexpected deregistering of runners does identify this as a bug that was inadvertently introduced with recent updates, just as we discovered in the logs

The team identified a bug that could cause runners that were created over 14 days ago and offline for more than 1 hour could be incorrectly removed as part of our dormant runner cleanup job. It looks like your affected runner fits this criteria.

We've rolled out a mitigation to prevent this from happening any further, and are in the process of rolling out a long term fix to make sure self-hosted runner are only dormant if offline for 14 days.

that might.. just might confirm that we're not imagining things ;-)

IronSean commented 3 weeks ago

If this is true and it was an error that caused them to delete after 14 days of Idle, and they were meant to only delete after 14 days Offline, this is starting to approach a sane policy. Deleting after 14 days of inactivity (or 14 days of existence and 1 hour of inactivity) is baffling.

sakhisoufiane commented 3 weeks ago

We've had all our self-hosted runners deleted. For anyone encountering the same issue, it looks like it was a bug on GitHub's end that deleted runners it thought were dormant when in fact they were active but in an idle state.

They couldn't restore them, so we had to reconfigure these runners from scratch.

The reply we've got from support if it can help anyone:

We've found that a cleanup job for dormant runners removed this self-hosted runner on 6th Oct. A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 14 days.

However, even though it would be the expected behaviour that a runner is removed if it has remained offline for 14 days, I do appreciate that the workflows in [redacted-repo-name] are run more frequently than 14 days, and that it should not have been the case that the runners were removed.

This indicated that there was a bug that caused runners to incorrectly be marked as dormant and then removed. Please rest assured that we are already working on a fix right now, and have also already rolled out a temporary mitigation plan to prevent self-hosted runners from being prematurely removed while we work on it.

FearlessHyena commented 1 week ago

I just got hit with this problem: a runner I registered was removed after just 1 day of being offline, so it seems the fix still hasn't been rolled out.

Not having a way to disable or configure the time before runners are automatically removed is a usability issue, so I've gone ahead and created a new discussion to make it configurable: https://github.com/orgs/community/discussions/142834

Please comment and vote on the discussion so we can make this happen!

actions / runner

Self-hosted runners disappeared #756