actions/runner

Add ability to opt out of agent auto-update #246

Closed · myoung34 closed this issue 2 years ago

myoung34 commented 4 years ago

Describe the enhancement

Running the agent inside Docker with auto-update enabled causes the container to throttle and never succeed unless you manage PID 1 yourself.

It would be nice to have something in config.sh or run.sh, possibly environment variables, to opt out of auto-update.

Code Snippet

n/a

Additional information

n/a

bryanmacfarlane commented 4 years ago

We're also looking at producing an image as part of the runner build, so if you're doing single-use container builds you never hit a required update; you just pick up the container with the runner version in it that the service demands.

Question: are you running single-use containers, or does the runner/container lifetime span jobs and builds in your scenario?

myoung34 commented 4 years ago

@bryanmacfarlane it spans

https://github.com/myoung34/docker-github-actions-runner

bryanmacfarlane commented 4 years ago

Instead of the entrypoint running run.sh (interactive), it should probably run runsvc.sh, which handles the update return codes. That, coupled with an image released with every runner version, would probably eliminate the issue.

https://github.com/actions/runner/blob/master/src/Misc/layoutbin/runsvc.sh
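
A minimal sketch of that suggestion, assuming a typical container setup: the config.sh flags are real, but ORG, REPO, RUNNER_TOKEN, and the runsvc.sh location are placeholders (newer releases ship it as bin/runsvc.sh).

```bash
#!/bin/bash
# entrypoint.sh -- hypothetical container entrypoint
set -e

# Register the runner on first start (URL and token values are placeholders).
./config.sh --unattended \
  --url "https://github.com/${ORG}/${REPO}" \
  --token "${RUNNER_TOKEN}"

# run.sh is meant for interactive use and exits when the service demands an
# update; runsvc.sh handles the update return code and keeps the runner
# running, so exec it as the container's long-running process.
exec ./runsvc.sh
```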

bryanmacfarlane commented 4 years ago

^^ I'm working on some docker containers to be part of the build - I'll look over your files

myoung34 commented 4 years ago

just make sure the dockerfiles work for arm/arm64 please 😃

bryanmacfarlane commented 4 years ago

https://github.com/actions/runner/issues/239

SvanBoxel commented 4 years ago

I have a customer requesting this. At the moment the self-hosted runner self-updates, making it potentially unpredictable for production workloads.

ioquatix commented 4 years ago

Please don't include docker by default. I already run this inside a container.

ioquatix commented 4 years ago

Here are my suggestions:

bryanmacfarlane commented 4 years ago

It's the service that demands updates and pulls the runner forward. There are new capabilities and features on a cloud service that's rolling forward, and those need changes in the runner. Even if an update is demanded only when you use a new feature, it still means a runner can get pulled forward.

So, when it gets pulled forward: if the container is referencing ./runsvc.sh, it will handle the update code and move forward. If you're using a container built with the latest runner, you will minimize the frequency of updates.

Also note that if we build some containers and release them alongside the runner, that doesn't preclude you from building your own containers as you're doing now. It would be purely a convenience for others.

ioquatix commented 4 years ago

I'd rather it stopped working than auto updated.

bryanmacfarlane commented 4 years ago

That's possible as well. Call ./run.sh; it will exit and not handle the update code. However, please be aware that all your workflows will stop running until you build another container or manually configure all your self-hosted runners again, and you will only be able to do that with the version the service demands, the one you didn't want to auto-update to. So I'm not sure what stopping buys you.

Note that the runner is a simple step runner. The YAML and the orchestration run service side (that's how things like matrixing over sets with outputs work), and the service delegates steps to this app. We're also not talking about the hosted images, dev tools, etc., which usually affect the work being done; the tools in the container or on the self-hosted runner will stay the same.

bryanmacfarlane commented 4 years ago

☝️ Also note that we are painfully aware of how this affects container scenarios and building your own. We're thinking about that so I don't want to discount the problem here. Just trying to explain how it works and what the current state is. Hope that helps.

ioquatix commented 4 years ago

I understand where you are coming from.

With CI, you are running a lot of untrusted code so maybe my argument doesn't hold much water.

That being said, you can isolate untrusted code using KVM, run it as an unprivileged user, trash the results.

However, the way the system is currently designed and distributed makes this difficult.

One would normally expect a package to have a clean split between "executables/assets" and "run-time state". For whatever reason, this package doesn't have that. Not even the configuration is separated out.

Ideally, you'd install something like github-actions into /usr/bin and start it by running github-actions --work-dir=/tmp/github-actions or something similar. It would read a per user and per system configuration file, or maybe accept a command line argument to a configuration file if required.
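
A rough sketch of the invocation being described here; this is entirely hypothetical and not how the runner currently installs or behaves.

```bash
# Immutable executable owned by the package manager:
/usr/bin/github-actions --work-dir=/tmp/github-actions \
                        --config=/etc/github-actions/runner.conf
# All mutable state (credentials, logs, downloaded tools) would live under
# the work dir, so the files installed by the package never change at run time.
```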

Some of the consequences of this design are:

So, github-actions necessitates containers and other isolation constructs because of its design choices.

Auto-update makes the package unpredictable. Some package systems provide verification functions to check whether there are errors on the file system or whether something has been manipulated. Some package managers provide a permission-check system that validates file permissions. Some provide file-ownership queries and other operations. If the package files are changing, these functions stop working. Additionally, it makes the package hard to remove if its files are not known. For example, removing this package on Arch Linux boils down to rm -rf, which is kind of rude.
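
For instance, on Arch Linux the kind of verification being referred to looks roughly like this (the package name here is hypothetical):

```bash
# Verify file sizes, permissions, modification times, and checksums against
# what the package manager recorded at install time:
pacman -Qkk github-actions-runner

# List every file the package owns, which is what makes clean removal possible:
pacman -Ql github-actions-runner

# Both checks stop being meaningful once the runner has overwritten its own
# files through auto-update.
```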

So, my advice is:

ioquatix commented 4 years ago

Oh, I also wanted to add: auto-update is a massive target for black hats, so I completely agree with @SvanBoxel - hard to audit, hard to reproduce, hard to validate, etc.

bryanmacfarlane commented 4 years ago

We'll reflect on ☝️ for a bit. But note that there's a ton of separation right now: the runner knows nothing about YAML, the language, or other service-side/user features. That's all compiled/processed service side into the simplest "job" message, which is one node of the whole run graph (a set of steps). Those use contexts and are passed through fairly opaquely to the runner to minimize changes.

However, there are times when a feature is added that requires runner changes. As an example, JS or container actions can have post-job scripts, and we're about to add pre-job scripts. Another example would be sidecar containers for tests (expressed in YAML but executed by the runner).

So there are actually multiple moving parts: (1) the service, (2) the runner, and (3) the actions that third parties are releasing. :o

So now, when an action packages a pre/post script (start an emulator, shut it down, etc.), its author expects it to run; and even though the outer workflow is run service side, actions run VM side and the steps are run by the runner.

In Azure (where this code started) we actually did something really complicated. We allowed the runners to stay back and only get updated when the user used a new service-side feature that demanded a new runner, and we also allowed action authors to demand a minimum version of the runner. The service would enumerate the actions and features used, compute the minimum version, and, if the runner was too old, send an update message. The runner still got updated, just less often and much less predictably. Task authors couldn't figure it out, didn't get it right, and it led to really hard-to-debug situations.

I think the fundamental design decision is that the YAML is not run by the runners. It's run service side by a durable, stateful orchestration which dishes out work to runners. That has a lot of benefits for things like parallel graphs of nodes which dish out simple instructions to a runner. But it does mean that the runner binary is an RPC remote extension of that orchestration, and actions are separate from that.

Completely open to refactoring the runner and ideas. But I think that the root of the problem is the multiple moving parts with the other two moving forward and getting new features. :)

If you got this far, thanks for reading my wall of text. Hope it provides some background.

ioquatix commented 4 years ago

Yeah, I can understand what you are saying.

I agree it's not simple and to be honest, I see both sides of that coin.

But as someone who was trying to put together a package for users, this was non-trivial to deal with.

Right now, all the files are installed into /var/lib/github-actions and owned by the user of the same name. It's not ideal.

So somehow we need to find a middle ground that allows someone like myself to package it up in a more traditional way, but also support things like on the fly updates.

Whether you just vendor all the self-updating code into the runtime directory is kind of up to you/the engineers behind the code, but at least it should be possible. In essence, the package should install the core code, the system integration, the user account setup, and the systemd units, and pull in dependencies which may be provided by the system (Node.js, C#, etc.). Then you might start up an instance, and it would self-update its own scripts into the runtime directory; but this is largely state which could be wiped clean without affecting the package, reinstalling it, running the update process in the future, etc.

myoung34 commented 4 years ago

It doesn't have to stop working entirely if you show deprecation warnings saying "a new version is available" for one version.

  1. Running version X
  2. Get an alert: "version Y is available"
  3. Both X and Y work
  4. Z is released
  5. X stops working
  6. Y starts showing "version Z is available"

This would work for most scenarios. Deprecation is hard, for sure, but everyone would have an upgrade window in this scenario.

bryanmacfarlane commented 4 years ago

We follow the model above when we (rarely) deprecate features. When we do it, that process ☝️ spans months.

However, this is not about deprecation. This is about new features getting added to the service (which users adopt as soon as they're released, e.g. a new YAML step construct) and the runner having to add new code (not deprecate anything) for that feature to work. This is also about the runner adding new features to support new capabilities for action authors (and, in turn, those who use those actions and maintain runners).

SvanBoxel commented 4 years ago

An important consideration I would like us to think about is that companies in some industries are simply not compliant if they run software that hasn't been explicitly verified/validated. So having a service that runs on your own infrastructure that automatically updates could mean your business is not compliant.

An explicit upgrade process, or a way to opt out of automatic updates, would resolve this compliance conflict. As an example, software used within companies that require FDA compliance must go through this process before a service is upgraded: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?CFRPart=820&showFR=1&subpartNode=21:8.0.1.1.12.7

myoung34 commented 4 years ago

☝️ is what I was going to follow up with. All patches need to have a change control process of "some sort" in some regulatory environments

bryanmacfarlane commented 4 years ago

Yep. We've worked with those regulated environments in both Azure and GitHub.

In those scenarios, the other parts do not move forward either. Everything is controlled or air-gapped, including the on-premises server (TFS at MS, the upcoming GHES at GitHub). It's not just the runner that's under scrutiny; the tasks/actions were also scrutinized and controlled.

In those scenarios, the runner (and actions) do not update since the service is also controlled. We also allowed for explicit but convenient update of the runner by placing the runner in a configured location.

Even in on-premises, non-air-gapped installations, the service and actions do not continuously update, so the runners can stay back. The minimum runner version is recorded and controlled server/service side.

Which goes back to the key point: if the other parts (service and actions) move forward then the code that executes those features has to move forward.

bryanmacfarlane commented 4 years ago

Note that I think we should still look at investing in options here. The points above are clarifying where we currently are with the current model.

bryanmacfarlane commented 4 years ago

Also note that in the short term, if you are running the runner in a container, you should call the ./runsvc.sh entrypoint and not ./run.sh, which is meant for interactive running. Calling ./runsvc.sh will at least handle the update and not exit.
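
If rebuilding the image isn't convenient, one possible stopgap is overriding the entrypoint at run time; the install path and script location below are assumptions and vary between runner versions (runsvc.sh was at the root, later under bin/).

```bash
# Hypothetical override so the service wrapper, not the interactive script,
# supervises the runner inside an existing image:
docker run --rm --entrypoint /actions-runner/bin/runsvc.sh my-runner-image
```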

In the medium term, we also want to support .container in the YAML when running in K8s, so at least the container your job runs in (the steps) can stay static and be decoupled from the runner. That also means you don't have to build the runner into your containers. That's how it works with VM self-hosted runners: the runner on the host execs steps in the container you bring.

Finally, we should consider refactoring, as @ioquatix suggested, into the base runner (listener) and the steps portion, which is derived from the service-side YAML orchestration.

ioquatix commented 4 years ago

@bryanmacfarlane one thing I considered several months ago was writing a compatible daemon in Ruby just for fun. Do you have any published specs for the GitHub Actions runner? I started reading through the C# source code but... trying to derive a protocol/specification by reading through C# is not really my thing.

bryanmacfarlane commented 4 years ago

@ioquatix Interested in Go? :) We have considered doing that for more platform coverage. We have some design docs in this repo and I could share others, but honestly a discussion might be the most productive if you're interested. You could also watch the wire with Fiddler or Charles Proxy in the middle; it's mostly just JSON over HTTP. I'm also interested in your ideas for making it more in line with the Unix patterns you mentioned above. Send me an email: it's my handle, at github.

pietroalbini commented 4 years ago

Another scenario where disabling auto-updates is needed is when dealing with forks of the runner. I'm preparing the deployment of a fork to address some security issues we have when using the runner in our environment, and an update would roll back the mitigations we put in place.

We currently have a patch that should prevent updates, but having upstream support would be better.

ioquatix commented 4 years ago

@bryanmacfarlane I emailed you but never heard back...?

npalm commented 3 years ago

Last week we had several problems with the auto update behavior of the runner:

  1. Jobs for an old runner got stuck, #575
  2. Updating with pre-release version, #581
arash-bizcover commented 3 years ago

@bryanmacfarlane How can I pass --once to the runsvc.sh ?

I'm using ./run.sh --once, which helps me get clean new containers after each build.

BTW, on the latest release (v2.272.0), runsvc.sh is inside bin: bin/runsvc.sh

mhart commented 3 years ago

Just chiming in here to also ask for an option to still receive build messages on older versions without forcing an update.

Specifically for use cases where the binaries are installed in read-only directories – auto-updates are just impossible in this case.

bryanmacfarlane commented 3 years ago

Not in the plans that I know of. Please read the whole discussion in this issue on the current approach. TL;DR: the service runs the workflow, not the runner, and the service moves forward, so this isn't about back compat; it's about new features on the service which roll forward and need runner changes. So if the runner does update itself at some point (regardless of policy, controls, etc.), it can't be in a read-only directory. I'm also not saying this can't change in the future; this issue discussion hopefully enumerates how it currently works and the current limitations.

@hross is now the lead for the runner, so I'm adding him to this (long) discussion as an FYI.

mhart commented 3 years ago

At the moment I mock the upgrade (i.e., when an AgentRefresh message comes in, I restart the agent session with the updated version so that the runner can still continue to receive messages). Then I apply the update manually when time permits.

The chance of a user using a new feature in this time window is very low, so the impact should be low – massive breaking changes notwithstanding. Worst case, a build fails and the update needs to be applied there and then. But (in my case at least) that's not as bad as builds suddenly being unable to proceed at all (with no UI feedback, AFAICT).

Obviously not sanctioned, hence why I'm adding my +1 for this to be a supported option, but I have no other choice in read-only environments.

bryanmacfarlane commented 3 years ago

@mhart The worst case isn't a failed build. The worst case would be (1) using a new capability (a feature in the YAML) that silently isn't honored, but the service sent it to the runner because it believed the runner could handle it, or (2) an abandoned job (the runner goes away): e.g., the runner code keeps perfect back compat but can't handle a future change the server sent in a job message, which it sent because it believed the runner to be up to date. The other odd failures you could see would be actions that don't work because they expect a variable or something to be set. So there are actually three moving parts here: the service moves forward and actions move forward, and the runner works with both of those. GHES is the answer for not moving forward (or controlling how everything moves forward).

mhart commented 3 years ago
  1. That's the failed build I was talking about (i.e., the runner will fail)
  2. I'm a bit unclear here – can you explain why it doesn't result in a failed build?
arash-bizcover commented 3 years ago

Multiple versions should be supported; this is a very basic requirement of any client-server app. There should be multiple server versions running on GitHub's side.

lomkju commented 3 years ago

The runner updates are super slow even when using runsvc.sh, which clogs up our pipeline. Also, we need to use custom AMIs, which require Packer bakes every time there is an update to the runner. From a security standpoint, I wouldn't want anything running on my host machines without an audit.

freekvh commented 3 years ago

I'd like to add my two cents: we submit runners on a batch-processing cluster, and the runners never come back up after updating and stopping themselves. They report idle without taking any jobs for quite some time before actually going offline. The only thing that works is to kill the runner, start it on the CLI, wait for it to fully update, stop it, and resubmit it... repeated for every update.

After reading this thread, I think I should also be using ./bin/runsvc.sh.

ldub commented 3 years ago

To pile onto this issue with another use case: we use NixOS to host our services, and that requires using patchelf (a small utility to modify the dynamic linker and RPATH of ELF executables) to run binaries. Auto-updating breaks this because the new downloaded binaries are not patched and can't run.
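
For context, the fix-up that an auto-update silently undoes looks something like this; paths are illustrative, and on NixOS the real interpreter path comes from the Nix store.

```bash
# Point the runner's ELF binaries at NixOS's dynamic linker, since the default
# /lib64/ld-linux-x86-64.so.2 path does not exist on NixOS:
patchelf --set-interpreter \
  /nix/store/<hash>-glibc/lib/ld-linux-x86-64.so.2 \
  ./bin/Runner.Listener

# An auto-update replaces the binaries with unpatched copies, which then fail
# to start until the patching is repeated.
```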

@bryanmacfarlane Is there a roadmap/release cycle that you follow? If we could have a few days' warning (in the happy case) before a new GitHub Actions runner release, that would be really helpful.

matthewhembree commented 3 years ago

I'm a bit concerned about the runner updating to a pre-release version. When I build my container to get around the auto-update autokill, I query https://api.github.com/repos/actions/runner/releases/latest. Currently, this returns a version that needs to update itself.
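
One way to do that pinning at image build time, as a sketch: the API endpoint is the one mentioned above and the asset naming follows the published releases, but jq availability and the target platform (linux-x64) are assumptions.

```bash
# Resolve the most recent release tag, then download that exact version so the
# image contents are reproducible:
RUNNER_VERSION=$(curl -fsSL https://api.github.com/repos/actions/runner/releases/latest \
  | jq -r '.tag_name' | sed 's/^v//')

curl -fsSL -o actions-runner.tar.gz \
  "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz"

# Note: as described above, "latest" can still lag behind the version the
# service actually demands.
```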

tetchel commented 3 years ago

I also did a double-take at the "pre-release" tag on the last 3 versions: v2.277.0, v2.277.1, and v2.278.0. It's not clear to me why those specifically are pre-releases.

As far as I can tell, only the Linux arm/arm64 builds are pre-release, but those were added prior to the listed versions.

matthewhembree commented 3 years ago

Promises 👀 Promises 🔍
Promises 🙄

pietroalbini commented 3 years ago

Rust's builders also got stuck yesterday when a pre-release was pushed as an update to our runners. We can't use auto-updates for security reasons.

nehagargSeequent commented 2 years ago

Is there any update on this issue? I am using Azure Kubernetes Service to host the runners, and every time a build is triggered the runners first go offline to auto-update, which is super annoying.

bryanmacfarlane commented 2 years ago

@pietroalbini I think you may be confusing the runner application (this repo) with the hosted images. Even if you could pin this application, the OS images and all the toolsets slide underneath it, so I don't think this issue will address your issue and others' concerns.

pietroalbini commented 2 years ago

@bryanmacfarlane The reason Rust can't use auto-updates is not related to the hosted images; it's a byproduct of running a custom runner fork as a workaround for another runner limitation.

The main problem Rust has with the self-hosted runner is https://github.com/actions/runner/issues/494 (the runner executing builds from PRs, causing security issues), and to work around it we're using a fork of the runner that rejects any build that doesn't come from a push event. In isolation this works fine and avoids the security issue, but running that fork means we can never use the auto-update feature: any auto-update would update to the upstream version of the runner, which doesn't have the security protections we added to our fork.

In our fork we also added a change to prevent auto-updates, but since the GitHub backend only sends jobs to a runner if it's up to date, preventing auto-updates that way effectively disables the runner whenever a new runner release is made. We have to watch for new releases and rebase our fork before the update rolls out to our organization, and when we don't get around to it fast enough our queue stops, which is annoying.

patryk-kozak-branch commented 2 years ago

@bryanmacfarlane @hross I feel that for v2.280.1 and v2.280.2 there was at least a couple of days' grace period after release, as runners with an older version could still connect and work; we didn't need to upgrade them immediately. However, the version released today, v2.280.3, broke runners and CI for the entire company almost immediately (~5 hours after release). Is there some kind of default schedule for releases before they get applied and start forcing runners to be upgraded?

Do you have plans to create some kind of communication channel for upgrades? With a grace period to upgrade? Or to support the last X versions for compatibility? Sorry for all the different questions, but knowing the answers to them is critical to the real-world applicability of GitHub CI and its usefulness 😉

TingluoHuang commented 2 years ago

@patryk-kozak-branch Sorry about what you experienced. Do you have any runner diag logs still available for me to check? I would like to understand why the 2.280.3 upgrade broke everything for you. 🙇

jenschelkopf commented 2 years ago

Customer feedback issue: https://github.com/github/customer-feedback/issues/3439

myoung34 commented 2 years ago

@jenschelkopf Is that private? I'm unable to see that issue.