Convert Mirror Repo strategy to self-hosted github Runners

Gregory-Pereira commented 5 months ago

The current repo mirror strategy to drive builds down is not scalable. We should move to self-hosted GitHub runners, where we can mount the models, stored on persistent storage, to the filesystem so that our tests will not run out of storage and will not be flaky due to multi-gigabyte model downloads. Even if we could limp along with our current solution, swapping to this strategy will be a requirement for testing our multi-model feature in the llamacpp_python model_server.
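
A minimal sketch of what "mounting the models" could look like from a test's point of view. It assumes the runner exposes the persistent storage at a hypothetical `/mnt/models` directory, overridable through a hypothetical `MODELS_DIR` environment variable; neither name, nor the helper functions, exist in this repo today.

```python
import os
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical mount point for the persistent model storage on the runner.
DEFAULT_MODELS_DIR = "/mnt/models"


def resolve_model(name: str, models_dir: Optional[str] = None) -> Path:
    """Return the path of a pre-mounted model file, or raise if it is missing.

    No download is attempted: the point of the mount is that tests never
    pull multi-gigabyte files over the network.
    """
    base = Path(models_dir or os.environ.get("MODELS_DIR", DEFAULT_MODELS_DIR))
    path = base / name
    if not path.is_file():
        raise FileNotFoundError(f"model {name!r} not found under {base}")
    return path


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30
```

A test would call `resolve_model("<model file name>")` and fail fast if the mount is missing, rather than falling back to a download.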

Initial idea was discussed in the thread beginning with: https://redhat-internal.slack.com/archives/C06S75ZF9JT/p1713089733094399?thread_ts=1712828397.645709&cid=C06S75ZF9JT .

We plan to implement this after Release 1.0 so as to not interfere, but the POC can be developed and run alongside our workloads leading up to and during release.

/assign @lmilbaum /assign @Gregory-Pereira

Gregory-Pereira commented 5 months ago

Rather than maintaining individual runners manually, we are looking into dynamically provisioning the runners on EC2 spot instances. We identified 2 libraries that we could leverage in our implementation:

After initial discussion, we decided to proceed with a POC using the Terraform-based implementation due to the project's contribution velocity and resources, in addition to some initial discussions around support for Darwin builds: https://github.com/philips-labs/terraform-aws-github-runner/issues/2069#issuecomment-1133335130.

We determined the following steps to complete this feature request:

1. Start with testing just amd64 and arm64 builds on Linux using the Terraform-based repo
2. @Gregory-Pereira to identify whether his spare Mac mini can run as a dedicated runner to solve intermediary OSX amd64 builds
   - Look for any intermediary solution around OSX arm64, as this is more important than OSX amd64
3. Contribute the upstream changes in the Terraform-based repo around enablement for Darwin builds
4. Update our CI to use these new Darwin builds once it merges upstream

lmilbaum commented 4 months ago

subscription-manager should be available on the self-hosted runner to unlock installing RHEL packages when building RHEL-based images.
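
One way a runner image could be sanity-checked for this, as a rough sketch: look the tool up on `PATH` before any RHEL-based build starts. The `missing_tools` helper is hypothetical and only checks that the executable is present, not that the system is registered.

```python
import shutil
from typing import Iterable, List


def missing_tools(required: Iterable[str] = ("subscription-manager",)) -> List[str]:
    """Return the names from `required` that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        raise SystemExit(f"runner is missing required tools: {', '.join(missing)}")
```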

lmilbaum commented 4 months ago

We would like to align those efforts with the instructlab and osbuild github.com organizations.

lmilbaum commented 4 months ago

cverna commented 4 months ago

If you are interested in using Fedora CoreOS for the self hosted runners, I wrote this article a couple years ago ---> https://fedoramagazine.org/run-github-actions-on-fedora-coreos/. Using FCOS makes it really easy to spin up or down instances.

cevich commented 4 months ago

I'd like to inform and temper expectations relating to the proposed dynamic/ephemeral runner solution:

1. GitHub does not recommend this setup for public repositories; it's not safe. Assuming they know their own system better than anybody, I tend to trust their advice.
2. Dynamic + ephemeral runners require a bot/App with admin access. This is probably "okay" for a single repo, but it's going to be a very "hard sell" to force it upon the whole org. A compromise provides unlimited/unrestricted access to everything!
3. A dedicated Cloud project should be used for this. Given point 1, were some process to escape and gain access to cloud resources, the potential damage-radius should be kept as small/minimal as possible.
4. Please consider how the overall setup is to be monitored long-term. Passing/failing jobs alone likely aren't good enough. There should be some telemetry from the systems that verifies their "short" lifespan.
5. Complex systems really benefit from good, up-to-date documentation. Other maintainers will likely become involved, and without documentation they will become incredibly frustrated figuring everything out from scratch.
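
As a rough illustration of the lifespan telemetry mentioned above, the check can be as small as comparing each instance's launch time against a maximum age. This is a sketch under assumptions: `RunnerInstance` is a hypothetical record (in practice it would be filled from the cloud provider's inventory), and the one-hour limit in the usage note is only an example value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class RunnerInstance:
    # Hypothetical record describing one provisioned runner.
    instance_id: str
    launched_at: datetime  # timezone-aware launch timestamp


def overdue_runners(
    instances: Iterable[RunnerInstance],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[RunnerInstance]:
    """Return the runners that have been alive longer than `max_age`."""
    now = now or datetime.now(timezone.utc)
    return [inst for inst in instances if now - inst.launched_at > max_age]
```

Calling `overdue_runners(inventory, timedelta(hours=1))` on a schedule and alerting on a non-empty result would catch "ephemeral" runners that never got torn down.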

I do not want to be completely negative on this effort, so here are some (hopefully) constructive suggestions:

- Could statically provisioned, manually registered runners be "good enough"? No bots need admin, and it's very simple to maintain, document, and monitor.
- Can this repo be moved to a less-populated org, where org-wide admin access is a smaller risk?
- Perhaps there is a simpler CI system that could work? Even better if it's maintained and monitored by a dedicated team and/or doesn't require "runners".
- Could GHA be used differently, for example to directly provision its own cloud resources and orchestrate workloads itself?

ckyrouac commented 4 months ago

> GitHub does not recommend this setup for public repositories, it's not safe. Assuming they know their own system better than anybody, I tend to trust their advise.

FWIW, I've been looking into adding ephemeral self-hosted runners to the podman-bootc repo. It seems this is more of an upfront warning: do not use self-hosted runners on public repos unless you know what you are doing. AFAICT, a combination of using isolated/ephemeral runners and requiring approvals before running workflows from unknown contributors will significantly mitigate the risk.

https://docs.github.com/en/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks

Some interesting discussion here too: https://github.com/orgs/community/discussions/26722#discussioncomment-3253085
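
To make the approval idea concrete, here is a small sketch of the policy only. GitHub's own fork-approval setting is the real mechanism; this just shows the decision it encodes. It assumes the `author_association` values GitHub reports on pull requests, and the choice of which associations count as trusted is an illustrative assumption.

```python
# Associations treated as trusted in this sketch (an assumption, not a
# GitHub default): people with an existing relationship to the repo or org.
TRUSTED_ASSOCIATIONS = frozenset({"OWNER", "MEMBER", "COLLABORATOR"})


def needs_approval(author_association: str) -> bool:
    """Return True if a maintainer should approve before the workflow runs
    on a self-hosted runner."""
    return author_association.upper() not in TRUSTED_ASSOCIATIONS
```

With this policy, `needs_approval("FIRST_TIME_CONTRIBUTOR")` is `True` and `needs_approval("MEMBER")` is `False`.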

lmilbaum commented 3 months ago

WIP proposal doc - https://docs.google.com/document/d/1SmV8Y0qzk5nphamXuI-WccNWgi4ZPKn7_TJ_EVApRK0

cevich commented 3 months ago

> will significantly mitigate the risk.

Agreed, it probably does. I just want us to go into this effort mindful that there are likely significant, impactful, non-obvious "gotchas" and pitfalls, security and reliability issues included. GitHub is closed-source; they have no incentive to disclose all their reasoning for recommendations.

Thanks for the discussion link, I'll be sure to take a read through to educate myself.

lmilbaum commented 3 months ago

@Gregory-Pereira @cevich I don't have the cycles to drive this effort. Is that something one of you can drive?

Gregory-Pereira commented 3 months ago

I believe @cooktheryan will be driving this effort when he gets back, but I will most certainly help him push it forward and/or do the implementation once there is an agreed-upon plan.

lmilbaum commented 3 months ago

See @cgwalters's comment over at https://github.com/containers/bootc/issues/496. Yet another reason to prioritize this effort.

containers / ai-lab-recipes

Convert Mirror Repo strategy to self-hosted github Runners #264