Open Gregory-Pereira opened 5 months ago
Rather than maintaining individual runners manually, we are looking into dynamically provisioning EC2 spot instances for the runners. Two libraries were identified that we could leverage in our implementation:
After initial discussion, we decided to proceed with a POC using the Terraform-based implementation due to the project's contribution velocity and resources, in addition to some initial discussions around support for Darwin builds: https://github.com/philips-labs/terraform-aws-github-runner/issues/2069#issuecomment-1133335130.
The following steps were identified to complete this feature request:
- amd64 and arm64 builds on Linux using the Terraform-based repo
- OSX amd64 builds
- OSX arm64, as this is more important than OSX amd64
- subscription-manager should be available on the self-hosted runner to unlock installing RHEL packages when building RHEL-based images
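One way to picture the build targets listed above is as a mapping from each OS/architecture pair to the labels a workflow would request. The sketch below assumes the runners keep GitHub's default self-hosted labels (`self-hosted`, an OS label, and an architecture label); the `runs_on` helper and the `osx`/`amd64` key names are hypothetical and not part of the Terraform module.

```python
# Hypothetical mapping from the build targets in this issue to the default
# labels GitHub assigns to self-hosted runners (assumed unchanged here).
DEFAULT_LABELS = {
    ("linux", "amd64"): ["self-hosted", "linux", "x64"],
    ("linux", "arm64"): ["self-hosted", "linux", "ARM64"],
    ("osx", "amd64"): ["self-hosted", "macOS", "x64"],
    ("osx", "arm64"): ["self-hosted", "macOS", "ARM64"],
}


def runs_on(os_name: str, arch: str) -> list:
    """Return the label list a job could use in `runs-on` for a target."""
    key = (os_name.lower(), arch.lower())
    if key not in DEFAULT_LABELS:
        raise ValueError(f"no runner planned for target {os_name}/{arch}")
    # Return a copy so callers cannot mutate the shared table.
    return list(DEFAULT_LABELS[key])
```

Custom labels added at registration time would change this table, so treat it as a starting point only.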
We would like to align those efforts with the instructlab and osbuild github.com organizations.
If you are interested in using Fedora CoreOS for the self hosted runners, I wrote this article a couple years ago ---> https://fedoramagazine.org/run-github-actions-on-fedora-coreos/. Using FCOS makes it really easy to spin up or down instances.
I'd like to inform and temper expectations relating to the proposed dynamic/ephemeral runner solution:
I do not want to be completely negative on this effort, so here are some (hopefully) constructive suggestions:
GitHub does not recommend this setup for public repositories because it's not safe. Assuming they know their own system better than anybody, I tend to trust their advice.
FWIW, I've been looking into adding ephemeral self-hosted runners to the podman-bootc repo. It seems this is more of an upfront warning that says: do not use public self-hosted runners unless you know what you are doing. AFAICT, a combination of using isolated/ephemeral runners and requiring approvals before running workflows from unknown contributors will significantly mitigate the risk.
https://docs.github.com/en/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks
Some interesting discussion here too: https://github.com/orgs/community/discussions/26722#discussioncomment-3253085
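For anyone unfamiliar with what "ephemeral" means in practice, here is a minimal sketch in Python. It assumes a repository-level runner, uses the REST endpoint for creating a registration token, and passes `--ephemeral` to the runner's `config.sh` so the runner deregisters after a single job. The function names are hypothetical, and the workflow-approval setting for outside contributors is configured in the repository settings, not here.

```python
import json
import urllib.request

API = "https://api.github.com"


def registration_token_request(owner: str, repo: str, api_token: str) -> urllib.request.Request:
    """Build the POST request that asks GitHub for a runner registration token."""
    url = f"{API}/repos/{owner}/{repo}/actions/runners/registration-token"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {api_token}",
    }
    return urllib.request.Request(url, method="POST", headers=headers)


def fetch_registration_token(owner: str, repo: str, api_token: str) -> str:
    """Perform the request; the response JSON carries a short-lived `token`."""
    with urllib.request.urlopen(registration_token_request(owner, repo, api_token)) as resp:
        return json.load(resp)["token"]


def ephemeral_config_command(owner: str, repo: str, registration_token: str, labels=()) -> list:
    """Build the config.sh invocation for a runner that accepts exactly one job."""
    cmd = [
        "./config.sh",
        "--url", f"https://github.com/{owner}/{repo}",
        "--token", registration_token,
        "--ephemeral",   # deregister automatically after one job
        "--unattended",  # do not prompt for input
    ]
    if labels:
        cmd += ["--labels", ",".join(labels)]
    return cmd
```

The instance would run the resulting command, start `./run.sh`, and be terminated once the job finishes, so no state from one workflow run is visible to the next.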
> will significantly mitigate the risk.
Agreed, it probably does. I was just concerned that we go into this effort mindful that there are likely significant, impactful, and non-obvious "gotchas" and pitfalls, security and reliability issues included. GitHub is closed-source, and they have no incentive to disclose all their reasoning for recommendations.
Thanks for the discussion link, I'll be sure to take a read through to educate myself.
@Gregory-Pereira @cevich I don't have the cycles to drive this effort. Is that something one of you can drive?
I believe @cooktheryan will be driving this effort when he gets back, but I will most certainly help him push it forward and/or do the implementation given an agreed-upon plan.
See @cgwalters's comment over at https://github.com/containers/bootc/issues/496. Yet another reason to prioritize this effort.
The current repo mirror strategy to drive builds down is not scalable. We should look to move to using self-hosted GitHub runners where we can mount the models, stored on persistent storage, to the filesystem in such a way that our tests will not run out of storage and will not be flaky due to multi-gigabyte model downloads. Even if we could limp along with our current solution, swapping to this strategy will be a requirement for testing our multi-model feature in the llamacpp_python model_server.
The initial idea was discussed in the thread beginning with: https://redhat-internal.slack.com/archives/C06S75ZF9JT/p1713089733094399?thread_ts=1712828397.645709&cid=C06S75ZF9JT .
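As a rough illustration of the mounted-model idea, a test setup step could look for the model on the persistent volume first and only fall back to downloading when it is absent. This is a sketch under assumptions: the function names, the mount location, and the flat one-file-per-model layout are all hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_cached_model(models_dir: Path, filename: str) -> Optional[Path]:
    """Return the path of a model already present on the mounted volume, if any."""
    candidate = models_dir / filename
    # Treat an empty file as missing; it is most likely a failed earlier copy.
    if candidate.is_file() and candidate.stat().st_size > 0:
        return candidate
    return None


def needs_download(models_dir: Path, filename: str) -> bool:
    """True only when the model is not available from persistent storage."""
    return find_cached_model(models_dir, filename) is None
```

With the models pre-populated on the volume, `needs_download` would return `False` for every test run, which removes both the storage pressure and the flaky multi-gigabyte downloads described above.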
We plan to implement this after Release 1.0 so as to not interfere, but the POC can be developed and run alongside our workloads leading up to and during release.
/assign @lmilbaum
/assign @Gregory-Pereira