lingtran commented 3 years ago

Goal

Understand the feasibility and complexity of leveraging the self-hosted runner launch template that VSP spent a couple of months creating. A couple of VA teams, content-build and vets-website, are now running 100% on it. It currently takes minutes to spin up ECS containers in our pipelines, which creates a painful feedback cycle. We want to find out whether the time invested in adopting self-hosted runners will pay off down the road, including by shortening this feedback cycle. (goal and assumption)

Checklist

[ ] Note assumptions going into spike and throughout the spike - this is a potential gauge for complexity
[ ] Discuss and document with team
[ ] Create recommendations

Timebox

Estimate how long this should take, in days. Typically 3.

Assumptions

Additional Info/Resources

This spike was born out of some preliminary research done for UI card #14.
Links to self-hosted runners stood up by VSP/Demian Ginther:
https://github.com/department-of-veterans-affairs/devops/tree/master/packer/gha-runner
- run this from GitHub actions
documentation Demian wrote up
files of interest:

Notes from Ling's chat with Demian Ginther on July 1, 2021:

Runners are designated per repo
There is a GitHub Action that builds the runners
A separate process has the auto-scaling group launch them from the launch template
Runner was built using terraform
Demian shared the Terraform and uses Packer to build the AMI OR
Learnings / gotchas that came up in the development process:
- running multiple runners in the same instance
- by default, each job in a workflow can use its own runner process, separate from the others. But in this setup, each job takes an entire EC2 instance. You can get around this by spinning up different-size runners (this adds a bit of overhead), or by running as many runners on one instance as you want (but this can lead to disk space issues due to caching, so you need to run `docker volume prune` once in a while). If a job uses a service container like Postgres and binds to a port on the host, another job that runs on that same host can't use that port, because it is already bound and there is no compartmentalization within a runner instance
- self-hosted runners can be run in a Docker container, but GitHub does not support running runners inside Docker officially
- there is a Kubernetes operator that runs on an EKS cluster and can spin up ephemeral runners
  - the one gotcha is that a lot of users have to run jobs on separate runners, upload the data to the archive, and then re-download it to synthesize the results (content-build and vets-website)
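
The port-binding gotcha above can be reproduced without Docker or a runner. This is a minimal sketch, assuming two jobs that share one host and both want the same TCP port. The `try_bind` helper and the "job A/B/C" names are hypothetical and only stand in for a service container such as Postgres publishing a host port.

```python
import socket
from typing import Optional


def try_bind(port: int, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Try to listen on host:port. Return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically "address already in use": another process on this host owns the port.
        sock.close()
        return None


# "Job A" starts its service on a free port (port 0 lets the OS pick one).
job_a = try_bind(0)
port = job_a.getsockname()[1]

# "Job B" lands on the same host and asks for the same port while job A is still running.
job_b = try_bind(port)
print("job B got the port:", job_b is not None)

# After job A finishes and releases the port, a later job can bind it.
job_a.close()
job_c = try_bind(port)
print("job C got the port:", job_c is not None)
if job_c is not None:
    job_c.close()
```

With one runner per instance the conflict cannot happen. With several runner processes per instance, jobs that bind fixed host ports need to be kept apart or made to pick ports dynamically.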

Why not lambdas?

lambdas are hard to troubleshoot

General philosophy is the more you can use GitHub actions the better

Self-hosted runners were a way to get around the VAEC network constraint
they allow us to run larger runners than the default GitHub runners
running 4 runner processes on each instance
there is some spin-up time for the instances, but it can be wasteful at night because it runs (there is work to address this issue)
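
To make the trade-off between runner processes per instance and idle capacity concrete, here is a rough sizing sketch. The figure of 4 runner processes per instance comes from the notes above. The job counts in the usage lines are hypothetical and are not measurements from VSP or VANotify.

```python
import math

RUNNERS_PER_INSTANCE = 4  # from the notes: 4 runner processes on each instance


def instances_needed(concurrent_jobs: int,
                     runners_per_instance: int = RUNNERS_PER_INSTANCE) -> int:
    """Smallest number of instances that can run all jobs at the same time."""
    if concurrent_jobs < 0 or runners_per_instance < 1:
        raise ValueError("jobs must be >= 0 and runners per instance >= 1")
    return math.ceil(concurrent_jobs / runners_per_instance)


def idle_runner_slots(concurrent_jobs: int,
                      runners_per_instance: int = RUNNERS_PER_INSTANCE) -> int:
    """Runner processes left with nothing to do at that level of load."""
    capacity = instances_needed(concurrent_jobs, runners_per_instance) * runners_per_instance
    return capacity - concurrent_jobs


# Hypothetical load levels: a busy daytime peak and a quiet night.
for label, jobs in [("daytime peak", 10), ("night", 1)]:
    print(label, "-> instances:", instances_needed(jobs),
          "idle slots:", idle_runner_slots(jobs))
```

The same arithmetic shows why idle instances at night are wasteful: any instance kept running beyond what `instances_needed` returns for the current load is doing no work.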

Demian adapted the Packer build from GitHub… rewrote it from Azure to AWS, with some additions

Runners as is should work for us (VANotify), so we can use the same Packer script. The AMI comes up and self-registers with GitHub for the repo, so not a lot of tweaking is needed

Out of Scope

-

Open Questions

Is implementation out of scope?

lingtran commented 3 years ago

@miabecker created this spike card that was born out of the notification-ui#14 card.

miabecker commented 3 years ago

Got it - thank you!

department-of-veterans-affairs / notification-api

[SPIKE] Spin up self-hosted runners for VANotify using VSP assets/resources #519