coreweave / ml-containers

MIT License

Scaling up self-hosted GitHub Actions runners #25

Open Eta0 opened 1 year ago

Eta0 commented 1 year ago

ml-containers uses a self-hosted GitHub Actions runner to build container images through CI. The runner can currently handle only one job at a time, so all jobs run sequentially. As a consequence, complex builds with many variations, such as ml-containers/torch, are taking up to 7 hours per commit to finish their CI.

Very heavy commits slow down development, as they make iteratively fixing bugs in a CI deployment impractical.

Either dedicating more resources to keep runners available or implementing some form of autoscaling, such as with actions-runner-controller, may improve the situation.
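As a rough illustration of why more runners should help, here is a minimal sketch that estimates CI wall-clock time when queued jobs are handed to whichever runner frees up first. The function name `estimate_wall_clock` and the workload (14 build variations of 30 minutes each, chosen only so that the total matches the 7 hours mentioned above) are hypothetical and not measured from this repository.

```python
import heapq


def estimate_wall_clock(job_minutes, runners):
    """Estimate total wall-clock minutes for a queue of CI jobs.

    Jobs are taken in order and given to whichever runner becomes
    free first. This is a simplified model, not how GitHub Actions
    schedules work internally.
    """
    if runners < 1:
        raise ValueError("need at least one runner")
    # Each heap entry is the time at which one runner next becomes free.
    free_at = [0] * runners
    heapq.heapify(free_at)
    for minutes in job_minutes:
        start = heapq.heappop(free_at)            # earliest available runner
        heapq.heappush(free_at, start + minutes)  # busy until the job ends
    return max(free_at)


# Hypothetical workload: 14 variations x 30 minutes = 7 hours sequentially.
jobs = [30] * 14
for n in (1, 2, 4):
    print(f"{n} runner(s): {estimate_wall_clock(jobs, n) / 60} hours")
```

Under these assumed numbers, one runner takes 7 hours, two take 3.5 hours, and four take 2 hours. Real builds have uneven durations and dependencies between jobs, so the actual gain would likely be smaller.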

todie commented 1 year ago

Alright, so after some false starts, a little research, and a chat with @ChandonPierre, I'm going to try registering an auto-scaling solution inside a virtual cluster. If that doesn't work out, I'll fall back to punting with a large dedicated virtual server for this repository.

salanki commented 1 year ago

Let's spin up some more VMs. I think they can simply be scaled up.

todie commented 1 year ago

Sounds like a plan. Punting with VMs 🏈