CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners
https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/
Apache License 2.0
255 stars · 37 forks

Re-use EC2 runner instances #191

Closed moltar closed 1 year ago

moltar commented 1 year ago

Does it make sense to launch multiple EC2 instances for each workflow?

This really slows the process down. I think ideally it should launch one instance, re-use it for all runs, and then shut it down when idle.

kichik commented 1 year ago

We currently focus on ephemeral runners where one job doesn't affect the state of the next one. Leaving a runner behind means every package you installed in job 1 will be available in job 2. That can cause unexpected and unpredictable results.

In the future we may add an API for a pool of pre-warmed runners, or even runners that stay around. But right now the focus is on-demand runners.

moltar commented 1 year ago

Agreed on stateless.

Would it be possible to run an EC2 with a Docker daemon that starts ephemeral containers for jobs?

kichik commented 1 year ago

How would that be different than CodeBuild/Fargate/Lambda providers?

moltar commented 1 year ago

Ability to run Docker in Docker (dind)

moltar commented 1 year ago

Faster start times. Lambda is fast, but CodeBuild is very slow to provision. Same for Fargate.

kichik commented 1 year ago

I think solving this with pre-warmed instances will be simpler. This way we don't have to create another system to manage capacity and running containers in EC2.

moltar commented 1 year ago

Wouldn't pre-warmed instances eat up all of the savings then?

E.g. in our case, every PR runs 12 GitHub Workflows. This means 12 instances would need to be always on stand-by. Otherwise, even waiting for one of them would delay the entire PR's readiness (all tests need to pass).

kichik commented 1 year ago

Yes, using pre-warmed instances, or an always-on instance, would eat up the savings of on-demand. AWS prices scale roughly linearly with instance size, so both solutions would eat up savings in a similar manner, unless you overprovision on the shared instance. Either way you don't get the cost savings of on-demand.
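To make the linear-pricing point concrete, here is a rough sketch with an illustrative per-vCPU-hour rate (not a real AWS price): twelve small standby instances cost about the same per hour as one shared instance with the equivalent total vCPUs.

```typescript
// Illustrative rate only; EC2 on-demand pricing scales roughly
// linearly with instance size within a family.
const ratePerVcpuHour = 0.048; // assumption for illustration

// Option A: 12 small standby instances (2 vCPU each)
const twelveSmall = 12 * 2 * ratePerVcpuHour;

// Option B: one shared instance sized for all 12 jobs (24 vCPU)
const oneLarge = 24 * ratePerVcpuHour;

console.log(twelveSmall === oneLarge); // same hourly cost either way
```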

This is beginning to sound like ECS. Do you know if ECS with an already active instance provisions faster than Fargate? Adding ECS support should be relatively easy and I think it will give us a shared instance without much work or orchestration code to maintain in the future.

kichik commented 1 year ago

I gave ECS a try and found that it starts the container in about a second. When it also has to pull the image, it takes about 20 seconds to start the container. We can probably pre-pull the image. This assumes a running instance and no need to scale-up and create more instances.

Both cases take an extra 10 seconds or so for the runner to connect after the container starts. That's just how long the runner code takes and this might be optimized in the future by GitHub.
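Putting those rough measurements together gives the end-to-end startup estimates (numbers taken from the observations above):

```typescript
// Rough end-to-end startup time estimates, in seconds,
// based on the measurements described above.
const containerStartWarm = 1;  // image already cached on the instance
const containerStartCold = 20; // image has to be pulled first
const runnerConnect = 10;      // runner registering with GitHub

const warmTotal = containerStartWarm + runnerConnect; // ~11s
const coldTotal = containerStartCold + runnerConnect; // ~30s
console.log(warmTotal, coldTotal);
```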

Some code for the future (it doesn't handle auto-scaling by container count yet):

import { aws_ec2 as ec2, aws_ecs as ecs } from 'aws-cdk-lib';

// TODO set ECS_IMAGE_PULL_BEHAVIOR=prefer-cached on the instances
const cluster = new ecs.Cluster(stack, 'ecs cluster', {
  enableFargateCapacityProviders: false,
  capacity: {
    instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE),
    minCapacity: 1,
    maxCapacity: 2,
    desiredCapacity: 1,
    vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    spotInstanceDraining: false, // waste of money to restart jobs as the restarted job won't have a token
  },
  vpc: vpc,
});

const task = new ecs.Ec2TaskDefinition(stack, 'task');
task.addContainer('runner', {
  image: ecs.ContainerImage.fromEcrRepository(fargateX64Builder.bind().imageRepository),
  user: 'runner',
  memoryLimitMiB: 2048,
  command: ['./config.sh', '--unattended'],
  logging: ecs.LogDriver.awsLogs({
    streamPrefix: 'runner',
  }),
});
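The `ECS_IMAGE_PULL_BEHAVIOR` TODO refers to a real ECS container agent setting that tells the agent to prefer locally cached images over pulling. On the container instance it lives in the agent config file, which could be appended via instance user data (config fragment sketch, not tied to this snippet):

```
# /etc/ecs/ecs.config (ECS container agent configuration)
# prefer-cached: pull the image only if it is not already cached locally
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
```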

RichiCoder1 commented 1 year ago

Maybe ECS autoscaling + a simple daemon to precache builder images?

You'd still get extra-long cold starts for scale-up + image pull, though you could potentially mitigate that with EC2 Auto Scaling Warm Pools, which let you balance the performance/price tradeoff.

automartin5000 commented 1 year ago

I was surprised to see the on-demand behavior of the EC2 Runner Provider. My understanding was that GitHub Actions runs jobs in Docker containers, so I was expecting that jobs run on EC2 runners would similarly use Docker and inherently be ephemeral. I would second the notion of using ECS to manage runners on EC2.