CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners
https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/
Apache License 2.0
255 stars 37 forks source link

[Feature request] EC2 Fleet provider #394

Open riywo opened 10 months ago

riywo commented 10 months ago

The current EC2 provider is awesome. I'm just wondering we could implement a similar provider but using EC2 Fleet as an actual provisioner. It is neat because it can configure multiple instance types and purchase options with built-in strategies and would be super useful to find the lowest spot price without complicated logic, for example.

If you don't have any plan yet. I could work on it by implementing a new provider like Ec2FleetProvider but trying to align with the props of Ec2Provider as much as possible so that users can easily switch from one to the other.

Let me know your opinions!

kichik commented 10 months ago

Last time I tried to use EC2 Fleet I found this:

https://github.com/CloudSnorkel/cdk-github-runners/blob/da6a0c0fc497ccada9fb83cd03d7c0b2b96ffb4c/src/providers/ec2.ts#L379

If you find a good way to get around those issues, I'd be happy to merge a PR for it. We might be able to do it with a Lambda function. I have been trying to avoid those to keep the Step Function simple. But we have so many Lambda functions around, how bad can one more be?

cynicaljoy commented 6 months ago

@kichik you can define a Launch Template with User Data, Security Groups, Disks, etc and then use that Launch Template in the Fleet Request.

Sorry, I don't have time currently to work up a full PR, but something like this:

     new stepfunctions_tasks.CallAwsService(this, "FleetRequest", {
      comment: "FleetRequest",
      integrationPattern: IntegrationPattern.WAIT_FOR_TASK_TOKEN,
      service: 'ec2',
      action: 'fleetRequest',
      heartbeatTimeout: stepfunctions.Timeout.duration(Duration.minutes(10)),
      iamResources: ['*'],
      parameters: {
        TargetCapacitySpecification: {
          TotalTargetCapacity: 1,
          OnDemandTargetCapacity: 0,
          SpotTargetCapacity: 1,
          DefaultTargetCapacityType: 'spot',
        },
        LaunchTemplateConfigs: this.subnets.map((subnet, _) => {
          return {
            LaunchTemplateSpecification: {
              LaunchTemplateId: this.ami.launchTemplate.launchTemplateId,
              Version: this.ami.launchTemplate.versionNumber.toString(),
            },
            Overrides: {
              InstanceType: this.instanceType.toString(),
              IamInstanceProfile: {
                Arn: instanceProfile.attrArn,
              },
              InstanceInitiatedShutdownBehavior: ec2.InstanceInitiatedShutdownBehavior.TERMINATE,
              SubnetId: subnet.subnetId,
              MetadataOptions: {
                HttpTokens: 'required',
              },
              SecurityGroupIds: this.securityGroups.map(sg => sg.securityGroupId),
              BlockDeviceMappings: [{
                DeviceName: rootDeviceResource.ref,
                Ebs: {
                  DeleteOnTermination: true,
                  VolumeSize: this.storageSize.toGibibytes(),
                },
              }],
              UserData: stepfunctions.JsonPath.format(
                stepfunctions.JsonPath.stringAt('$.ec2.userdataTemplate'),
                ...params,
              ),
            }
          }
        }),
      },
    });
kichik commented 6 months ago

@cynicaljoy where do you see UserData being overridable? That's exactly the missing part. That's where we put all the runner configuration.

cynicaljoy commented 6 months ago

@kichik err sorry, napkin code. I'm translating this from a Go utility service I have. How I'm handling that is I first create a new version of the LaunchTemplate w/the updated UserData and then I use that version of the Launch Template in the Fleet request. Then I cleanup the oldest version of the LaunchTemplate to ensure that we don't grow the LaunchTemplate versions beyond the account limit which is 1000 by default.

kichik commented 6 months ago

I don't love creating a new resource for every runner. Combined with the idle reaper, that may cause launch templates to be left behind and reach that 1000 limit. It can also cause issues with deployments as we will be modifying resources created by CloudFormation.

The "cleanest" method (and those quotes are working real hard) I could think of so far was to create a queue for each fleet runner type. The instances could then pull their next runner task from the queue when they boot up. It will take real careful work to avoid having old jobs remain in the queue. What will take them off the queue if EC2 fails to launch, for example?

cynicaljoy commented 6 months ago

what about making the user data more idempotent? rather than the params being populated into the template and added to the launch template, those values are added to param store or secrets manager w/the task token as part of the key and the task token is added to the resource tags. then the idle reaper can clean up.

kichik commented 6 months ago

The problem with param store/secrets manager is that we need to be able to spin up multiple instances at the same time. They will override each other.

cynicaljoy commented 6 months ago

Am I looking in the right place? (PowerShell below that too of course).

I was thinking the workflow would be to create unique secret item (whether Secrets Manager or Parameter Store): e.g./cdk-runners/$ownerPath/$repoPath/$TASK_TOKEN/ (maybe more to ensure idempotence) and the variable parts would be Resource Tags. So the UserData would be updated to read the Resource Tags to know where to pull that runs params. Then the secret store would be deleted at the end of the user data.

Of course, this would be mean some additional required IAM scope and the secret item path should be configurable.

kichik commented 6 months ago

Am I looking in the right place? (PowerShell below that too of course).

Yes.

I was thinking the workflow would be to create unique secret item (whether Secrets Manager or Parameter Store): e.g./cdk-runners/$ownerPath/$repoPath/$TASK_TOKEN/ (maybe more to ensure idempotence) and the variable parts would be Resource Tags. So the UserData would be updated to read the Resource Tags to know where to pull that runs params. Then the secret store would be deleted at the end of the user data.

Of course, this would be mean some additional required IAM scope and the secret item path should be configurable.

Any resource being created by the step function (workflow) will pose a problem as it is implemented right now. Idle reaper will abort step functions which means created resources can be left behind.

CreateFleet does allow overriding instance resource tags for instant request type. So that could work. We can put the variables in tags and read them from IMDS.