CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners
https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/
Apache License 2.0

feat: add additional labels #472

Open pharindoko opened 6 months ago

pharindoko commented 6 months ago

Hey @kichik,

we have an issue with runners registered on the org level (and I assume on the repo level as well). When multiple jobs run in parallel, it can happen that the wrong job grabs the runner and other jobs are left waiting for a runner.

I tried to add additional labels to the runner labels, but saw that this is blocked by the webhook script. https://github.com/CloudSnorkel/cdk-github-runners/blob/0fc4be5f4ca524d4fe9f0c5056ff5466c2f89332/src/webhook-handler.lambda.ts#L148

feature-request: I would like to have the possibility to add additional labels to a workflow: [self-hosted, large-runner, additional-label-for-workflow, additional-label-for-repo, additional-label]

That would mean you only check for the required labels [self-hosted, large-runner]; if those match, the job is forwarded.
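
Something like this, just as a sketch of the relaxed matching I have in mind (made-up function name, not the actual webhook-handler code):

// Hypothetical relaxed matching: the job may carry extra labels, as long as
// every label the provider requires is present.
function jobSatisfiesProvider(jobLabels: string[], providerLabels: string[]): boolean {
  const job = new Set(jobLabels.map((label) => label.toLowerCase()));
  return providerLabels.every((label) => job.has(label.toLowerCase()));
}

// The workflow asks for extra workflow/repo labels, but the provider only
// requires [self-hosted, large-runner], so the job would still be forwarded.
jobSatisfiesProvider(
  ['self-hosted', 'large-runner', 'additional-label-for-workflow', 'additional-label-for-repo'],
  ['self-hosted', 'large-runner'],
); // => true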

kichik commented 6 months ago

Can you please include the specific labels used? It will be easier to discuss with a concrete example.

pharindoko commented 6 months ago

As mentioned, this can occur no matter whether the runners are registered on the org or repo level, but it happens more often on the org level because more jobs are executed there.

Background:

We now register the runners on the org level, which means they are visible to all repositories the GitHub app is configured for in the organization.

The runner has the labels [self-hosted, large-runner]. We have workflows in two different repositories A and B in the same org that use these labels.

We trigger workflow A in repository A using workflow dispatch. Workflow A gets queued and stays queued...

In the meantime, another workflow B, which has already been waiting 15 hours for a runner with the labels [self-hosted, large-runner], takes over this runner. There can be several reasons why this happens, e.g. a user forgot to enable the "Allow public repositories" setting on the organization's runner group while having a public repository.

I can see this directly in the CloudWatch logs, because in this case the workflow job failed and returned: Job FrontendDeployment completed with result: Failed

The job FrontendDeployment only exists in repository B for workflow B. The Step Functions execution succeeded, the EC2 instance was terminated... and workflow A still waits for a runner...

kichik commented 6 months ago

That would be one of the organization level registration cons:

https://github.com/CloudSnorkel/cdk-github-runners/blob/994f474deb38c904d94d2ad824b0493bac09d942/setup/src/App.svelte#L333C1-L334C49

You can bring up more runners to take over the stolen ones by creating a new step function execution. Open an existing one from the UI and hit the New Execution button.
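
If you want to script that instead of clicking through the console, something like this should work (a rough sketch with the AWS SDK, not part of this library):

import { SFNClient, DescribeExecutionCommand, StartExecutionCommand } from '@aws-sdk/client-sfn';

// Copy the input of an execution whose runner was stolen and start a fresh
// execution, which brings up a replacement runner with the same labels and provider.
async function restartRunner(stolenExecutionArn: string, stateMachineArn: string) {
  const sfn = new SFNClient({});
  const previous = await sfn.send(new DescribeExecutionCommand({ executionArn: stolenExecutionArn }));
  await sfn.send(new StartExecutionCommand({ stateMachineArn, input: previous.input }));
}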

I'm still not sure what you originally meant with the labels. Where would the label be added to resolve this?

pharindoko commented 5 months ago

I assume you can have this issue no matter whether it's an org-level or repo-level runner. The probability is just higher on the org level. Furthermore, as a user you might not see all repos in the org, and it's quite hard to find jobs that got stuck and are still waiting for a runner if you have a lot of repos in one org.

Here's a solution proposal:

  1. The runner in the CDK solution is created with the labels [label-for-instance-type]
  2. The user creates a workflow with [self-hosted, label-for-instance-type, label-for-repository]
  3. The workflow gets triggered and webhook.lambda.ts is invoked with the labels [self-hosted, label-for-instance-type, label-for-repository]. matchLabelsToProvider only checks whether the runner labels are included: [self-hosted, label-for-instance-type]
  4. The step function determines a runner based on the labels that match a runner: [self-hosted, label-for-instance-type]
  5. All labels [self-hosted, label-for-instance-type, label-for-repository] are injected into the runner and attached to the config.sh command.
  6. The runner is registered in GitHub with the labels [self-hosted, label-for-instance-type, label-for-repository]

This would mean that runner labels and workflow labels can be different and you can attach additional workflow labels.
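
For steps 5 and 6, the idea is roughly this (just an illustration of merging the labels before registration, not the actual runner startup code):

// Merge the provider's own labels with the extra labels from the workflow job
// and hand them all to the runner registration (config.sh takes a
// comma-separated --labels argument).
function buildRegistrationLabels(providerLabels: string[], jobLabels: string[]): string {
  return Array.from(new Set([...providerLabels, ...jobLabels])).join(',');
}

// buildRegistrationLabels(['self-hosted', 'label-for-instance-type'],
//                         ['self-hosted', 'label-for-instance-type', 'label-for-repository'])
// => 'self-hosted,label-for-instance-type,label-for-repository'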

I'll try to create a PR and test this to see if it could work. @kichik please let me know if you have concerns...

kichik commented 5 months ago

That won't prevent other repositories from using label-for-repository and still stealing the runner. Our label situation is already not the simplest to understand. I think this will make it even harder and still won't solve runner stealing.

pharindoko commented 5 months ago

Well, it's not like people steal runners intentionally. And label-for-repository is just an example of a tag.

You could use multiple tags and choose whatever you want. But it could happen that people just copy/paste those labels from a GitHub Action in another repository by mistake. It would still heavily reduce the probability that one job steals another job's runner. And you can trace it in the logs as well.

pharindoko commented 5 months ago

btw if you know a better solution for this dilemma, please let me know :)

kichik commented 5 months ago

But it could happen that people just copy/paste those labels from a GitHub Action in another repository by mistake.

Exactly.

Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.
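
Roughly something like this (untested sketch, label names made up):

import { Ec2RunnerProvider, GitHubRunners } from '@cloudsnorkel/cdk-github-runners';

// One image builder shared by all providers, so the image is only built once.
const imageBuilder = Ec2RunnerProvider.imageBuilder(this, 'image builder');

// One provider per "special" repository, differing only in labels.
const frontendProvider = new Ec2RunnerProvider(this, 'frontend runner', {
  labels: ['large-runner', 'frontend-repo'],
  imageBuilder,
});
const backendProvider = new Ec2RunnerProvider(this, 'backend runner', {
  labels: ['large-runner', 'backend-repo'],
  imageBuilder,
});

new GitHubRunners(this, 'runners', {
  providers: [frontendProvider, backendProvider],
});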

btw if you know a better solution for this dilemma, please let me know :)

Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.

pharindoko commented 5 months ago

But it could happen that people just copy/paste those labels from a GitHub Action in another repository by mistake.

Exactly.

Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.

No chance. Too many repos. This is no fun even when you have more than 10 repos.

btw if you know a better solution for this dilemma, please let me know :)

Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.

What I struggle with is the administration permission required at the repo level. A dedicated permission just for managing runners on the repository level is still missing.

Another option:

Add a schedule feature to start additional runners on a regular basis.

The idleTimeout setting will kill those EC2 instances and deregister the runners by default if no job is picked up after 5 minutes, right?

@kichik please give me feedback; I could try to create a draft PR for this topic.

kichik commented 5 months ago

No chance. Too many repos. This is no fun even when you have more than 10 repos.

Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code can automatically generate it using gh?
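
Rough sketch of what I mean (untested; assumes the gh CLI is installed and authenticated at synth time, and the org name and label scheme are placeholders):

import { execSync } from 'node:child_process';
import { Ec2RunnerProvider, GitHubRunners } from '@cloudsnorkel/cdk-github-runners';

// List the organization's repositories at synth time.
const repos = execSync("gh repo list my-org --limit 1000 --json name --jq '.[].name'")
  .toString()
  .trim()
  .split('\n');

const imageBuilder = Ec2RunnerProvider.imageBuilder(this, 'image builder');

// One provider per repository, each with a repo-specific label, all sharing one image builder.
const providers = repos.map((repo) => new Ec2RunnerProvider(this, `runner-${repo}`, {
  labels: ['large-runner', `repo-${repo}`],
  imageBuilder,
}));

new GitHubRunners(this, 'runners', { providers });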

The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.

You can already define a schedule manually, if you need to. I have done something similar for one of my installations.

import * as cdk from 'aws-cdk-lib';
import { aws_events as events, aws_events_targets as targets, aws_stepfunctions as stepfunctions } from 'aws-cdk-lib';

// Grab the runner orchestrator state machine from the GitHubRunners construct
// and start it on a schedule with a hand-crafted job input.
const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(sm, {
      input: events.RuleTargetInput.fromObject({
        "owner": "my-org",
        "repo": "my-repo",
        "jobId": 0,
        "jobUrl": "manual-runner",
        "installationId": xxx, // TODO get from github secret
        "labels": {
          "self-hosted": true,
          "something": true, // TODO update
        },
        "provider": "stack-name/provider-id", // TODO update
      }),
    }),
  ],
});

pharindoko commented 5 months ago

No chance. Too many repos. This is no fun even when you have more than 10 repos.

Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code can automatically generate it using gh?

It doesn't happen all the time, right? And for most of the users it's quite OK how it works, even with org runners. But for some of them it's not acceptable when it's used for production deployments. And those would be fine with attaching additional labels to make it more reliable.

The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.

You can already define a schedule manually, if you need to. I have done something similar for one of my installations.

import * as cdk from 'aws-cdk-lib';
import { aws_events as events, aws_events_targets as targets, aws_stepfunctions as stepfunctions } from 'aws-cdk-lib';

// Grab the runner orchestrator state machine from the GitHubRunners construct
// and start it on a schedule with a hand-crafted job input.
const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(sm, {
      input: events.RuleTargetInput.fromObject({
        "owner": "my-org",
        "repo": "my-repo",
        "jobId": 0,
        "jobUrl": "manual-runner",
        "installationId": xxx, // TODO get from github secret
        "labels": {
          "self-hosted": true,
          "something": true, // TODO update
        },
        "provider": "stack-name/provider-id", // TODO update
      }),
    }),
  ],
});

Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.

Create a schedule to trigger a lambda. The lambda queries CloudWatch Logs Insights to get the owner, repo and labels aggregated over the last 2 weeks, goes through all rows found and triggers the state machine. I'll try it. I would still like to have this as part of the CDK construct :)
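
Roughly what I have in mind for the lambda (untested sketch; the queried field names and the environment variables are placeholders I made up):

import { CloudWatchLogsClient, StartQueryCommand, GetQueryResultsCommand } from '@aws-sdk/client-cloudwatch-logs';
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

const logs = new CloudWatchLogsClient({});
const sfn = new SFNClient({});

// Aggregate the owner/repo/label combinations seen in the orchestrator logs over
// the last two weeks. The field names in the query are assumptions -- adjust them
// to whatever the state machine actually logs.
async function recentJobs(logGroupName: string) {
  const now = Math.floor(Date.now() / 1000);
  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName,
    startTime: now - 14 * 24 * 60 * 60,
    endTime: now,
    queryString: 'stats count(*) by owner, repo, labels', // hypothetical field names
  }));
  for (;;) { // simplified polling; real code should back off and time out
    const result = await logs.send(new GetQueryResultsCommand({ queryId }));
    if (result.status === 'Complete') return result.results ?? [];
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}

// Start one new orchestrator execution per aggregated row, reusing the input
// shape from the EventBridge example above.
export async function handler() {
  for (const row of await recentJobs(process.env.LOG_GROUP_NAME!)) {
    const field = (name: string) => row.find((f) => f.field === name)?.value ?? '';
    await sfn.send(new StartExecutionCommand({
      stateMachineArn: process.env.STATE_MACHINE_ARN!,
      input: JSON.stringify({
        owner: field('owner'),
        repo: field('repo'),
        jobId: 0,
        jobUrl: 'scheduled-runner',
        installationId: 0, // TODO read from the GitHub app secret
        labels: Object.fromEntries(field('labels').split(',').map((label) => [label, true] as const)),
        provider: process.env.PROVIDER!, // TODO map labels to the right provider
      }),
    }));
  }
}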

kichik commented 5 months ago

It doesn't happen all the time, right? And for most of the users it's quite OK how it works, even with org runners. But for some of them it's not acceptable when it's used for production deployments. And those would be fine with attaching additional labels to make it more reliable.

You can create a provider just for those special cases then. Isn't that basically what you were asking for? What am I missing?

Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.

Since you're registering on organization level, the repo part doesn't really matter. You can leave it as a placeholder. No need for Lambda or anything.

pharindoko commented 4 months ago

OK, I found a solution that I think is good enough to circumvent this problem:

(Architecture diagram attached.) I added a construct with a step function and a few lambdas with the following workflow:

  1. On the left side, I get all the information about the transmitted jobs (repo, owner, job, labels) from the state machine logs.
  2. On the right side, I query GitHub in a lambda, authenticated as the GitHub app, to get all the organisation installations.
  3. Based on the "owner"/"organisation" property I merge the information.
  4. I check the jobs and their current state (has the job been queued for longer than 10 minutes?).
  5. I trigger new step function executions to get new runners with the specific labels registered in the right organisation.

I trigger this step function every half hour.
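
The trigger itself is just an EventBridge rule on my scheduler state machine (sketch; schedulerStateMachine refers to the state machine from my construct):

import * as cdk from 'aws-cdk-lib';
import { aws_events as events, aws_events_targets as targets } from 'aws-cdk-lib';

// Run the scheduler state machine every half hour.
new events.Rule(this, 'Scheduled Runner Check', {
  schedule: events.Schedule.rate(cdk.Duration.minutes(30)),
  targets: [new targets.SfnStateMachine(schedulerStateMachine)],
});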

What would be great to get as outputs from the GithubRunners construct: the orchestrator state machine and its log group name.

Currently this is what I need to do to wire up my construct:

const stateMachine = githubRunners.node.children.find((child) => child instanceof stepfunctions.StateMachine)! as stepfunctions.StateMachine;
const logGroupName = `${this.stackName}-state-machine`;

new GithubScheduledRunnerHandler(this, 'github-scheduled-runner', {
  internalSubnetIds: props.internalSubnetIds,
  vpcId: props.vpcId,
  logGroupName,
  githubSecretId: githubRunners.secrets.github.secretName,
  githubPrivateKeySecretId: githubRunners.secrets.githubPrivateKey.secretName,
  mappingTable: providers.map((provider) => ({ provider: provider.toString(), labels: provider.labels })),
  stateMachineArn: stateMachine.stateMachineArn,
  githubRunnerWaitTimeInMinutes: 10,
});

pharindoko commented 4 months ago

Would you be interested in integrating this construct into your solution? Should I open a PR, or should I create a separate construct? I could share the code in a private repo in a simplified version (not a perfect construct) if you want more details. Please let me know @kichik.

kichik commented 4 months ago

I do want something like this one day, but I don't think we have all the details figured out yet. For example, when using ECS with a limited ASG, a job might be pending for >10 minutes just because the cluster is busy. If we keep submitting more jobs, it may only make the situation worse.

I do like that it doesn't rely only on the webhook. I have seen cases where the webhook was simply not called, so runners were missing.

kichik commented 4 months ago

Please do share the code, if you can. I think I may have misunderstood the matching you were doing between jobs and the existing state machine. It might already cover the problem I mentioned.

pharindoko commented 4 months ago

@kichik I hope you have seen my invitation - I created a private repo for this.

kichik commented 4 months ago

@kichik I hope you have seen my invitation - I created a private repo for this.

I did. Thanks.

pharindoko commented 4 months ago

After more than a week of running the scheduler step function, I found a clear correlation between the webhook lambda function error message ExecutionAlreadyExists and new runners being triggered by my scheduler step function. It only seems to happen in this case.

Detailed error message in CloudWatch:

__type: com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists
$fault: client
$metadata.attempts: 1
$metadata.httpStatusCode: 400
$metadata.requestId: f1cad492-8906-4967-9c99-f920a010bf90
$metadata.totalRetryDelay: 0
@ingestionTime: 1704816064769
@log: ********:/aws/lambda/gh-runner-ghrunnergithubrunnerWebhookHandlerwebhoo....
@logStream: 2024/01/09/[$LATEST]e18f1646a851483d93a0490663a2cde4
@message: 2024-01-09T16:00:56.047Z f8bf04d7-561b-4d7a-bbd0-b66f23121b91 ERROR Invoke Error {"errorType":"ExecutionAlreadyExists","errorMessage":"Execution Already Exists: 'arn:aws:states:********:********:execution:ghrunnergithubrunnerRunnerOrchestrator********:******'-44b33b20-af08-'","name":"ExecutionAlreadyExists","$fault":"client","$metadata":{"httpStatusCode":400,"requestId":"f1cad492-8906-4967-9c99-f920a010bf90","attempts":1,"totalRetryDelay":0},"__type":"com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists","stack":["ExecutionAlreadyExists: Execution Already Exists: 'arn:aws:states:********:*******:execution:ghrunnergithubrunnerRunnerOrchestrator********:*****'"," at de_ExecutionAlreadyExistsRes (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1694:23)"," at de_StartExecutionCommandError (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1321:25)"," at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"," at async /var/runtime/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24"," at async /var/runtime/node_modules/@aws-sdk/middleware-signing/dist-cjs/awsAuthMiddleware.js:14:20"," at async /var/runtime/node_modules/@aws-sdk/middleware-retry/dist-cjs/retryMiddleware.js:27:46"," at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:7:26"," at async Runtime.handler (/var/task/index.js:13997:21)"]}
@requestId: f8bf04d7-561b-4d7a-bbd0-b66f23121b91
@timestamp: 1704816056047

kichik commented 4 months ago

The execution name uses the GitHub webhook delivery id to remain unique.

https://github.com/CloudSnorkel/cdk-github-runners/blob/06ffc53c8067027748a4447d3f3b248fabb94131/src/webhook-handler.lambda.ts#L167

Do you maybe have the same webhook hooked up more than once somehow?
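
Since the execution name comes from the delivery id, anything that delivers the same webhook payload twice will hit exactly this error. Just to illustrate the mechanism (a sketch, not the actual handler code):

import { SFNClient, StartExecutionCommand, ExecutionAlreadyExists } from '@aws-sdk/client-sfn';

// The execution name is derived from the webhook delivery id, so a second
// delivery of the same webhook collides with the first execution. Treating
// the collision as success would make duplicate deliveries idempotent.
async function startOrchestrator(stateMachineArn: string, deliveryId: string, input: string) {
  const sfn = new SFNClient({});
  try {
    await sfn.send(new StartExecutionCommand({ stateMachineArn, name: deliveryId, input }));
  } catch (e) {
    if (e instanceof ExecutionAlreadyExists) {
      return; // same delivery already handled -- nothing to do
    }
    throw e;
  }
}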

pharindoko commented 4 months ago

The execution name uses the GitHub webhook delivery id to remain unique.

https://github.com/CloudSnorkel/cdk-github-runners/blob/06ffc53c8067027748a4447d3f3b248fabb94131/src/webhook-handler.lambda.ts#L167

Do you maybe have the same webhook hooked up more than once somehow?

OK, then I assume it might be the API Gateway / Lambda retry mechanism... will check it.

pharindoko commented 3 months ago

I assume this really only happens when the API Gateway (my internal customlambdaaccess implementation) does a retry...

It seems to occur for about 1% of the runner requests, and it's always the same correlation: the step function ExecutionAlreadyExists message <> a 502 Bad Gateway error <> a queued job waiting for a runner in GitHub.

kichik commented 3 months ago

Can you tell what the initial failure is? It makes sense for the second webhook delivery attempt to fail with the ExecutionAlreadyExists message and return 502. But why would the first one fail and require a retry? Is it maybe a timeout?

pharindoko commented 3 months ago

You're right, I should investigate this :) Will do... a timeout could be an option, yes.