Bug: multi-labels don`t work as expected - Githubissues

CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners

https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/

Apache License 2.0

255 stars 37 forks source link

Bug: multi-labels don`t work as expected #181

Closed pharindoko closed 1 year ago

pharindoko commented 1 year ago

I added the ec2 runner for test ...

cdk code:

    const cloudPlatformsEc2Runner = new Ec2Runner(
      this,
      `${props.serviceName}-${props.stage}-ec2-runner`,
      {
        labels: ["cloudplatforms", "ec2"],
        vpc: existingVpc,
        subnetSelection: vpcSubnets,
        spot: true,
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MICRO),
      }
    );

workflow.yml:

name: fargate example
on: workflow_dispatch
jobs:
  self-hosted:
    runs-on: [self-hosted,cloudplatforms,ec2]
    steps:
      - run: echo hello world

stepfunctions:

these labels have been transmitted:

  "labels": {
    "self-hosted": true,
    "cloudplatforms": true,
    "ec2": true
  },

result:

it`s taking the wrong label :O

I assume because of this weird parent label

pharindoko commented 1 year ago

ok, fine I figured out that the order of the runners is important ... like in a classic if, else if, else order ...

pharindoko commented 1 year ago

@kichik Would be nice if the state machine checks exactly for the given labels. Maybe you can additionally check if the count of the labels fit.

kichik commented 1 year ago

I need to think about this one. Not sure if people would want to include labels that are not taken into consideration to carry some meta data with them.

pharindoko commented 1 year ago

Ok having two different github runner stacks running (one prod, one experimental) this leads to confusion. On each app a runner will be started when one of the same tags is used in both GitHubRunners stacks.

kichik commented 1 year ago

Can you please include the labels both runner stacks are waiting for and the labels used by the job? I'm not sure I understand how this specific issue will cause two runners to start.

Either way, I'm still not 100% sure about this issue. How about we start with a flag that's off by default? The webhook function would have to generate a string of sorted labels joined by a comma. It should get an environment variable flag switching between the old and new behavior. Or maybe just always send both in two different fields.

https://github.com/CloudSnorkel/cdk-github-runners/blob/29708ebe9a83cf84566a7073cda60dd6bbff402a/src/lambdas/webhook-handler/index.ts#L92-L94

Then our comparison below can be labels == "self-hosted,cloudplatforms,ec2".

https://github.com/CloudSnorkel/cdk-github-runners/blob/29708ebe9a83cf84566a7073cda60dd6bbff402a/src/runner.ts#L295-L302

kichik commented 1 year ago

As I was writing the documentation for this feature, I realized why this is a bug bug. The provider took a job that it cannot actually execute. The job will never be assigned to the ["cloudplatforms"] provider. The job requires ["cloudplatform", "ec2"]. So the provider will start an unused runner.

I still feel like I'm missing something, so I'll open the PR and leave it open for a few days.

Update: finally found one of the issues. Built-in runner labels like windows/linux.

Update 2: Not sure how to work around OS/architecture labels. If we require exact label match, then all users now have to always specify e.g. [label, linux, x64, self-hosted]. Either that or we strip those from the request but then the user can't specify providers with them either (like the integration test does).

kichik commented 1 year ago

I think I will end up adding a warning if labels collide this way. I can't think of a way to make this work in a way that will also be pleasant for the user. The only solution I can think of is forcing the user to always specify all the labels including OS and architecture.

kichik commented 1 year ago

Next version should have a warning (and an error for duplicate labels). If anyone could think of a better way, please re-open this ticket.