CloudSnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners
https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/
Apache License 2.0

Issue with ECS Provider #406

Closed excavador closed 10 months ago

excavador commented 10 months ago

Hello!

I need your advice. I am not able to launch ECS-based runners, and I do not understand what is wrong or how to get them working. The Lambda and EC2 runners work fine.

So, this is my code.

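import * as cdk from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';
import * as run from '@cloudsnorkel/cdk-github-runners';
// VpcStack (used in provision() below) is defined elsewhere and not shown.

export interface GitHubRunnersStackProps extends cdk.StackProps {
    readonly ec2ImageBuilder?: run.RunnerImageBuilderProps;
    readonly ec2?: run.Ec2RunnerProviderProps;
    readonly ecsImageBuilder?: run.RunnerImageBuilderProps;
    readonly ecs?: run.EcsRunnerProviderProps;
    // Referenced by the ECS branch below; the type is inferred from its use
    // as the props of autoscaling.AutoScalingGroup.
    readonly ecsAutoScalingGroup?: autoscaling.AutoScalingGroupProps;
    readonly lambdaImageBuilder?: run.RunnerImageBuilderProps;
    readonly lambda?: run.LambdaRunnerProviderProps;
    readonly runners: run.GitHubRunnersProps;
}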

export class GitHubRunnersStack extends cdk.Stack {
    private readonly imageBuilders: run.RunnerImageBuilder[];

    // we need these dependencies to avoid error
    // "Cannot have more than 1 builds in queue for the account"
    private addImageBuilder(builder: run.RunnerImageBuilder) {
        let last: run.RunnerImageBuilder | undefined;
        if (this.imageBuilders.length > 0) {
            last = this.imageBuilders[this.imageBuilders.length - 1];
            builder.node.addDependency(last);
        }
        this.imageBuilders.push(builder);
    }

    constructor(scope: Construct, id: string, props: GitHubRunnersStackProps) {
        super(scope, id, props);
        iam.PermissionsBoundary.of(this).apply(
            iam.ManagedPolicy.fromManagedPolicyName(this, 'PermissionsBoundary', 'pb@read-only-iam'),
        );
        const providers: run.IRunnerProvider[] = [];
        providers;
        this.imageBuilders = [];
        if (props.ec2 !== undefined) {
            const current = new Construct(this, 'EC2');
            const builder = run.Ec2RunnerProvider.imageBuilder(current, 'Builder', props.ec2ImageBuilder);
            this.addImageBuilder(builder);
            // force to build image
            builder.bindAmi();
            const provider = new run.Ec2RunnerProvider(current, 'Provider', {
                imageBuilder: builder,
                ...props.ec2,
            });
            providers.push(provider);
        }
        if (props.ecs !== undefined && props.ecsAutoScalingGroup !== undefined) {
            const current = new Construct(this, 'ECS');
            const builder = run.EcsRunnerProvider.imageBuilder(current, 'Builder', props.ecsImageBuilder);
            this.addImageBuilder(builder);
            // force to build image
            builder.bindDockerImage();

            const provider = new run.EcsRunnerProvider(current, 'Provider', {
                imageBuilder: builder,
                ...props.ecs,
            });
            providers.push(provider);
        }
        if (props.lambda !== undefined) {
            const current = new Construct(this, 'Lambda');
            const builder = run.LambdaRunnerProvider.imageBuilder(current, 'Builder', props.lambdaImageBuilder);
            this.addImageBuilder(builder);
            // force to build image
            builder.bindDockerImage();
            const provider = new run.LambdaRunnerProvider(current, 'Provider', {
                imageBuilder: builder,
                ...props.lambda,
            });
            providers.push(provider);
        }
        // new run.GitHubRunners(this, 'GitHubRunners', {
        //     providers,
        //     ...props.runners,
        // });
    }
}

export function provision() {
    const app = new cdk.App();

    const vpc = new VpcStack(app, 'Vpc', {
        cidr: '10.172.0.0/16',
        maxAzs: 3,
        natGateways: 1,
        subnets: [
            {
                name: 'Public01',
                type: ec2.SubnetType.PUBLIC,
                cidrMask: 24,
            },
            {
                name: 'Private01',
                type: ec2.SubnetType.PRIVATE_WITH_EGRESS,
                cidrMask: 24,
            },
            {
                name: 'Private02',
                type: ec2.SubnetType.PRIVATE_WITH_EGRESS,
                cidrMask: 24,
            },
        ],
    });
    new GitHubRunnersStack(app, 'GitHubRunners', {
        ec2ImageBuilder: {
            vpc: vpc.vpc,
            os: run.Os.LINUX_UBUNTU,
        },
        ec2: {
            labels: ['ec2'],
            vpc: vpc.vpc,
            subnetSelection: {
                subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
            },
        },
        ecsImageBuilder: {
            vpc: vpc.vpc,
            os: run.Os.LINUX_UBUNTU,
        },
        ecs: {
            labels: ['ecs'],
            vpc: vpc.vpc,
            subnetSelection: {
                subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
            },
        },
        lambdaImageBuilder: {
            vpc: vpc.vpc,
            os: run.Os.LINUX_AMAZON_2,
        },
        lambda: {
            labels: ['lambda'],
            vpc: vpc.vpc,
            subnetSelection: {
                subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
            },
            memorySize: 128,
            ephemeralStorageSize: cdk.Size.gibibytes(1),
        },
        runners: {
            vpc: vpc.vpc,
        },
    });
}

if (typeof require !== 'undefined' && require.main === module) {
    provision();
}

I am receiving an error like the following:

11:21:55 AM | CREATE_FAILED        | AWS::ECS::CapacityProvider                     | ECS/Provider/Capac.../Capacity Provider
Resource handler returned message: "Service Unavailable. Please try again later. (Service: AmazonECS; Status Code: 500; Error Code: ServerException; Request ID: 53fc7f34-b5f4-4240-901c-06e7c4e105b9; Proxy: null)" (RequestToken: 5371e915-32ac-b42d-3bd5-3bce080a3a14, HandlerErrorCode: GeneralServiceException)

I tried several tricks, like the following:

            //
            // https://stackoverflow.com/questions/70597037/terraform-ecs-createcapacityprovider-request-500
            //
            // Signal to CloudFormation that the instance is up
            const asgId = 'EcsProviderAutoScalingGroup';
            const userData = ec2.UserData.forLinux();
            userData.addCommands(
                'PS4="$(date -u +"%Y-%m-%dT%H:%M:%SZ") [INFO] "',
                'set -x', // echo everything
                '(',
                'yum install -y aws-cfn-bootstrap',
                `/opt/aws/bin/cfn-signal -e 0 --stack ${this.stackName} --resource ${asgId} --region ${this.region}`,
                '{ set +x; } 2>/dev/null',
                ') |& tee --append /var/log/ecs/user-data.log',
            );
            const autoScalingGroup = new autoscaling.AutoScalingGroup(current, 'AutoScalingGroup', props.ecsAutoScalingGroup);
            const cfnAsg = autoScalingGroup.node.defaultChild as autoscaling.CfnAutoScalingGroup;
            cfnAsg.overrideLogicalId(asgId);
            cfnAsg.autoScalingGroupName = asgId;

            const capacityProvider = new ecs.AsgCapacityProvider(current, 'CapacityProvider', {
                machineImageType: ecs.MachineImageType.AMAZON_LINUX_2,
                autoScalingGroup,
            });

            const provider = new run.EcsRunnerProvider(current, 'Provider', {
                imageBuilder: builder,
                capacityProvider,
                ...props.ecs,
            });
            providers.push(provider);

...but it does not work.

The AWS documentation and forums suggest that this could be caused by some misconfiguration of the ECS tasks.

So, do you have any idea how to launch it and make it work?

excavador commented 10 months ago

Just in case: I checked without the permissions boundary, and the issue persists. It sounds like a pretty basic problem in my code or maybe in cdk-github-runners.

I am testing in eu-west-1, in case that matters.

kichik commented 10 months ago

Have you tried with no parameters at all? Let it create the VPC for you with the defaults, and everything else too.
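
A minimal sketch of what "no parameters at all" could look like for the ECS provider: only the label is set, and the VPC, subnets, capacity and image builder are left to whatever the library picks by default. The stack name and construct IDs are made up for illustration, and pinning `env` to the CLI's default account and region is an assumption, in case one of the defaults has to be looked up in a concrete account.

```typescript
import * as cdk from 'aws-cdk-lib';
import { EcsRunnerProvider, GitHubRunners } from '@cloudsnorkel/cdk-github-runners';

const app = new cdk.App();

// Hypothetical stack name. env is pinned on the assumption that some
// defaults may need to be resolved against a concrete account and region.
const stack = new cdk.Stack(app, 'GitHubRunnersDefaults', {
    env: {
        account: process.env.CDK_DEFAULT_ACCOUNT,
        region: process.env.CDK_DEFAULT_REGION,
    },
});

// No vpc, subnetSelection, capacityProvider or imageBuilder is passed:
// each of them falls back to the library's default.
const ecsProvider = new EcsRunnerProvider(stack, 'EcsProvider', {
    labels: ['ecs'],
});

new GitHubRunners(stack, 'GitHubRunners', {
    providers: [ecsProvider],
});

app.synth();
```

If this deploys cleanly, reintroducing the custom VPC and subnet selection one at a time should show which parameter triggers the capacity provider failure.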

excavador commented 10 months ago

I will try. Thank you

excavador commented 10 months ago

@kichik I definitely missed the VPC endpoints in private subnets.

The example in the root "README.md" is confusing (it does not work), and it would be better to remove or adjust it.

I have another problem now, with service-linked roles. This particular issue could be considered a README.md bug.

kichik commented 10 months ago

> @kichik I definitely missed the VPC endpoints in private subnets.

Your original code uses private subnets with egress. Why would VPC endpoints be required in that case?

> The example in the root "README.md" is confusing (it does not work), and it would be better to remove or adjust it.

Can you please point to the specific example?

> I have another problem now, with service-linked roles. This particular issue could be considered a README.md bug.

You mean creating the service linked role for ECS?
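
If the missing role does turn out to be the ECS one (the thread does not confirm this), a sketch of declaring it with CDK might look like the following. The stack name and construct ID are hypothetical, and `ecs.amazonaws.com` is assumed to be the service in question. Creating a service-linked role that already exists in the account typically fails, so this is better kept in a one-off stack than next to the runners.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';

const app = new cdk.App();

// Hypothetical one-off stack that only holds account-wide prerequisites.
const stack = new cdk.Stack(app, 'ServiceLinkedRoles');

// Assumption: the role that is missing is the one for ECS itself.
new iam.CfnServiceLinkedRole(stack, 'EcsServiceLinkedRole', {
    awsServiceName: 'ecs.amazonaws.com',
    description: 'Service-linked role for Amazon ECS',
});

app.synth();
```

The same role can also be created once per account from the CLI with `aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com`.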