cattle-ops / terraform-aws-gitlab-runner

Terraform module for AWS GitLab runners on ec2 (spot) instances
https://registry.terraform.io/modules/cattle-ops/gitlab-runner/aws
MIT License

Spot Fleet doesn't work as expected #1097

Closed Leonidimus closed 8 months ago

Leonidimus commented 8 months ago

Describe the bug

I list multiple EC2 instance types for Spot Fleet workers, but only one instance type is used to generate spot requests. The spot request type is "instance", but I believe it should be of the "fleet" type to be able to request multiple instance types. I've tried docker_machine_version = "0.16.2-gitlab.19-cki.2" and "0.16.2-gitlab.19-cki.4"; same result.

To Reproduce

Configure the module similar to the "Scenario: Use of Spot Fleet" example from the documentation and specify several instance types. Observe that the same instance type is launched for all jobs.

Expected behavior

The AWS spot requests created should be of the "fleet" type, with multiple EC2 instance types.

Configuration used

My terraform config:

```hcl
module "gitlab-runner" {
  source      = "npalm/gitlab-runner/aws"
  version     = "v7.3.1"
  environment = var.gitlab_environment

  vpc_id    = var.aws_vpc_id
  subnet_id = element(var.aws_private_subnets, 0)

  runner_cloudwatch = {
    enable         = true
    retention_days = 60
  }

  runner_gitlab = {
    url = var.gitlab_url
  }

  runner_gitlab_registration_config = {
    registration_token = var.gitlab_registration_token
    tag_list           = var.gitlab_tags
    description        = var.runners_description
    locked_to_project  = "true"
    run_untagged       = "false"
    maximum_timeout    = "7200"
  }

  runner_instance = {
    name       = var.runners_name
    type       = "t3a.large"
    ssm_access = true
    root_device_config = {
      volume_size = 50 # GiB
    }
  }

  runner_install = {
    amazon_ecr_credential_helper = true
    docker_machine_version       = "0.16.2-gitlab.19-cki.2"
  }

  runner_worker = {
    type       = "docker+machine"
    ssm_access = true
  }

  runner_worker_docker_machine_fleet = {
    enable = true
  }

  runner_worker_docker_machine_instance = {
    types        = ["t3a.large", "t3.large", "m5a.large", "m5.large", "m6a.large"]
    subnet_ids   = var.aws_private_subnets
    start_script = file("${path.module}/worker_userdata.sh")
    volume_type  = "gp3"
    root_size    = 50
  }

  runner_worker_docker_options = {
    privileged = true
    volumes = [
      "/var/run/docker.sock:/var/run/docker.sock",
      "/gitlab-runner/docker:/root/.docker",
      "/gitlab-runner/ssh:/root/.ssh:ro",
      "/root/.pypirc:/root/.pypirc",
      "/root/.npmrc:/root/.npmrc"
    ]
  }
}
```
Tiduster commented 8 months ago

Hello @Leonidimus

I understand what you are asking, but this is not how this module works. Let me explain.

If you want to see what the code does, you can check CloudTrail > Event history and look for CreateFleet events.

You will see something like this:

    "eventTime": "2024-03-13T18:50:12Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "CreateFleet",
    "awsRegion": "eu-west-3",
    "sourceIPAddress": ""***********:",
    "userAgent": "aws-sdk-go/1.44.153 (go1.12.9; linux; amd64)",
    "requestParameters": {
        "CreateFleetRequest": {
            "TargetCapacitySpecification": {
                "DefaultTargetCapacityType": "spot",
                "TotalTargetCapacity": 1
            },
            "Type": "instant",
            "SpotOptions": {
                "AllocationStrategy": "price-capacity-optimized",
                "MaxTotalPrice": "0.50"
            },

The important parts are "TotalTargetCapacity": 1 and "Type": "instant". They mean that you want one instance, and that the fleet is destroyed after the instance is created. "AllocationStrategy": "price-capacity-optimized" means that you want the best price with the best capacity.
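For illustration only, here is roughly what that per-job request would look like if expressed as a standalone aws_ec2_fleet Terraform resource. This is not something the module creates; docker+machine issues the equivalent CreateFleet API call for every job, and the names and IDs below are placeholders:

```hcl
# Sketch only: the module does not manage this resource; docker+machine sends
# an equivalent CreateFleet API call for each job.
resource "aws_ec2_fleet" "per_job_sketch" {
  type = "instant" # one-off request; the fleet is gone once the instance is launched

  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity        = 1 # exactly one worker per job
  }

  spot_options {
    allocation_strategy = "price-capacity-optimized"
  }

  launch_template_config {
    launch_template_specification {
      launch_template_name = "gitlab-runner-worker" # placeholder name
      version              = "$Latest"
    }

    # One override per instance type / subnet combination (placeholders here);
    # the real request lists every configured combination.
    override {
      instance_type = "t3a.large"
      subnet_id     = "subnet-aaaaaaaa"
    }
    override {
      instance_type = "t3.large"
      subnet_id     = "subnet-aaaaaaaa"
    }
  }
}
```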

In the LaunchTemplateConfigs section, you will see your choice of instance types:

"LaunchTemplateConfigs": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "gitlab-runner-dev-shr-small-ai-worker-20230510162620868300000001",
                    "Version": "$Latest"
                },
                "Overrides": [
                    {
                        "tag": 1,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3a.medium"
                    },
                    {
                        "tag": 2,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3a.medium"
                    },
                    {
                        "tag": 3,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3a.medium"
                    },
                    {
                        "tag": 4,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.medium"
                    },
                    {
                        "tag": 5,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.medium"
                    },
                    {
                        "tag": 6,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.medium"
                    },
                    {
                        "tag": 7,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5a.large"
                    },
                    {
                        "tag": 8,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5a.large"
                    },
                    {
                        "tag": 9,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5a.large"
                    },
                    {
                        "tag": 10,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.large"
                    },
                    {
                        "tag": 11,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.large"
                    },
                    {
                        "tag": 12,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.large"
                    },
                    {
                        "tag": 13,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c6a.large"
                    },
                    {
                        "tag": 14,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c6a.large"
                    },
                    {
                        "tag": 15,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c6a.large"
                    },
                    {
                        "tag": 16,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5d.large"
                    },
                    {
                        "tag": 17,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5d.large"
                    },
                    {
                        "tag": 18,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5d.large"
                    }
                ],

In my configuration, we use 6 different instance types across 3 AZs.
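If it helps to see where those 18 overrides come from: they look like the plain cross product of the configured instance types and subnets. A minimal Terraform sketch of that expansion, with placeholder values mirroring the event above:

```hcl
locals {
  # Placeholder values; the real lists come from the runner configuration.
  instance_types = ["t3a.medium", "t3.medium", "c5a.large", "t3.large", "c6a.large", "c5d.large"]
  subnet_ids     = ["subnet-aaaaaaaa", "subnet-bbbbbbbb", "subnet-cccccccc"]

  # 6 types x 3 subnets = 18 overrides, one per combination.
  fleet_overrides = [
    for pair in setproduct(local.instance_types, local.subnet_ids) : {
      instance_type = pair[0]
      subnet_id     = pair[1]
    }
  ]
}
```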

The fleet will always launch the same kind of instance, because it considers it the best right now in terms of price and capacity. If capacity becomes low, it will automatically switch the type or AZ when the next docker+machine instance is requested.

This is NOT perfect: you probably want to spread the instances across multiple types right from the start, to reduce the chance of several instances being reclaimed at the same time. Unfortunately, this is not how the software was developed, and we are limited by the VERY OLD and deprecated docker+machine code base :-).

In any case, we use this feature for our production runner fleet, launching 10k+ jobs per day for 200+ developers, and it runs like a charm on eu-west-3, with very few availability incidents.

Best regards,

Do not hesitate if you have any additional questions.

You may also want to improve the CKI codebase if you have some ideas; I will be very happy to test any new release in our setup.

kayman-mk commented 8 months ago

@Tiduster Thanks for explaining that.