cattle-ops / terraform-aws-gitlab-runner

Terraform module for AWS GitLab runners on ec2 (spot) instances
https://registry.terraform.io/modules/cattle-ops/gitlab-runner/aws
MIT License

Runners not scaling, and "Unable to query docker version" #1134

Open joerawr opened 5 months ago

joerawr commented 5 months ago

Describe the bug

version = "7.7.0"

I am not seeing new runners autoscale when jobs are queued up, and docker-machine ls shows errors:

docker-machine ls
NAME                                              ACTIVE   DRIVER      STATE     URL                        SWARM   DOCKER    ERRORS
runner-dibjhdrwq-sre-runner-1717284911-0ac9e80b   -        amazonec2   Running   tcp://10.0.0.168:2376           Unknown   Unable to query docker version: 400 Bad Request: {"message":"client version 1.15 is too old. Minimum supported API version is 1.24, please upgrade your client to a newer version"}

runner-dibjhdrwq-sre-runner-1717286598-360c96b9   -    amazonec2   Running   tcp://10.0.0.167:2376        Unknown   Unable to query docker version: 400 Bad Request: {"message":"client version 1.15 is too old. Minimum supported API version is 1.24, please upgrade your client to a newer version"}

Also, I see Docker 26 on the runner, regardless of what version is specified in the Terraform config and what shows in config.toml:

root@runner-dibjhdrwq-sre-runner-1717284911-0ac9e80b:/var/log# docker version
Client: Docker Engine - Community
 Version:           26.1.3
 API version:       1.45
 Go version:        go1.21.10
 Git commit:        b72abbb
 Built:             Thu May 16 08:33:49 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.3
  API version:      1.45 (minimum version 1.24)

It says minimum version 1.24, which matches the error docker-machine ls reports: "Minimum supported API version is 1.24".

I might be misunderstanding how this works, but I think the mismatch in Docker API versions means the manager can't detect that the runner is busy, so new runners aren't spun up. Am I close?

To Reproduce

Terraform apply

Expected behavior

I have a test pipeline that spins up 12 jobs in parallel, but I only see 2 runners spun up (with idle=1), 2 jobs per runner. Only 2 jobs run at a time and the others stay queued.

I expect 5-7 runners to spin up to pick up the jobs.

Additional context

We can see that the default Ubuntu 20.04 runner is installing Docker 26, per its apt history log:

Start-Date: 2024-06-01  23:36:15
Commandline: apt-get install -y -qq docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-ce-rootless-extras docker-buildx-plugin
Requested-By: ubuntu (1000)
Install: slirp4netns:amd64 (0.4.3-1, automatic), containerd.io:amd64 (1.6.32-1), docker-ce-rootless-extras:amd64 (5:26.1.3-1~ubuntu.20.04~focal), docker-buildx-plugin:amd64 (0.14.0-1~ubuntu.20.04~focal), pigz:amd64 (2.4-1, automatic), docker-compose-plugin:amd64 (2.27.0-1~ubuntu.20.04~focal), docker-ce:amd64 (5:26.1.3-1~ubuntu.20.04~focal), docker-ce-cli:amd64 (5:26.1.3-1~ubuntu.20.04~focal)
End-

I tried specifying Docker 20 and 24, and also removing the variable so the default 18 would install, but none of that takes effect.

docker_version = "public.ecr.aws/docker/library/docker:20"
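# Note: judging by the generated config.toml below, this value only surfaces as the
# [runners.docker] image (the default job container image), so it may not control the
# engine version that docker-machine installs on the worker at all.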

Terraform values:

variable "max_jobs" {
  description = "Number of jobs which can be processed in parallel by the Runner Worker."
  type        = string
  default     = "2"
}

variable "autoscaling_periods" {
  description = "A list of strings representing the periods when the autoscaling should be active."
  type        = list(string)
  default     = ["* * * * * * *"]
}

variable "autoscaling_idle_count" {
  description = "The number of idle runners to keep before scaling down."
  type        = number
  default     = 1
}

variable "autoscaling_idle_scale_factor" {
  description = "The factor by which to scale down the number of runners when idle."
  type        = number
  default     = 1.0
}

variable "autoscaling_idle_count_min" {
  description = "The minimum number of idle runners to keep."
  type        = number
  default     = 1
}

variable "autoscaling_idle_time" {
  description = "The amount of time a runner can be idle before being considered for scaling down."
  type        = number
  default     = 600
}

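For context, this is roughly how those variables feed into our module call. Treat it as a sketch: the attribute names are my reading of the module's variables.tf and may not be exact.

module "gitlab_runner" {
  source  = "cattle-ops/gitlab-runner/aws"
  version = "7.7.0"

  # VPC, subnet, and registration settings omitted

  runner_worker = {
    type     = "docker+machine"
    max_jobs = var.max_jobs
  }

  runner_worker_docker_machine_autoscaling_options = [
    {
      periods           = var.autoscaling_periods
      idle_count        = var.autoscaling_idle_count
      idle_scale_factor = var.autoscaling_idle_scale_factor
      idle_count_min    = var.autoscaling_idle_count_min
      idle_time         = var.autoscaling_idle_time
      timezone          = "America/Los_Angeles"
    }
  ]
}
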
config.toml

    [runners.docker]
    disable_cache = false
    image = "public.ecr.aws/docker/library/docker:20"
    privileged = true
    pull_policy = ["always"]
    shm_size = 0
    tls_verify = false
    volumes = ["/cache","/certs/client"]

  [runners.machine]
    IdleCount = 0
    IdleTime = 600

    MachineDriver = "amazonec2"
    MachineOptions = [
      <redacted>
      "amazonec2-request-spot-instance=false",
      "amazonec2-metadata-token=required",
      "amazonec2-metadata-token-response-hop-limit=2",
    ]
    MaxGrowthRate = 5
    [[runners.machine.autoscaling]]
      IdleCount = 10
      IdleCountMin = 1
      IdleTime = 600
      Periods = ["* * * * * * *"]
      Timezone = "America/Los_Angeles"

Let me know what other info to supply and what else I can try.

kayman-mk commented 5 months ago

Did this happen after upgrading to 7.7 or did you set up a new runner from scratch?

joerawr commented 5 months ago

Great question! I should have addressed this.

Fresh install. I have 3 isolated installs running, each with the same behavior.

We are migrating away from 6.5.2, where each runner was configured as a single instance attached at the group level (7 groups, so 7 isolated runners). We've been using the cattle-ops Terraform module for almost 2 years, but this is our first try at using it with autoscaling.

kayman-mk commented 5 months ago

So the difference between the installations is the autoscaling, right? Everything else remained the same, especially the images used for the runner machine and the workers?

joerawr commented 5 months ago

VPCs, IAM, and security groups stayed the same. Other than that, quite different.

For our 6.5.x runners, we are using Method 3 from the README:

  1. GitLab CI docker runner

In this scenario, Docker (not Docker Machine) is used to schedule the builds. Builds will run on the same EC2 instance as the agent. No auto-scaling is supported.

So we only have the Amazon Linux instance.

For 7.7.0 we started with examples/runner-pre-registered/main.tf, and commented out all the network and security group resources, since those are already defined.

The issue we are seeing is that the runner manager will spin up as many Ubuntu worker instances as we specify in IdleCountMin, but no more than that.

      IdleScaleFactor = 1.0
      idle_count = 10
      IdleCountMin = 2

As a test I specified Ubuntu 18.04 in main.tf, and now docker-machine ls no longer shows errors, and we can see that Docker 24 is installed:

  runner_worker_docker_machine_ami_filter = {
    name = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }
# docker-machine ls
NAME                                              ACTIVE   DRIVER      STATE     URL                        SWARM   DOCKER    ERRORS
runner-dibjhdrwq-sre-runner-1717450544-2d6579f7   -        amazonec2   Running   tcp://10.0.0.143:2376           v24.0.2
runner-dibjhdrwq-sre-runner-1717450544-f4fa850e   -        amazonec2   Running   tcp://10.0.0.11:2376            v24.0.2

However, I still don't see autoscaling, just the configured number of idle runners. That eliminates my theory that the docker-machine errors were causing the lack of autoscaling.

How does the runner manager determine when to spin up new workers?

I attached a snippet of the logs; we can see jobs queued for 500 and 700 seconds while gitlab-runner reports "Using existing docker-machine": runner-logs.txt

joerawr commented 5 months ago

I made progress on this. The max_jobs variable was limiting the number of Runners the manager would spin up.

from the module's variables.tf

variable "runner_worker" { description = <<-EOT For detailed information, check https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section. environment_variables = List of environment variables to add to the Runner Worker (environment). max_jobs = Number of jobs which can be processed in parallel by the Runner Worker.

I had max_jobs set to 2, and what I saw was that only two runners were created and only two jobs were handled at a time for a pipeline. After changing this to zero, the runners now autoscale: 12 jobs in parallel trigger creation of 12 runners.

I am confused by the wording here in the variables.tf file. This setting seems to limit the number of workers that are spun up when autoscaling is used. This max_jobs value seems to map back to the limit setting in the GitLab docker-machine docs:

"Limit how many jobs can be handled concurrently by this registered runner. 0 (default) means do not limit."

This is what I wanted: to limit each runner to 2 jobs in parallel.

However: "View how this setting works with the Docker Machine executor (for autoscaling)." makes it clear the behavior I have been seeing:

To limit the number of virtual machines (VMs) created by the Docker Machine executor, use the limit parameter in the [[runners]] section of the config.toml file.

Perhaps the description of max_jobs can be clarified in the variables.tf file?
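For anyone else hitting this, the change that got autoscaling working for me was zeroing out max_jobs in the runner_worker block. A sketch of the relevant part (other attributes omitted):

  runner_worker = {
    type     = "docker+machine"
    # 0 = do not limit. Per the GitLab docs quoted above, a non-zero value appears to become
    # "limit" in [[runners]], which with the Docker Machine executor also caps how many
    # worker VMs are created.
    max_jobs = 0
  }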

Next I am trying to understand the difference between idle_count in these two settings:

runner_worker_docker_machine_instance = {
  # idle_count = Number of idle Runner Worker instances (not working for the Docker Runner Worker) (IdleCount).
  idle_count = 3
}

and

runner_worker_docker_machine_autoscaling_options = [
  {
    idle_count = 0
    periods    = ["* * * * * * *"]
  }
]

Seems that idle_count in runner_worker_docker_machine_autoscaling_options is the one that controls how many idle runners, yes?

LTegtmeier commented 3 months ago

We're seeing the same error message in docker-machine ls. Our runner is able to scale up and down; I can see the number of instances changing in the EC2 console. But the runner is printing out these errors:

{"level":"error","msg":"UnsupportedOperation: You can't stop the Spot Instance 'i-Y' because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.","name":"runner-X","operation":"stop","time":"2024-08-07T22:13:35Z"}
{"level":"error","msg":"\tstatus code: 400, request id: 69f08609-a312-4342-88ed-ebd35e30f97c","name":"runner-X","operation":"stop","time":"2024-08-07T22:13:35Z"}
{"error":"read |0: file already closed","level":"warning","msg":"Problem while reading command output","time":"2024-08-07T22:13:35Z"}
{"error":"exit status 1","level":"warning","lifetime":1365297031454,"msg":"Error while stopping machine","name":"runner-X","reason":"too many idle machines","time":"2024-08-07T22:13:35Z","used":336125,"usedCount":5}

Is it related to this issue?

How does docker-machine install Docker on the new nodes? Can we control which version it installs, so we don't end up with an incompatible one (27.1.1 in our case)?

kayman-mk commented 3 months ago

@LTegtmeier Do you use your own AMIs for the Runner and the Workers? Can you try the default ones?

I think docker-machine uses an SSH connection to the worker and installs all prerequisites before starting the pipeline.
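If you need to pin the engine version on the workers, one untested idea is docker-machine's generic engine-install-url option, passed through the module's MachineOptions pass-through. A sketch, assuming the variable is still named runner_worker_docker_machine_ec2_options in 7.x (it was docker_machine_options before the rename; please check variables.tf) and using a Rancher install script URL as an example (verify the exact URL/version):

  # Extra options appended to MachineOptions in config.toml; docker-machine should then
  # run this install script on each new worker instead of pulling the latest docker-ce
  # from the apt repos.
  runner_worker_docker_machine_ec2_options = [
    "engine-install-url=https://releases.rancher.com/install-docker/24.0.sh",
  ]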

LTegtmeier commented 3 months ago

Can you try the default ones?

We use the default filters for both types with version 7.9.0 of the module.

Looking again, the default has changed since I last copied it into a parameter (which I did to make it easier to toggle between AMD and ARM). These runners are AMD, and the runner AMI uses the older amzn2-ami-hvm-2.*-arm64-gp2 filter. The worker filter is the default ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*.

We end up with ami-0a2dd45de938754ee for the runner and ami-0f2175c525a037449 for the workers.

uses an SSH connection to the worker and installs all prerequisites before starting the pipeline.

That's what it looks like. If I start the AMI on a test EC2 instance, Docker isn't installed at all. Docker Machine then installs it, but at a version that's not fully compatible with the docker-machine version.

LTegtmeier commented 1 month ago

We never solved this issue with Docker Machine or understood the root cause. We moved to the docker-autoscaler executor.

kayman-mk commented 1 month ago

@LTegtmeier I have fleeting enabled here. Could you please try with these AMIs?

Runner: ami-00f07845aed8c0ee7
Worker: ami-02c93b9f4cd7656e4

The module configuration looks pretty straightforward (some of the options removed):

  runner_worker_docker_machine_fleet = {
    enable = true
  }

  runner_worker_docker_machine_instance = {
    name_prefix              = "${var.runner_settings.runner_name}-${each.value.availability_zone}"
    types                    = var.runner_settings.worker_instance_types
  }

  runner_worker = {
    ssm_access            = true
    request_concurrency   = 1
    type                  = "docker+machine"
  }