joerawr opened 5 months ago
Did this happen after upgrading to 7.7 or did you set up a new runner from scratch?
Great question! I should have addressed this.
Fresh install. I have 3 isolated installs running, each with the same behavior.
We are migrating away from 6.5.2, which was configured as a single instance attached at the group level (7 groups, so 7 isolated runners). We've been using the Cattle Ops Terraform module for almost 2 years, but this is our first try at using it with autoscaling.
So the difference between the installations is the autoscaling, right? Everything else remained the same, especially the images used for the runner machine and the workers?
VPCs, IAM, and security groups stayed the same. Other than that, quite different.
For our 6.5.x runners, we are using Method 3 from the README:
> - GitLab CI docker runner
>   In this scenario not docker machine is used but docker to schedule the builds. Builds will run on the same EC2 instance as the agent. No auto-scaling is supported.
So we only have the Amazon Linux instance.
For 7.7.0 we started with `examples/runner-pre-registered/main.tf` and commented out all of the network and security groups, since those are already defined.
The issue we are seeing is that the runner manager will spin up as many Ubuntu worker instances as we specify in `IdleCountMin`, but no more than that:

```
IdleScaleFactor = 1.0
idle_count      = 10
IdleCountMin    = 2
```
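For reference, my (possibly wrong) mental model of how these three values should interact, written out as a throwaway Terraform expression:

```hcl
# Illustration only -- my own reading, not taken verbatim from the runner docs:
# idle target = min(IdleCount, max(IdleCountMin, in_use * IdleScaleFactor))
locals {
  in_use_workers        = 4 # example value
  expected_idle_workers = min(10, max(2, ceil(local.in_use_workers * 1.0)))
}
```

With those values I would expect the idle pool to grow as jobs claim workers, which is not what we observe.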
As a test I specified Ubuntu 18.04 in `main.tf`, and now `docker-machine ls` no longer shows errors; we can see that Docker 24 is installed:
```hcl
runner_worker_docker_machine_ami_filter = {
  name = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
}
```
```
# docker-machine ls
NAME                                              ACTIVE   DRIVER      STATE     URL                       SWARM   DOCKER    ERRORS
runner-dibjhdrwq-sre-runner-1717450544-2d6579f7   -        amazonec2   Running   tcp://10.0.0.143:2376             v24.0.2
runner-dibjhdrwq-sre-runner-1717450544-f4fa850e   -        amazonec2   Running   tcp://10.0.0.11:2376              v24.0.2
```
However, I still don't see auto-scaling, just the configured number of idle runners. That rules out my theory that the docker-machine errors were causing the lack of autoscaling.
How does the runner manager determine when to spin up new workers?
I attached a snippet of the logs; we can see jobs queued for 500 and 700 seconds while gitlab-runner reports "Using existing docker-machine": runner-logs.txt
I made progress on this. The `max_jobs` variable was limiting the number of runners the manager would spin up. From the module's `variables.tf`:
```hcl
variable "runner_worker" {
  description = <<-EOT
    For detailed information, check https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section.
    environment_variables = List of environment variables to add to the Runner Worker (environment).
    max_jobs              = Number of jobs which can be processed in parallel by the Runner Worker.
```
I had `max_jobs` set to 2, and only two runners and two jobs would be handled at a time for a pipeline. Changing this to zero, the runners now autoscale: 12 jobs in parallel trigger the creation of 12 runners.
I am confused by the wording here in the `variables.tf` file. This setting seems to limit the number of workers that are spun up when autoscaling is used. The `max_jobs` value seems to map back to the `limit` setting in the GitLab docker-machine docs:

> Limit how many jobs can be handled concurrently by this registered runner. 0 (default) means do not limit.
Limiting each runner to 2 jobs in parallel is exactly what I wanted. However, "View how this setting works with the Docker Machine executor (for autoscaling)" makes the behavior I have been seeing clear:

> To limit the number of virtual machines (VMs) created by the Docker Machine executor, use the limit parameter in the [[runners]] section of the config.toml file.
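In other words, this is roughly what the relevant part of my configuration looks like now (a minimal sketch showing only the fields discussed here):

```hcl
runner_worker = {
  type = "docker+machine"

  # 0 = unlimited. Any positive value maps to `limit` in the [[runners]]
  # section of config.toml, which also caps the number of autoscaled workers.
  max_jobs = 0
}
```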
Perhaps the description of max_jobs can be clarified in the variables.tf file?
Next I am trying to understand the difference between idle_count in these two settings:
```hcl
runner_worker_docker_machine_instance = {
  idle_count = 3 # idle_count = Number of idle Runner Worker instances (not working for the Docker Runner Worker) (IdleCount).
}
```
and

```hcl
runner_worker_docker_machine_autoscaling_options = [
  {
    idle_count = 0
    periods    = [" *"]
  }
]
```
It seems that `idle_count` in `runner_worker_docker_machine_autoscaling_options` is the one that actually controls how many idle runners there are, yes?
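If that's right, this is the shape I would expect a working configuration to take (a sketch of my current understanding, not verified against the module source; the `periods` string is a placeholder, so check the module docs for the exact schedule format):

```hcl
# Assumption: the schedule-based block overrides the instance-level value,
# so the idle_count below is what really determines the idle pool size.
runner_worker_docker_machine_instance = {
  idle_count = 3 # appears to be ignored in favour of the block below
}

runner_worker_docker_machine_autoscaling_options = [
  {
    periods    = ["* * * * *"] # placeholder "always" schedule
    idle_count = 3             # the value that actually seems to apply
  }
]
```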
We're seeing the same error message in `docker-machine ls`. Our runner is able to scale up and down; I can see the number of instances changing in the EC2 console. But the runner is printing out these errors:
```
{"level":"error","msg":"UnsupportedOperation: You can't stop the Spot Instance 'i-Y' because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.","name":"runner-X","operation":"stop","time":"2024-08-07T22:13:35Z"}
{"level":"error","msg":"\tstatus code: 400, request id: 69f08609-a312-4342-88ed-ebd35e30f97c","name":"runner-X","operation":"stop","time":"2024-08-07T22:13:35Z"}
{"error":"read |0: file already closed","level":"warning","msg":"Problem while reading command output","time":"2024-08-07T22:13:35Z"}
{"error":"exit status 1","level":"warning","lifetime":1365297031454,"msg":"Error while stopping machine","name":"runner-X","reason":"too many idle machines","time":"2024-08-07T22:13:35Z","used":336125,"usedCount":5}
```
Is it related to this issue?
How does docker-machine install Docker on the new nodes? Can we control it so that it doesn't install an incompatible version (27.1.1 in our case)?
@LTegtmeier Do you use your own AMIs for the Runner and the Workers? Can you try the default ones?
I think docker-machine uses an SSH connection to the worker and installs all prerequisites before starting the pipeline.
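If the goal is to pin the engine version that docker-machine installs, docker-machine itself has an `--engine-install-url` create option. Whether and how this module forwards extra create options is an assumption on my side, so treat the variable name and URL below as placeholders and check `variables.tf` first:

```hcl
# Assumption: the module passes these strings through as docker-machine
# create options (the old docker_machine_options mechanism). Verify the
# exact variable name before relying on this.
runner_worker_docker_machine_ec2_options = [
  # Example only: install the engine from a pinned install script instead of the latest release.
  "engine-install-url=https://releases.rancher.com/install-docker/24.0.sh",
]
```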
> Can you try the default ones?
We use the default filters for both types with version 7.9.0 of the module.
Looking again, the default changed since I last copied it into a parameter (I did that to make it easier to toggle between AMD and ARM). These runners are AMD, and the runner AMI uses the older `amzn2-ami-hvm-2.*-arm64-gp2` filter. The worker filter is the default `ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*`.
We end up with `ami-0a2dd45de938754ee` for the runner and `ami-0f2175c525a037449` for the workers.
> docker-machine uses an SSH connection to the worker and installs all prerequisites before starting the pipeline.
That's what it looks like. If I start the AMI on a test EC2 instance, Docker isn't installed at all. Docker Machine correctly installs it, but at a version that's not fully compatible with the docker-machine version.
We never solved this issue with Docker Machine or understood the root cause. We moved to the `docker-autoscaler` executor.
@LTegtmeier I have fleeting enabled here. Could you please try with these AMIs?
- Runner: `ami-00f07845aed8c0ee7`
- Worker: `ami-02c93b9f4cd7656e4`
The module configuration looks pretty straightforward (some of the options removed):
```hcl
runner_worker_docker_machine_fleet = {
  enable = true
}

runner_worker_docker_machine_instance = {
  name_prefix = "${var.runner_settings.runner_name}-${each.value.availability_zone}"
  types       = var.runner_settings.worker_instance_types
}

runner_worker = {
  ssm_access          = true
  request_concurrency = 1
  type                = "docker+machine"
}
```
Describe the bug
`version = "7.7.0"`
I am not seeing new runners autoscale when jobs are queued up, and `docker-machine ls` has errors. I also see Docker 26 on the runner, regardless of what version is specified in the Terraform and what shows in the config.toml.

It says minimum version 1.24, which is the same error `docker-machine ls` reports: "Minimum supported API version is 1.24". I might be misunderstanding how this works, but I think the mismatch in Docker API versions means it can't be detected that the runner is busy, so the new runners aren't spun up. Am I close?
To Reproduce
Run `terraform apply`.
Expected behavior
I have a test pipeline that spins up 12 jobs in parallel, but I only see 2 runners spun up with idle=1 and 2 jobs per runner. Only 2 jobs are running and the others are queued.
I expect 5-7 runners to spin up to pick up the jobs.
Additional context
We can see the default Ubuntu 20.04 image installing Docker 26 in the Ubuntu runner logs.
I tried specifying Docker 20, Docker 24, and removing the variable so the default 18 would install, but I don't see it taking effect:

```hcl
docker_version = "public.ecr.aws/docker/library/docker:20"
```
Terraform values:
config.toml
Let me know any more info to supply and what else I can try.