cattle-ops / terraform-aws-gitlab-runner

Terraform module for AWS GitLab runners on ec2 (spot) instances
https://registry.terraform.io/modules/cattle-ops/gitlab-runner/aws
MIT License

feat: add docker autoscaler executor #1118

Closed mmoutama09 closed 3 months ago

mmoutama09 commented 7 months ago

Description

Provides a new executor option based on GitLab's new Docker autoscaler executor. I have only been using the fleeting plugin for AWS.

Prerequisite: Docker must already be installed on the AMI used by the worker machines (the Docker autoscaler does not install it, unlike Docker Machine). Additionally, the user used to connect to the workers must be added to the docker group.
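
For illustration, here is a minimal Packer (HCL2) sketch of such an AMI; the Ubuntu base image, region, instance type, and ubuntu user are example assumptions, not part of this PR:

packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.2.0"
    }
  }
}

source "amazon-ebs" "docker_worker" {
  region        = "eu-central-1" # example region
  instance_type = "t3.micro"
  ssh_username  = "ubuntu"
  ami_name      = "gitlab-runner-docker-worker-${formatdate("YYYYMMDDhhmmss", timestamp())}"

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["099720109477"] # Canonical
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.docker_worker"]

  provisioner "shell" {
    inline = [
      "curl -fsSL https://get.docker.com | sudo sh", # install the Docker engine
      "sudo usermod -aG docker ubuntu",              # allow the connect user to talk to Docker
    ]
  }
}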

Related to issue https://github.com/cattle-ops/terraform-aws-gitlab-runner/issues/624

Verification

Built an AMI with Docker based on Amazon Linux 2023. Set up the new executor according to the example. Works!

github-actions[bot] commented 7 months ago

Hey @mmoutama09! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE.

The following ChatOps commands are supported:

Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command.

This message was generated automatically. You are welcome to improve it.

Tiduster commented 6 months ago

Hi @kayman-mk. I am a colleague of @mmoutama09 and @Kadeux.

This change could be the next major release of the module.

GitLab is still on track to make their plugin GA this summer: https://gitlab.com/groups/gitlab-org/-/epics/6995

We are still NOT using this version in our production setup, but we will deploy it on part of our runners in June.

What should be the next steps for this PR?

Best regards,

kayman-mk commented 6 months ago

Sounds quite promising to get rid of the outdated docker machine. As soon as GitLab has published their module, we can integrate it here.

As far as I can see, the docker machine executor can still be used, so we can create a feature release. Before the next major release, I will check whether we can get rid of docker machine to simplify the code.

Could you please post the settings to test this change?

At the moment I am working on #1117. That change will be merged first, to support zero downtime during deployment of a new version.

Tiduster commented 6 months ago

Thanks @kayman-mk for your answer.

I was not aware of this zero-downtime PR, very interesting; we can test it as well in our environment. I will try to look at it and give feedback if I find anything interesting.

kayman-mk commented 5 months ago

@Tiduster Could you please post a minimal configuration showing which AMIs to use to get this up and running?

kayman-mk commented 5 months ago

Just tried it, but with no success. The Runner is up and working, but when a job is processed, GitLab shows

Running with gitlab-runner 16.4.2 (e77af703)
  on prod-gitlab-ci-runner-test-Gitlab-Runner-TEST-A PsqsZYpLQ, system ID: s_0a07de49d04b
Resolving secrets 00:00
Preparing the "docker-autoscaler" executor 00:50
Dialing instance i-05caf9a7284ccaxxx...
Instance i-05caf9a7284ccaxxx connected
ERROR: Failed to remove network for build
ERROR: Preparation failed: error during connect: Get "http://internel.tunnel.invalid/v1.24/info": dialing environment connection: ssh: rejected: connect failed (open failed) (docker.go:826:0s)

The Runner logs in CloudWatch show

{
    "external-address": "",
    "instance-id": "i-05caf9a7284ccaxxx",
    "internal-address": "100.64.30.16",
    "job": 5314382,
    "level": "info",
    "msg": "Dialing instance",
    "project": 987,
    "runner": "PsqsZYpLQ",
    "time": "2024-06-06T07:40:32Z",
    "use-external-address": true
}
{"error":"networksManager is undefined","job":5314382,"level":"error","msg":"Failed to remove network for build","network":"","project":987,"runner":"PsqsZYpLQ","time":"2024-06-06T07:40:50Z"}

The first error seems to be related to use_external_addr = true in the config. Changed to false.

And I noticed that Docker was not installed on the Runner and ubuntu was not part of the docker group. After fixing that, my job was executed. AMI is ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240531, but I have no idea how to change it.

mmoutama09 commented 5 months ago

@kayman-mk The installation of Docker is now mandatory indeed; I've mentioned it in the usage.md file (along with adding the user to the docker group). I reused all the variables from runner_worker_docker_machine_instance to avoid duplicating multiple variables. However, we could do it differently if we don't mind having multiple if conditions in a locals block to determine which variable should be taken. What do you think?
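
For illustration, a rough sketch of that locals-based alternative (the runner_worker.type lookup is an assumption about the module's variables, purely to show the idea):

locals {
  # Hypothetical: pick the worker AMI filter based on which executor is configured.
  worker_ami_filter = (
    var.runner_worker.type == "docker-autoscaler"
    ? var.runner_worker_docker_autoscaler_ami_filter
    : var.runner_worker_docker_machine_ami_filter
  )
}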

Tiduster commented 5 months ago

@kayman-mk

And I noticed that Docker was not installed on the Runner and ubuntu was not part of the docker group. After fixing that, my job was executed. AMI is ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240531, but I have no idea how to change it.

On our side we build a custom AMI from Ubuntu and add the Docker package manually. The Docker autoscaler does not do this by default, so it requires an AMI with the Docker engine installed to work.

We reused runner_worker_docker_machine_ami_filter and runner_worker_docker_machine_ami_owners for this so as not to duplicate variables. We can create new variables if you prefer.

@mmoutama09 added some information about this in usage.md.

Best regards,

mmoutama09 commented 5 months ago

@kayman-mk I've updated my code to separate docker-autoscaler from docker+machine.

To use the new docker-autoscaler we must provide an AMI with Docker installed, and the user used by the autoscaler to connect to the workers must be added to the docker group.

The variable docker-registry-mirror is no longer provided as it is not part of the runner autoscaler configuration, but we could add it directly to the AMI.

kayman-mk commented 4 months ago

Hmm, the need for a custom-built image doesn't sound good to me at first glance. Any chance to use a pre-existing AMI instead? Or can we install Docker on the fly?

In case we want to host this AMI: can you provide a build script (Packer?)?

kayman-mk commented 4 months ago

@Tiduster Could you please share the Packer scripts to build the AMI? It would be a good idea to have them available and/or publish an AMI here.

mmoutama09 commented 4 months ago

@kayman-mk Here is a Packer script to build the image with Ubuntu 22 as a base: ubuntu-docker.json

We could also try to install it at launch using user data, but we would have to add a lifecycle hook to prevent the autoscaler from connecting to the instance too early. What do you think?

mmoutama09 commented 4 months ago

@kayman-mk I don't know if my previous suggestion was clear enough: one requirement for the Docker autoscaler is to have Docker installed on the workers so that it can connect to the instances and launch the CI jobs. Docker Machine was doing this for us.

We tried to install it with user data in our launch template, but the installation completes too late for the Docker autoscaler, as the instance is already considered ready by the autoscaler. Therefore, the easiest way is to start with an image that already has Docker installed.

However, if this is an issue, we could use user data to install Docker and ensure that the autoscaler only tries to connect to the worker once the installation is done. The autoscaling group needs to wait before advertising that the worker is ready. For this, we would configure a lifecycle hook on the autoscaling group that waits for a signal from the instance indicating that Docker is installed. The only inconvenience is that this increases the time spent waiting for the instance to be ready.
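
For illustration, a rough sketch of that approach (resource names, the ASG name, the region, and the ubuntu user are assumptions, not part of this PR; the instance profile would also need autoscaling:CompleteLifecycleAction):

# Keep new workers in the pending state until Docker is installed (sketch only).
resource "aws_autoscaling_lifecycle_hook" "wait_for_docker" {
  name                   = "wait-for-docker"
  autoscaling_group_name = "gitlab-runner-workers" # hypothetical ASG name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_LAUNCHING"
  heartbeat_timeout      = 300
  default_result         = "ABANDON"
}

# User data installs Docker and then completes the lifecycle action, so the
# fleeting plugin only sees the instance once Docker is ready.
locals {
  worker_user_data = <<-EOT
    #!/bin/bash
    set -e
    curl -fsSL https://get.docker.com | sh
    usermod -aG docker ubuntu
    TOKEN=$(curl -sX PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    INSTANCE_ID=$(curl -sH "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
    aws autoscaling complete-lifecycle-action \
      --lifecycle-hook-name wait-for-docker \
      --auto-scaling-group-name gitlab-runner-workers \
      --lifecycle-action-result CONTINUE \
      --instance-id "$INSTANCE_ID" \
      --region eu-central-1
  EOT
}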

What are your thoughts on this?

kayman-mk commented 4 months ago

What comes to my mind is that we should provide an AMI ready to use with the new plugin. So I am thinking of creating a new project here and adding your Packer script. The AMI can be hosted in one of my company's AWS accounts.

Without an AMI it becomes hard to use this feature, right? It doesn't work out of the box.

Tiduster commented 4 months ago

In our mind, every company out there builds its own set of AMIs, tailored to its needs. If you provide an AMI, I will never use it in our company, because it would be against our security policies. I would be surprised if anyone used an unofficial external AMI in a production setup :-).

We tried to find an official AMI with docker pre-installed, but with no success. On our side it doesn't matter, as we just use our own.

We can install Docker on the fly with a lifecycle hook, but I wouldn't recommend it, because it will be slower and riskier than a baked AMI.

Best regards,

kayman-mk commented 4 months ago

@Tiduster Yeah, same here. We discussed it too. Maybe we shouldn't publish the AMI, but only the Packer script? That way it is easy to create a new AMI, as you don't have to start from scratch.

Tiduster commented 4 months ago

I understand we want the module to be up and running with the default configuration. Even if you write comprehensive documentation, the extra step will be an entry barrier for new users.

I did some testing with the official ECS-optimized AWS AMI based on AL 2023:

[root@ip-172-18-35-244 ~]# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
[root@ip-172-18-35-244 ~]# rpm -qa | grep docker
docker-25.0.3-1.amzn2023.0.2.x86_64
[root@ip-172-18-35-244 ~]# docker --version
Docker version 25.0.3, build 4debf41
[root@ip-172-18-35-244 ~]# df -h
Filesystem        Size  Used Avail Use% Mounted on
devtmpfs          4.0M     0  4.0M   0% /dev
tmpfs             204M     0  204M   0% /dev/shm
tmpfs              82M  472K   82M   1% /run
/dev/nvme0n1p1     30G  2.5G   28G   9% /
tmpfs             204M     0  204M   0% /tmp
/dev/nvme0n1p128   10M  1.3M  8.7M  13% /boot/efi
tmpfs              41M     0   41M   0% /run/user/0

It contains Docker by default, and it's not too bloated.

I initially didn't want to use it because it's not really a clean AMI with only Docker, but I understand now it's better than nothing, and it will be useful for some users :-).

Official documentation: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html

AMI: https://aws.amazon.com/marketplace/pp/prodview-76p4vln3mhsj4

Let's use it by default?

kayman-mk commented 3 months ago

That seems to be a good solution. However, I can't get it running.

Dialing instance i-0903db17458fc046b...
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
Will be retried in 3s ...

Config

  runner_worker_docker_autoscaler = {
    fleeting_plugin_version = "0.4.0"
  }

  runner_worker_docker_autoscaler_ami_owners = ["591542846629"]
  runner_worker_docker_autoscaler_ami_filter = {
    name   = ["al2023-ami-ecs-hvm-2023.0.20240723-kernel-6.1-x86_64"]
  }
  runner_worker_docker_autoscaler_role = {
    additional_tags = local.tags
  }
  runner_worker_docker_autoscaler_asg = {
    subnet_ids = var.subnet_ids
  }

EDIT: Could be this external_address thing from above

kayman-mk commented 3 months ago

OK, nice. Running on my machines with the AMI from @Tiduster.

TODO:

kayman-mk commented 3 months ago

Why do we have the sg_ingresses introduced here? Just asking because there is no way to do the same for docker+machine

kayman-mk commented 3 months ago

@Tiduster, @mmoutama09 How do I set all the Docker options (runner_worker_docker_options) and the registry mirror?

EDIT: Docker works via old Docker options.
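
For reference, a minimal sketch of the old-style options (assuming the existing docker+machine field names such as privileged and volumes carry over unchanged to the autoscaler path):

  runner_worker_docker_options = {
    privileged = true        # run jobs in privileged containers (example value)
    volumes    = ["/cache"]  # default cache volume
  }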

kayman-mk commented 3 months ago

Hm, tried again with version 1.0.0 of the plugin. Still have the connection issue described above.

Preparation failed: preparing environment: dial ssh: after retrying 30 times: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

Checked the security groups and did an SSH from the agent to the worker. Same error. So at least there is no problem with creating the connection. And I do not see any other errors in the log.

EDIT: Not sure how this works internally. I tried sending a temporary key and logging in with that one. Still no success.

[root@ip-192-64-74-36 bin]# aws ec2-instance-connect send-ssh-public-key --instance-id i-015020b95c394df73 --availability-zone "eu-central-1b" --instance-os-user ec2-user --ssh-public-key file:///root/.ssh/id_ed25519.pub --region eu-central-1
{
    "RequestId": "2c73837e-cf97-42f5-a415-0b0639a8b196",
    "Success": true
}
[root@ip-192-64-74-36 bin]# ssh ec2-user@192.64.35.158
The authenticity of host '192.64.35.158 (192.64.35.158)' can't be established.
ED25519 key fingerprint is SHA256:i8JpNRBtjgOCatirWNpfJgMbDQt9Yp5ZtgHHhMD0pvk.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '192.64.35.158' (ED25519) to the list of known hosts.
ec2-user@192.64.35.158: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

AMI is amazon/al2023-ami-ecs-hvm-2023.0.20240723-kernel-6.1-x86_64

Tiduster commented 3 months ago

It uses EC2 Instance Connect in the background, as visible in the recommended IAM policy: https://gitlab.com/gitlab-org/fleeting/plugins/aws#recommended-iam-policy
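
For reference, the relevant permissions roughly look like this (a sketch based on the linked documentation; check the link for the authoritative policy):

data "aws_iam_policy_document" "fleeting_plugin_aws" {
  statement {
    effect = "Allow"
    actions = [
      "autoscaling:SetDesiredCapacity",
      "autoscaling:TerminateInstanceInAutoScalingGroup",
      "ec2:DescribeInstances",
      "ec2-instance-connect:SendSSHPublicKey", # needed for the SSH handshake above
    ]
    resources = ["*"] # scope this down to the worker ASG/instances in practice
  }
}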

We didn't have this issue on our side, maybe something is missing in the module.

We can fix it on Monday :-).

Tiduster commented 3 months ago

After more testing, the ECS official AMI is just not compatible.

We didn't dig too deep, because since it's not working out of the box, using this AMI is pointless in our use case :-(. We assume there is something in the default SSH configuration that is preventing the fleeting plugin from connecting to the worker when this AMI is in use.

We DO NOT have this issue with a simple custom AMI pre-installed with docker, using Ubuntu or Amazon Linux 2023.

Unfortunately, baking a custom AMI is once again mandatory.

kayman-mk commented 3 months ago

To be merged now. Before the release I will do a last check. Everything is working out of the box. Provisioning new machines seems to be 25% faster. I love it!

Tiduster commented 3 months ago

Thanks for all the fixes. We will switch part of our production setup to this new version ASAP to detect any remaining issues.

Hopefully it has fewer bugs than docker+machine ^^