Dynamic Docker Linux agents sometimes fail to launch

Kalmalyzer commented 3 years ago

There is a race condition in the Linux agent VMs:

sshd is set to start automatically in the VM image itself
The cloud-init script stops sshd at the beginning, then does its initialization, and then starts sshd again
This can lead to a race in one of two ways: A) sshd runs for a short time window at the start of cloud-init, before the script stops it, or B) cloud-init's stop of sshd happens before sshd autostarts, in which case sshd then starts and remains running throughout the rest of the cloud-init script
Jenkins uses "able to log in via SSH" as an indicator that the agent is ready to take commands
If sshd happens to be running prematurely, and Jenkins connects and logs in via SSH before the cloud-init script has completed, then the next Jenkins step is to check the Java version; this fails because cloud-init hasn't yet placed the required files in the filesystem
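
The two orderings above can be sketched as a small timing model. This is only an illustration, not code from the VM image or the cloud-init script; the function name and the example timestamps are made up, and it assumes cloud-init's stop always comes before its final start of sshd.

```python
from typing import Optional, Tuple


def premature_ssh_window(
    autostart_at: float, stop_at: float, init_done_at: float
) -> Optional[Tuple[float, float]]:
    """Return the (start, end) interval in which sshd accepts logins
    before cloud-init has finished, or None if there is no such interval.

    All arguments are seconds since boot (hypothetical values):
      autostart_at  - when the image's own autostart brings sshd up
      stop_at       - when cloud-init issues its "stop sshd"
      init_done_at  - when cloud-init finishes and starts sshd on purpose
    """
    if autostart_at >= init_done_at:
        # sshd did not come up on its own before cloud-init finished.
        return None
    if autostart_at < stop_at:
        # Case A: sshd runs briefly until cloud-init stops it.
        return (autostart_at, stop_at)
    # Case B: the stop was a no-op because sshd was not running yet;
    # sshd then autostarts and stays up for the rest of cloud-init.
    return (autostart_at, init_done_at)


print(premature_ssh_window(1.0, 2.0, 30.0))  # case A -> (1.0, 2.0)
print(premature_ssh_window(3.0, 2.0, 30.0))  # case B -> (3.0, 30.0)
```

Any Jenkins connection attempt that lands inside the returned interval can log in before the agent files exist.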

When the above happens, Jenkins considers the agent to have failed to launch. Jenkins deletes the agent shortly thereafter, the GCE plugin stops the VM (or deletes it if there are too many VMs of that type already), and the GCE plugin triggers provisioning again.

This happens in maybe 10% of launches. It delays job start by a minute or more. A VM that failed to launch is not broken; it does not need to be deleted, and it is likely to work well if it is stopped, started, and used in a subsequent provisioning attempt.

Example of such a launch failure:

Aug 03, 2021 3:04:39 PM FINEST: Instance build-game-linux-dynamic-uvsvqm is running and ready...
Aug 03, 2021 3:04:40 PM INFO: Launching instance: build-game-linux-dynamic-uvsvqm
Aug 03, 2021 3:04:40 PM INFO: bootstrap
Aug 03, 2021 3:04:40 PM INFO: Getting keypair...
Aug 03, 2021 3:04:40 PM INFO: Using autogenerated keypair
Aug 03, 2021 3:04:40 PM INFO: Authenticating as jenkins-ssh
Aug 03, 2021 3:04:40 PM INFO: Connecting to 34.78.153.62 on port 22, with timeout 10000.
Aug 03, 2021 3:04:43 PM INFO: Connected via SSH.
Aug 03, 2021 3:04:43 PM INFO: Verifying: /run/jenkins-agent-wrapper.sh -fullversion
bash: /run/jenkins-agent-wrapper.sh: No such file or directory
Aug 03, 2021 3:04:43 PM WARNING: Java is not installed at /run/jenkins-agent-wrapper.sh

If we could prevent sshd from launching automatically, we could sidestep this. We could certainly do so if we stopped using COS + cloud-init and created our own VMs from scratch instead.

Kalmalyzer commented 3 years ago

And sometimes (I have only seen this once in a week's worth of testing) it gets stuck like this:

Aug 09, 2021 2:39:48 PM Launching instance: build-game-linux-dynamic-hq2x1x
Aug 09, 2021 2:39:48 PM bootstrap
Aug 09, 2021 2:39:48 PM Getting keypair...
Aug 09, 2021 2:39:48 PM Using autogenerated keypair
Aug 09, 2021 2:39:48 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:39:48 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:39:55 PM Connected via SSH.
Aug 09, 2021 2:39:56 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:11 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:11 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:11 PM Connected via SSH.
Aug 09, 2021 2:40:11 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:26 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:26 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:26 PM Connected via SSH.
Aug 09, 2021 2:40:26 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:41 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:42 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:42 PM Connected via SSH.
Aug 09, 2021 2:40:42 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:57 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:57 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:57 PM Connected via SSH.
Aug 09, 2021 2:40:57 PM Authentication failed. Trying again...
Aug 09, 2021 2:41:12 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:41:12 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:41:12 PM Connected via SSH.
Aug 09, 2021 2:41:12 PM Authentication failed. Trying again...
Aug 09, 2021 2:41:27 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:41:28 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:41:28 PM Connected via SSH.
Aug 09, 2021 2:41:28 PM Authentication failed. Trying again...

My best guess is that the SSH agent needs to be restarted to pick up the auth key, or something similar. In this case, it tries for 300 seconds, then reports an error, then continues trying, possibly getting stuck forever.
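
One way to spot this stuck state from the launch log is to measure how long the authentication failures have been going on. The sketch below is a hypothetical diagnostic helper, not part of Jenkins or the GCE plugin; it assumes the timestamp layout shown in the log above (without a log-level prefix) and an English locale for month names.

```python
from datetime import datetime
from typing import Iterable, Tuple

TIMESTAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"  # e.g. "Aug 09, 2021 2:39:56 PM"
FAILURE_MARKER = "Authentication failed. Trying again..."


def auth_failure_streak(lines: Iterable[str]) -> Tuple[int, float]:
    """Return (number of auth failures, seconds from first to last failure)."""
    stamps = []
    for raw in lines:
        line = raw.strip()
        if not line.endswith(FAILURE_MARKER):
            continue
        # Everything before the marker is the timestamp.
        stamp_text = line[: -len(FAILURE_MARKER)].strip()
        stamps.append(datetime.strptime(stamp_text, TIMESTAMP_FORMAT))
    if not stamps:
        return (0, 0.0)
    return (len(stamps), (stamps[-1] - stamps[0]).total_seconds())


# A few lines taken from the log excerpt above.
SAMPLE = [
    "Aug 09, 2021 2:39:55 PM Connected via SSH.",
    "Aug 09, 2021 2:39:56 PM Authentication failed. Trying again...",
    "Aug 09, 2021 2:40:11 PM Authentication failed. Trying again...",
    "Aug 09, 2021 2:41:28 PM Authentication failed. Trying again...",
]

print(auth_failure_streak(SAMPLE))  # (3, 92.0)
```

A watchdog could treat a streak longer than the 300 seconds mentioned above as a sign that the launch is stuck, though that threshold is an assumption and not something the plugin exposes.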

Kalmalyzer commented 3 years ago

Preventing sshd from launching is a really blunt fix and will make any failure scenarios (things going wrong during bootup) really hard to debug. We should look for something that prevents just one account from authenticating, or something that pauses the initial flow for that account until everything is in place.
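
The "pause the initial flow" idea could look roughly like the sketch below: a gate that blocks until a sentinel file shows up. This is a sketch under assumptions, not something that exists in the repo. The sentinel path is hypothetical, it assumes the cloud-init script would create that file as its very last step, and how the gate would be attached to the Jenkins account only (for example via an sshd `ForceCommand` for that user) is left open.

```python
import time
from pathlib import Path
from typing import Callable


def wait_until_ready(
    sentinel: Path,
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until `sentinel` exists; return False if `timeout_s` elapses first.

    `clock` and `sleep` are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if sentinel.exists():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Hypothetical usage in a login wrapper for the Jenkins account:
#   if not wait_until_ready(Path("/run/jenkins-agent-ready")):
#       raise SystemExit("agent initialization did not finish in time")
```

Other accounts would still be able to log in during bootup, so failures in the init script stay debuggable.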

Kalmalyzer commented 3 years ago

These problems have been solved in the non-Docker Linux VMs (the ones based on Debian, built with Packer, not using COS + cloud-init).

The best way to sort this out would probably be to move over to Debian/Packer images for the Docker Linux VMs as well. However, we'd need to find a solution to the UID problems then.

Kalmalyzer commented 3 years ago

The Docker (non-Kubernetes) agents have been retired. They didn't add much and were generally more troublesome when something went wrong.

Closing this as won't fix.

falldamagestudio / UE-Jenkins-BuildSystem

Dynamic Docker Linux agents sometimes fail to launch #40