Closed Kalmalyzer closed 3 years ago
And sometimes (I have only seen this once in a week's worth of testing) it gets stuck like this:
Aug 09, 2021 2:39:48 PM Launching instance: build-game-linux-dynamic-hq2x1x
Aug 09, 2021 2:39:48 PM bootstrap
Aug 09, 2021 2:39:48 PM Getting keypair...
Aug 09, 2021 2:39:48 PM Using autogenerated keypair
Aug 09, 2021 2:39:48 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:39:48 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:39:55 PM Connected via SSH.
Aug 09, 2021 2:39:56 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:11 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:11 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:11 PM Connected via SSH.
Aug 09, 2021 2:40:11 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:26 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:26 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:26 PM Connected via SSH.
Aug 09, 2021 2:40:26 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:41 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:42 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:42 PM Connected via SSH.
Aug 09, 2021 2:40:42 PM Authentication failed. Trying again...
Aug 09, 2021 2:40:57 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:40:57 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:40:57 PM Connected via SSH.
Aug 09, 2021 2:40:57 PM Authentication failed. Trying again...
Aug 09, 2021 2:41:12 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:41:12 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:41:12 PM Connected via SSH.
Aug 09, 2021 2:41:12 PM Authentication failed. Trying again...
Aug 09, 2021 2:41:27 PM Authenticating as jenkins-ssh
Aug 09, 2021 2:41:28 PM Connecting to 35.195.75.131 on port 22, with timeout 10000.
Aug 09, 2021 2:41:28 PM Connected via SSH.
Aug 09, 2021 2:41:28 PM Authentication failed. Trying again...
My best guess is that the SSH agent needs to be restarted to pick up the auth key, or something along those lines. In this case, it tries for 300 seconds, reports an error, and then continues trying, possibly getting stuck forever.
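For comparison, here is a minimal sketch of what a bounded retry could look like, so that a launch attempt gives up for good instead of looping forever. This is not the GCE plugin's actual code; `retry_with_deadline` and the `attempt` callable are hypothetical names, and the 15 second interval and 300 second limit are just taken from the log and the observation above.

```python
import time
from typing import Callable


def retry_with_deadline(
    attempt: Callable[[], bool],
    deadline_s: float = 300.0,   # overall limit, taken from the comment above
    interval_s: float = 15.0,    # spacing between attempts seen in the log
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call attempt() until it succeeds or the deadline would be exceeded.

    Returns True on success and False once no further attempt fits
    inside the deadline. It never retries past the deadline.
    """
    start = clock()
    while True:
        if attempt():
            return True
        elapsed = clock() - start
        # Give up if the next attempt would start at or after the deadline.
        if elapsed + interval_s >= deadline_s:
            return False
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock; a caller would normally pass just `attempt`, for example a function that tries one SSH authentication and returns whether it succeeded.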
Preventing sshd from launching is really blunt and will make any failure scenarios (stuff going wrong during bootup) really hard to debug. We should look for something that just prevents one account from auth'ing, or something that pauses the initial flow for that account until everything is in place.
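One way to "pause the initial flow" could be a readiness gate: the account's first command blocks until the init script has written a marker file as its very last step. The sketch below only shows the waiting part. The function name and the marker path in the usage note are hypothetical, and how the gate would be wired into the account's login is left open.

```python
import os
import time
from typing import Callable


def wait_for_marker(
    marker_path: str,
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
    exists: Callable[[str], bool] = os.path.exists,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until marker_path exists or timeout_s has elapsed.

    Returns True if the marker showed up in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while not exists(marker_path):
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True
```

A wrapper for the Jenkins account could call something like `wait_for_marker("/run/agent-ready")` (a made-up path) before handing control to the real command, and exit with an error if it returns False. sshd itself stays up the whole time, so other accounts can still log in to debug a boot that went wrong.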
These problems have been solved in the non-Docker Linux VMs (the ones based on Debian, built with Packer, not using COS + cloud-init).
The best way to sort this out would probably be to move over to Debian/Packer images for the Docker Linux VMs as well. However, we'd need to find a solution to the UID problems then.
The Docker (non-Kubernetes) agents have been retired. They didn't add much and were generally more troublesome when something went wrong.
Closing this as won't fix.
There is a race condition in the Linux agent VMs:
- sshd is set to start automatically in the VM image itself
- cloud-init stops sshd at the beginning, then does its initialization, and then starts sshd again
- If A) [...] sshd is running, at the start of cloud-init, or B) if cloud-init's stop of sshd happens before sshd autostarts, then sshd remains running throughout the rest of the cloud-init script
- If sshd happens to be enabled inadvertently, and Jenkins happens to successfully connect & log in via SSH before the cloud-init script has completed, then the next Jenkins step is to check the Java version; this will fail, because cloud-init hasn't placed the appropriate files into the filesystem yet
When the above happens, Jenkins considers the agent to have failed launching. Jenkins will delete the agent shortly thereafter, the GCE plugin stops the VM (or deletes it if there are too many VMs of that type already), and the GCE plugin triggers provisioning again.
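To make the ordering problem concrete, here is a toy model of case B. It replays the two events (the image's sshd autostart and cloud-init's stop of sshd) in time order and reports whether sshd ends up running while cloud-init is still initializing. The function name and timestamps are made up for illustration, and this is a simplification of the real boot sequence, not a description of how cloud-init or systemd actually schedule things.

```python
def sshd_running_during_init(autostart_at: float, stop_at: float) -> bool:
    """Replay the two events in time order and return the resulting sshd state.

    autostart_at: time at which the VM image's own sshd autostart fires.
    stop_at:      time at which cloud-init issues its stop of sshd.
    """
    events = sorted([(autostart_at, "autostart"), (stop_at, "cloud_init_stop")])
    running = False
    for _, name in events:
        if name == "autostart":
            running = True
        else:
            # Stopping a service that has not started yet is a no-op.
            running = False
    return running


# Intended order: sshd autostarts, then cloud-init stops it.
print(sshd_running_during_init(autostart_at=1.0, stop_at=2.0))  # False
# Racy order: the stop comes first, then the autostart wins.
print(sshd_running_during_init(autostart_at=2.0, stop_at=1.0))  # True
```

In the second case nothing stops sshd again until the end of the script, which is the window in which Jenkins can log in too early.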
This happens in maybe 10% of launches. It delays job start by a minute or more. A VM that failed launching is not bad; it does not need to be deleted; it is likely to work well if it is stopped/started and used in a subsequent provisioning attempt.
Example of such a launch failure:
If we could prevent sshd from automatically launching, we could sidestep this. We could absolutely do this if we stopped using COS+cloud-init and created our own VMs from scratch instead.