Closed Kalmalyzer closed 3 years ago
False alarm; I was testing with maxNumInstancesToCreateStr
set to 2 at that time.
Taking that back: with max instances = 2, there was already 1 instance provisioned, queueing up a bunch of jobs resulted in 2 nodes being provisioned. Example snippet:
Aug 06, 2021 7:56:45 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud provision
Provisioning node from configs [com.google.jenkins.plugins.computeengine.InstanceConfiguration@528514ae] for excess workload of 3 units of label 'build-game-linux-git-dynamic'
Aug 06, 2021 7:56:45 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud availableNodeCapacity
Found capacity for 10 nodes in cloud GCE
Aug 06, 2021 7:56:45 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud logConfigAndInstanceResult
Cloud provider GCE - choosing configuration Build game Linux Dynamic VM but no existing instance for provisioning a new node
Aug 06, 2021 7:56:47 PM INFO com.google.jenkins.plugins.computeengine.InstanceConfiguration provision
Sent insert request for instance configuration [Build game Linux Dynamic VM]
Aug 06, 2021 7:56:47 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineComputerLauncher launch
Launch will wait 300000 for operation operation-1628279806229-5c8e96f71fdd8-8c96557e-9e509970 to complete...
Aug 06, 2021 7:56:47 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$5
Waiting 300000ms for node build-game-linux-dynamic-cp2fvm to connect
Aug 06, 2021 7:56:47 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud logConfigAndInstanceResult
Cloud provider GCE - choosing configuration Build game Linux Dynamic VM but no existing instance for provisioning a new node
Aug 06, 2021 7:56:49 PM INFO com.google.jenkins.plugins.computeengine.InstanceConfiguration provision
Sent insert request for instance configuration [Build game Linux Dynamic VM]
Aug 06, 2021 7:56:49 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineComputerLauncher launch
Launch will wait 300000 for operation operation-1628279807986-5c8e96f8ccc68-e1d31bf2-1d75db5f to complete...
Aug 06, 2021 7:56:49 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$5
Waiting 300000ms for node build-game-linux-dynamic-jz4ph5 to connect
Aug 06, 2021 7:56:49 PM WARNING com.google.jenkins.plugins.computeengine.ComputeEngineCloud logConfigAndInstanceResult
Could not provision new nodes to meet demand. All instance configurations for cloud provider GCE have reached their max num instances
Aug 06, 2021 7:56:58 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching instance: build-game-linux-dynamic-cp2fvm
Aug 06, 2021 7:56:58 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
bootstrap
Aug 06, 2021 7:56:58 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Getting keypair...
Aug 06, 2021 7:56:58 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Using autogenerated keypair
Aug 06, 2021 7:56:58 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Authenticating as jenkins-ssh
Aug 06, 2021 7:56:58 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connecting to 34.77.61.112 on port 22, with timeout 10000.
Aug 06, 2021 7:56:59 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching instance: build-game-linux-dynamic-jz4ph5
Aug 06, 2021 7:56:59 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
bootstrap
Aug 06, 2021 7:56:59 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Getting keypair...
Aug 06, 2021 7:56:59 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Using autogenerated keypair
Aug 06, 2021 7:56:59 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Authenticating as jenkins-ssh
Aug 06, 2021 7:57:00 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connecting to 34.140.24.212 on port 22, with timeout 10000.
Aug 06, 2021 7:57:14 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connected via SSH.
Aug 06, 2021 7:57:14 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Verifying: /run/jenkins-agent-wrapper.sh -fullversion
Aug 06, 2021 7:57:14 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Copying agent.jar to: /mnt/disks/persistent-disk
Aug 06, 2021 7:57:14 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching Jenkins agent via plugin SSH: /run/jenkins-agent-wrapper.sh -jar /mnt/disks/persistent-disk/agent.jar
Aug 06, 2021 7:57:15 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connected via SSH.
Aug 06, 2021 7:57:15 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Verifying: /run/jenkins-agent-wrapper.sh -fullversion
Aug 06, 2021 7:57:15 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Copying agent.jar to: /mnt/disks/persistent-disk
Aug 06, 2021 7:57:15 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching Jenkins agent via plugin SSH: /run/jenkins-agent-wrapper.sh -jar /mnt/disks/persistent-disk/agent.jar
Aug 06, 2021 7:57:43 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$5
56169ms elapsed waiting for node build-game-linux-dynamic-cp2fvm to connect
Aug 06, 2021 7:57:47 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$5
58281ms elapsed waiting for node build-game-linux-dynamic-jz4ph5 to connect
My take on this is that there is a gap between requesting provisioning of an instance, and said instance showing up in the list returned by the GCP API. Extra provisioning checks during this time interval don't take the node creation that is in-flight into account. We should ideally cover those gaps somehow.
Throwing lots of build jobs at GCE, when having a max instance count of 1, and persist count of 1, will sometimes result in 2 instances being up at the same time. This should not happen; it should not provision that second instance in such a situation.
Later on, when it's time to deprovision agents, then excess instances get deleted - so the extra agents live only temporarily.
The main problem this causes is that even if we set "max 1" or "max 2" for a single group, then there will regularly be new instances provisioned. These haven't cached the Docker images yet. These will have long startup times for their first jobs. This means that the Jenkins system will never reach a 'steady state' where all its dynamic nodes have the Docker images cached and job launch times become predictably low.