If I change Jenkins' pod retention to Always, more and more pods (each with an on-demand-created node) pile up. The first pod claims the persistent volume mount; subsequent pods wait on the volume mount and therefore never start their containers.
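For context, this is roughly the shape of the pipeline-level pod template involved - a minimal sketch, not my actual configuration; the label, claim name, and mount path are placeholders, and exact parameter spellings may vary by plugin version:

```groovy
// Sketch: pod retention set to Always, plus a shared persistent volume claim.
// Only one pod can hold the PVC mount, so retained pods block later ones.
podTemplate(
    label: 'ue-win64',                   // placeholder label
    podRetention: always(),              // "pod retention: Always"
    volumes: [
        persistentVolumeClaim(
            mountPath: 'C:\\J',          // placeholder mount path
            claimName: 'win64-cache',    // placeholder PVC, claimed by the first pod
            readOnly: false
        )
    ]
) {
    node('ue-win64') {
        // build steps
    }
}
```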
It looks like the termination is initiated from the controller side, and it happens about 6 minutes (360 seconds) after provisioning has completed.
This is the log from the controller, starting at 6:32:
Mar 22, 2021 6:32:00 AM INFO hudson.slaves.NodeProvisioner lambda$update$6
ue.win64-4-9wsxh-q1df2-321t5 provisioning successfully completed. We have now 2 computer(s)
Mar 22, 2021 6:32:01 AM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Created Pod: kubernetes default/ue.win64-4-9wsxh-q1df2-321t5
Mar 22, 2021 6:38:06 AM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
Terminating Kubernetes instance for agent ue.win64-4-9wsxh-q1df2-321t5
...
Notice the termination event ~6 minutes after provisioning completed.
This is the log from the first pod in such a scenario:
Mar 22, 2021 6:50:27 AM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: ue.win64-4-9wsxh-q1df2-321t5
Mar 22, 2021 6:50:27 AM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Mar 22, 2021 6:50:27 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.6
Mar 22, 2021 6:50:27 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using C:\J\remoting as a remoting work directory
Mar 22, 2021 6:50:27 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to C:\J\remoting
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins-controller:8080/]
Mar 22, 2021 6:50:28 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Mar 22, 2021 6:50:28 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: jenkins-controller-agent
Agent port: 50000
Identity: e4:50:fa:60:19:bb:89:6f:c8:ca:d8:bf:e3:92:ce:94
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-controller-agent:50000
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: e4:50:fa:60:19:bb:89:6f:c8:ca:d8:bf:e3:92:ce:94
Mar 22, 2021 6:50:28 AM org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
INFO: [JNLP4-connect connection to jenkins-controller-agent/10.28.12.64:50000] Local headers refused by remote: Unknown client name: ue.win64-4-9wsxh-q1df2-321t5
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: ue.win64-4-9wsxh-q1df2-321t5
at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
at hudson.remoting.Engine.innerRun(Engine.java:739)
at hudson.remoting.Engine.run(Engine.java:514)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: ue.win64-4-9wsxh-q1df2-321t5
at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:377)
at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:432)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:817)
at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:170)
at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:817)
at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:119)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.nio.channels.ClosedChannelException
... 7 more
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: None of the protocols were accepted
java.lang.Exception: The server rejected the connection: None of the protocols were accepted
at hudson.remoting.Engine.onConnectionRejected(Engine.java:824)
at hudson.remoting.Engine.innerRun(Engine.java:764)
at hudson.remoting.Engine.run(Engine.java:514)
Here, the pod is unaware of the controller-side termination: the jnlp container starts up and attempts to handshake with the controller, at which point the controller no longer recognizes the agent's name.
Root cause + discussion at https://issues.jenkins.io/browse/JENKINS-65249
Adding `idleMinutes 30` to each Windows podTemplate sidesteps these problems. However, it probably also leaves the agent running for that much longer after a successful build (and, I presume, node scale-down only begins a certain time after the agent has been terminated).
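A minimal sketch of that workaround (placeholder label, the rest of the template omitted):

```groovy
// Sketch: keep the Windows agent pod alive for 30 minutes after it goes idle,
// so the same pod/node is reused instead of being re-provisioned per build.
podTemplate(
    label: 'ue-win64',      // placeholder label
    idleMinutes: 30         // retain the idle pod for 30 minutes before termination
) {
    node('ue-win64') {
        // build steps
    }
}
```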
Out of scope; won't fix, closing.
With `slaveConnectTimeout` raised to 1800 seconds (30 minutes) in the pod template, deploying a Windows job - which takes ~15 minutes to create a new node and pull 14 GB of layers - results in the pod being destroyed and recreated twice before a pod finally comes up successfully. These extra destroy/recreate cycles are probably harmless, but they add noise to the logs. Which timeout is being hit? It might be 7 minutes (420 seconds), it might be 400 seconds, something in that range.
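For reference, a minimal sketch of that pod template change (placeholder label; other values omitted):

```groovy
// Sketch: raise the agent connect timeout so that slow node provisioning plus
// a large image pull (~15 minutes) does not trip the connect timeout, which
// otherwise causes the pod to be destroyed and recreated.
podTemplate(
    label: 'ue-win64',              // placeholder label
    slaveConnectTimeout: 1800,      // seconds to wait for the agent to come online
    idleMinutes: 30
) {
    node('ue-win64') {
        // build steps
    }
}
```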