falldamagestudio / UE-Jenkins-BuildSystem

Build Unreal Engine & games with Jenkins on GKE/GCE

Jenkins kills/recreates pod every ~7 minutes during job startup #22

Closed by Kalmalyzer 3 years ago

Kalmalyzer commented 3 years ago

With slaveConnectTimeout raised to 1800 seconds (30 minutes) in the pod template, deploying a Windows job - which takes ~15 minutes to create a new node and pull 14 GB of image layers - results in the pod being destroyed and recreated twice before an agent pod finally starts successfully.

These extra destroy/recreate cycles are probably harmless, but they add noise to the logs. Which timeout is being hit? It seems to be around 7 minutes (420 seconds), or perhaps 400 seconds - something in that range.
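
For context, the timeout in question is set on the pod template itself. Below is a minimal sketch of the kind of podTemplate declaration involved; the label, image, and working directory are placeholders, not the actual templates from this repo:

```groovy
// Hypothetical sketch only: slaveConnectTimeout is the setting discussed above,
// everything else (label, image, paths) is a placeholder.
podTemplate(
    label: 'windows-build',        // placeholder agent label
    slaveConnectTimeout: 1800,     // allow up to 30 minutes for the agent to connect
    containers: [
        containerTemplate(
            name: 'jnlp',
            image: 'example.gcr.io/jenkins-agent-windows:latest', // ~14 GB of layers to pull
            workingDir: 'C:/J'
        )
    ]
) {
    node('windows-build') {
        // build steps go here
    }
}
```

Despite the 30-minute connect timeout, something else is still tearing the pod down after roughly 7 minutes.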

Kalmalyzer commented 3 years ago

If I change Jenkins' pod retention to Always, more and more pods (each on an on-demand node) pile up. The first pod claims the persistent volume mount; subsequent pods block waiting for the volume mount and therefore never start their containers.
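
For illustration, here is a sketch of how such a retention policy and shared volume might be declared; the claim name and mount path are assumptions, not taken from this repo:

```groovy
// Hypothetical sketch: with podRetention always(), finished pods stick around,
// and every new pod tries to mount the same persistent volume claim.
podTemplate(
    label: 'windows-build',
    podRetention: always(),            // keep pods after the agent disconnects
    volumes: [
        persistentVolumeClaim(
            claimName: 'build-workspace',  // placeholder claim name
            mountPath: 'C:/W',             // placeholder mount path
            readOnly: false
        )
    ]
) {
    node('windows-build') {
        // only the pod that currently holds the claim actually starts its containers
    }
}
```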

It looks like the termination is initiated from the controller side, and it appears to happen about 6 minutes (360 seconds) after provisioning has completed.

This is the log from the controller, starting at 6:32:

Mar 22, 2021 6:32:00 AM INFO hudson.slaves.NodeProvisioner lambda$update$6
ue.win64-4-9wsxh-q1df2-321t5 provisioning successfully completed. We have now 2 computer(s)
Mar 22, 2021 6:32:01 AM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Created Pod: kubernetes default/ue.win64-4-9wsxh-q1df2-321t5
Mar 22, 2021 6:38:06 AM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate
Terminating Kubernetes instance for agent ue.win64-4-9wsxh-q1df2-321t5
...

Notice the termination event ~6 minutes after provisioning completed.

This is the log from the first pod in such a scenario:

Mar 22, 2021 6:50:27 AM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: ue.win64-4-9wsxh-q1df2-321t5
Mar 22, 2021 6:50:27 AM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Mar 22, 2021 6:50:27 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.6
Mar 22, 2021 6:50:27 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using C:\J\remoting as a remoting work directory
Mar 22, 2021 6:50:27 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to C:\J\remoting
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins-controller:8080/]
Mar 22, 2021 6:50:28 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Mar 22, 2021 6:50:28 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
  Agent address: jenkins-controller-agent
  Agent port:    50000
  Identity:      e4:50:fa:60:19:bb:89:6f:c8:ca:d8:bf:e3:92:ce:94
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-controller-agent:50000
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: e4:50:fa:60:19:bb:89:6f:c8:ca:d8:bf:e3:92:ce:94
Mar 22, 2021 6:50:28 AM org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
INFO: [JNLP4-connect connection to jenkins-controller-agent/10.28.12.64:50000] Local headers refused by remote: Unknown client name: ue.win64-4-9wsxh-q1df2-321t5
Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: ue.win64-4-9wsxh-q1df2-321t5
        at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
        at hudson.remoting.Engine.innerRun(Engine.java:739)
        at hudson.remoting.Engine.run(Engine.java:514)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: ue.win64-4-9wsxh-q1df2-321t5
        at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:377)
        at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:432)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:817)
        at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:170)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:817)
        at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:119)
        at java.lang.Thread.run(Thread.java:748)
        Suppressed: java.nio.channels.ClosedChannelException
                ... 7 more

Mar 22, 2021 6:50:28 AM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: None of the protocols were accepted
java.lang.Exception: The server rejected the connection: None of the protocols were accepted
        at hudson.remoting.Engine.onConnectionRejected(Engine.java:824)
        at hudson.remoting.Engine.innerRun(Engine.java:764)
        at hudson.remoting.Engine.run(Engine.java:514)

Here, the pod remains unaware of the controller-side termination until the jnlp container starts up and handshakes with the controller, at which point the controller no longer recognizes the agent's name.

Kalmalyzer commented 3 years ago

Root cause + discussion at https://issues.jenkins.io/browse/JENKINS-65249

Kalmalyzer commented 3 years ago

Adding idleMinutes: 30 to each Windows podTemplate sidesteps these problems. However, it presumably also keeps the agent pod alive for that much longer after a successful build (and node scale-down only happens some time after the agent pod has been terminated).
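
As a sketch of that workaround (again with placeholder names), the relevant change is just the idleMinutes parameter on the template:

```groovy
// Hypothetical sketch of the workaround: keep the Windows agent pod alive for
// 30 minutes after it goes idle, so a slow first connection never races a teardown
// and subsequent builds can reuse the warm pod.
podTemplate(
    label: 'windows-build',
    idleMinutes: 30,               // retain the pod for 30 minutes after the last build
    slaveConnectTimeout: 1800
) {
    node('windows-build') {
        // build steps go here
    }
}
```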

Kalmalyzer commented 3 years ago

Out of scope; won't fix, closing.