eu-nebulous / sal-scripts

Mozilla Public License 2.0

Failed to build "kubeadm join" command #14

Closed: robert-sanfeliu closed this issue 2 weeks ago

robert-sanfeliu commented 2 weeks ago

The worker start script executes a Kubernetes join command:

#!/bin/bash
echo "Worker start script"
sudo kubeadm reset --force
echo $variables_kubeCommand
sudo $variables_kubeCommand

This command is built in the "getKubeToken" method of the wait_for_master step in the worker deployment flow:

def getKubeToken(){
    def soutToken = new StringBuilder(), serrToken = new StringBuilder()
    // def proc = "kubeadm token create \"$(kubeadm token generate)\" --print-join-command --ttl=1h >  /tmp/join_call.txt".execute()
    def procToken = "kubeadm token generate".execute()
    procToken.consumeProcessOutput(soutToken, serrToken)
    procToken.waitForOrKill(1000)
    println "out> ${soutToken}\nerr> ${serrToken}"

    def soutCommand = new StringBuilder(), serrCommand = new StringBuilder()
    def procCommand = "kubeadm token create ${soutToken} --print-join-command --ttl=1h".execute()
    procCommand.consumeProcessOutput(soutCommand, serrCommand)
    procCommand.waitForOrKill(1000)
    println "out> ${soutCommand}\nerr> ${serrCommand}"
    variables.put("kubeCommand", soutCommand)
}

During the deployment of a cluster with 1 master and two workers on nebulous-cd, the script worked OK for one of the workers but failed for the other. All machines (workers and master) are AWS m4.xlarge.

Logs for the successful node (job ID 2263):

[2263t0@proactive-server;acquireAWSNode_n13303-1-dummy-app-worker-1-1-13303-1_Task_0;11:33:55] (3/3) Logging out ... ... OK !
[2263t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:45] out> 8ey5tj.mygidax9tx5ysobg
[2263t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:45] err>
[2263t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:45] out> kubeadm join 192.168.55.1:6443 --token 8ey5tj.mygidax9tx5ysobg --discovery-token-ca-cert-hash  sha256:b70e557a5e8ede9a4444e004c955c632e9daecb58237790547c2632d697e88b5
[2263t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:45] err>
[2263t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:45] Address was retrieved from checkip.amazonaws.com
[2263t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:45] Public IP: 54.210.248.245
[2263t2@ip-172-31-40-14.ec2.internal;prepareInfra_n13303-1-dummy-app-worker-1-1-13303-1_Task_0;11:42:17] Exited the while loop, time spent: 0

Note the `out> kubeadm join ...` line, which corresponds to a successful execution of the command and shows the join command that will later be executed.

Logs for the faulty node (job ID 2262):

[2262t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:09] out> xlwycq.1ty77pou0lnpj28t
[2262t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:09] err>
[2262t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:10] out>
[2262t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:10] err>
[2262t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:10] Address was retrieved from checkip.amazonaws.com
[2262t1@ip-172-31-38-57.ec2.internal;wait_for_master;11:41:10] Public IP: 54.210.248.245

The `out>` line in this case shows nothing. This will later cause the command `sudo $variables_kubeCommand` to fail.

robert-sanfeliu commented 2 weeks ago

I did another deployment under the same conditions and this time it worked. The issue might be related to the 1-second timeout after which the command is killed:

procCommand.waitForOrKill(1000)
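If the 1-second kill window is indeed the culprit, one possible fix (a sketch only, assuming the standard Groovy `Process` API and the ProActive-injected `variables` map) would be to extend the timeout, trim the captured output, and fail loudly when the join command comes back empty:

```groovy
def getKubeToken() {
    def soutToken = new StringBuilder(), serrToken = new StringBuilder()
    def procToken = "kubeadm token generate".execute()
    procToken.consumeProcessOutput(soutToken, serrToken)
    // Give kubeadm up to 30 s instead of killing it after 1 s.
    procToken.waitForOrKill(30000)
    def token = soutToken.toString().trim()   // strip the trailing newline

    def soutCommand = new StringBuilder(), serrCommand = new StringBuilder()
    def procCommand = "kubeadm token create ${token} --print-join-command --ttl=1h".execute()
    procCommand.consumeProcessOutput(soutCommand, serrCommand)
    procCommand.waitForOrKill(30000)
    def joinCommand = soutCommand.toString().trim()

    // Surface the failure here instead of letting an empty command
    // propagate to the worker start script.
    if (!joinCommand) {
        throw new IllegalStateException("kubeadm join command is empty: ${serrCommand}")
    }
    // 'variables' is the ProActive task variable map (assumed from context).
    variables.put("kubeCommand", joinCommand)
}
```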
ankicabarisic commented 2 weeks ago

The only case I can think of that could be related to SAL/ProActive in the scenario you have sent is that the worker that failed did so because the master was not yet deployed. Was this the case?

Can you please point out the successful execution (master's and workers' ProActive job IDs)?

What is the getKubeToken method and how is it related to this scenario?

robert-sanfeliu commented 2 weeks ago

> The only case I can think of that could be related to SAL/ProActive in the scenario you have sent is that the worker that failed did so because the master was not yet deployed. Was this the case?

No, as far as I know. Please check the logs to confirm.

> Can you please point out the successful execution (master's and workers' ProActive job IDs)?

The workers' job IDs (both successful and faulty) are already provided in the original bug report. The master's job ID is 2261.

> What is the getKubeToken method and how is it related to this scenario?

getKubeToken is a method defined in the wait_for_master step of the worker deployment flow:

[screenshot]

I assume it was provided by Ali.

ankicabarisic commented 2 weeks ago

I was asking for the IDs of the same successful execution; I had already seen the failing ones. However, nothing here actually makes much sense, as wait_for_master was executed successfully. Also, I checked that the execution of the failing task started AFTER the master was deployed, which is fine (the master was deployed at 13:40 and the failing task started at 13:48).

So changing this script will not solve the issue you had.

The problem is in the task where you use the START script for the worker node, and you can see that it was this script that failed: [screenshot]

So the kubeadm join command is executed correctly in your tasks. I will close this ticket.

ankicabarisic commented 2 weeks ago

Just for the record, this is the execution of the wait_for_master task for the failing job: [screenshot]