Closed — robert-sanfeliu closed this issue 2 weeks ago
I did another deployment using the same conditions and now it worked. The issue might be related to the 10-second timeout of the command:
procCommand.waitForOrKill(1000)
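If the timeout really is the culprit, one mitigation is to retry the command a few times rather than depending on a single fixed wait-or-kill. A minimal shell sketch; the wrapper name, retry count, and per-attempt timeout are assumptions for illustration, not SAL's actual code:

```shell
# Hypothetical retry wrapper: run a command under a per-attempt timeout,
# retrying a few times instead of relying on one fixed wait-or-kill.
run_with_retries() {
    cmd="$1"
    attempts="${2:-3}"          # assumed retry count
    per_try_timeout="${3:-10}"  # assumed per-attempt timeout, in seconds
    i=1
    while [ "$i" -le "$attempts" ]; do
        # coreutils `timeout` kills the command if it exceeds the limit
        if out=$(timeout "$per_try_timeout" sh -c "$cmd"); then
            printf '%s\n' "$out"
            return 0
        fi
        echo "attempt $i of $attempts failed, retrying..." >&2
        i=$((i + 1))
    done
    return 1
}
```

For example, `run_with_retries "echo ok"` prints `ok` on the first attempt, while a command that keeps failing makes the wrapper return a non-zero status after the last attempt.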
The only case that I can say could be related to SAL/ProActive in the scenario you have sent is that the worker which failed did so because the master was not yet deployed. Was this the case?
Can you please point out the successful execution (master and workers' ProActive job IDs)?
What is the getKubeToken method and how is it related to this scenario?
The only case that I can say could be related to SAL/ProActive in the scenario you have sent is that the worker which failed did so because the master was not yet deployed. Was this the case?
No, as far as I know; please check the logs to confirm.

Can you please point out the successful execution (master and workers' ProActive job IDs)?
The job IDs of both the successful and the faulty workers are already provided in the original bug report. The master job ID is 2261.

What is the getKubeToken method and how is it related to this scenario?
getKubeToken is a method defined in the wait_for_master step of the worker deployment flow.
And I assume it was provided by Ali.
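The thread does not show the body of getKubeToken, but the usual way to obtain such a join command on the master is `kubeadm token create --print-join-command`. A sketch of capturing it in shell, with `kubeadm` stubbed out (the address, token, and hash in the stub are placeholders) so the snippet runs without a cluster:

```shell
# Stub standing in for the real kubeadm binary; the real
# `kubeadm token create --print-join-command` prints a full join line.
# The address, token, and hash below are placeholders, not real values.
kubeadm() {
    echo "kubeadm join 10.0.0.1:6443 --token abcdef.0123456789abcdef --discovery-token-ca-cert-hash sha256:placeholder"
}

# Capture the join command, roughly what getKubeToken presumably does
join_cmd=$(kubeadm token create --print-join-command)

# A worker node would later execute the captured line, e.g.:
#   sudo $join_cmd
printf '%s\n' "$join_cmd"
```

If the capture step produces no output, the worker ends up with an empty join command, which matches the faulty-node symptom described below.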
I was asking for the IDs of the same successful execution; I saw the failing ones. However, nothing here actually makes much sense, as wait_for_master was executed successfully. Also, I checked that the execution of the failing task started AFTER the master was deployed, and this is fine (the master was deployed at 13:40 and the failing task started at 13:48).
So changing this script will not solve the issue you had.
The problem is in the task where you use the START script for the worker node, and you can see that it was this script that failed:
So the kubeadm join command is executed correctly in your tasks. I will close this ticket.
Just for the record, this is the execution of the 'wait_for_master' task for the failing job:
The worker start script executes a Kubernetes join command. This command is built in the "getKubeToken" method of the wait_for_master step of the worker deployment flow.

During the deployment of a cluster with one master and two workers on nebulous-cd, the script worked OK for one of the workers but failed for the other one. All machines (workers and master) are AWS m4.xlarge:
Logs for the successful node (job ID 2263):
See the line
out kubeadm join ....
which corresponds to a successful execution of the command and shows the command that will later be executed.

Logs for the faulty node (job ID 2262):
The
out
on this occasion is not showing anything. This will later cause the command
sudo $variables_kubeCommand
to fail.
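Since an empty `out` eventually means running `sudo` with an empty argument, a defensive check before executing the stored command would make the failure obvious at the point where the command is missing. A sketch (the `run_join` helper is hypothetical; `$variables_kubeCommand` is the variable named in the logs above):

```shell
# Fail fast when the join command was never captured, instead of
# letting `sudo <empty>` produce a confusing error later on.
run_join() {
    kube_command="$1"   # would be $variables_kubeCommand in the task script
    if [ -z "$kube_command" ]; then
        echo "ERROR: join command is empty; wait_for_master produced no output" >&2
        return 1
    fi
    # In the real task this line would be: sudo $kube_command
    printf 'would run: sudo %s\n' "$kube_command"
}
```

With this guard, the faulty node would fail in the START script with a clear "join command is empty" error rather than a cryptic `sudo` failure.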