Closed (omus closed this issue 3 years ago)
One possible solution to this problem would be to use `kubectl logs` and wait for the `julia_worker:<port>#<ip>` message. This would also no longer require a port to be specified, as the manager could just read it from the worker logs.
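A rough sketch of what reading the connection info from the worker logs could look like. The function names `parse_worker_line` and `wait_for_worker` are hypothetical and not part of K8sClusterManagers.jl; the only thing taken from the discussion is the `julia_worker:<port>#<ip>` message format.

```julia
# Hypothetical helper: extract (ip, port) from a single log line.
# Returns `nothing` when the line is not the worker announcement.
function parse_worker_line(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(\S+)", line)
    m === nothing && return nothing
    return (String(m.captures[2]), parse(Int, m.captures[1]))
end

# Hypothetical helper: scan a log stream until the announcement appears.
# Returns `nothing` if the stream ends without it.
function wait_for_worker(io::IO)
    for line in eachline(io)
        info = parse_worker_line(line)
        info !== nothing && return info
    end
    return nothing
end
```

In practice the `io` argument could be the output of a streamed `kubectl logs -f <pod>` process, so the manager blocks until the worker announces itself instead of sleeping for a fixed duration. A real implementation would probably also want a timeout around the read.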
Another CI example: https://github.com/beacon-biosignals/K8sClusterManagers.jl/runs/2442606710
Another one: https://github.com/beacon-biosignals/K8sClusterManagers.jl/runs/2442682308. I'm increasing the wait from 2 seconds to 4 seconds in #47 to try to work around the problem for now.
Another viable option would be to use: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate
Should be fixed by #57 as the cluster manager will now wait for the output from the worker before trying to connect.
I originally noticed this error as part of another PR (https://github.com/beacon-biosignals/K8sClusterManagers.jl/pull/44#issuecomment-826976309) but have also observed this failure when removing this `sleep` call. I believe what is happening is that we wait for the pod to be running but then attempt to connect to it before the worker actually starts listening. Since the original failure shown above occurred before this `sleep(2)` call was removed, there is probably some variability in how long it takes Julia to start listening.
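To illustrate the race, here is a minimal sketch of polling a TCP port until something is listening, rather than sleeping for a fixed time. `wait_until_listening` and its keyword arguments are hypothetical names used only for illustration. It is a generic probe, and whether opening and closing a throwaway connection is safe against a real Julia worker is an open question that this thread does not settle.

```julia
using Sockets

# Hypothetical helper: retry connecting until the port accepts a connection
# or the timeout (in seconds) elapses. Returns true on success, false on timeout.
function wait_until_listening(host, port::Integer; timeout::Real=30, interval::Real=0.5)
    deadline = time() + timeout
    while time() < deadline
        try
            close(connect(host, port))  # connection accepted: something is listening
            return true
        catch e
            # A refused connection surfaces as an IOError; anything else is unexpected.
            e isa Base.IOError || rethrow()
            sleep(interval)
        end
    end
    return false
end
```

A pod in the running state only tells us the container has started, so a check like this (or waiting on the worker's log output) closes the gap between the pod running and Julia actually listening.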