shrinandj opened 1 year ago
Looks like the pod failed because the kubelet errored out with "No sandbox for pod":
```
2023-01-27T02:55:57.328Z {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:57.328640 4720 operation_generator.go:714] MountVolume.SetUp succeeded for volume \"kube-api-access-z2zxj\" (UniqueName: \"kubernetes.io/projected/03ec9a24-bfa3-4474-a02a-54abe2a74805-kube-api-access-z2zxj\") pod \"t-n9s26-cnmc4\" (UID: \"03ec9a24-bfa3-4474-a02a-54abe2a74805\")"}
2023-01-27T02:55:57.422Z {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:57.422174 4720 kuberuntime_manager.go:484] \"No sandbox for pod can be found. Need to start a new one\" pod=\"jobs-default/t-n9s26-cnmc4\""}
2023-01-27T02:55:58.101Z {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:58.101069 4720 generic.go:296] \"Generic (PLEG): container finished\" podID=32d059fe-7244-4fb4-9b1b-2cf3733f79ce containerID=\"a27b4dc602d3e5a25a0a7caf5d91ef23fbcb5df73400ece3262ecf374fe1f889\" exitCode=0"}
2023-01-27T02:55:58.101Z {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:58.101141 4720 kubelet.go:2145] \"SyncLoop (PLEG): event for pod\" pod=\"jobs-default/t-nqz8b-xw2zp\" event=&{ID:32d059fe-7244-4fb4-9b1b-2cf3733f79ce Type:ContainerDied Data:a27b4dc602d3e5a25a0a7caf5d91ef23fbcb5df73400ece3262ecf374fe1f889}"}
```
It's unclear whether there are ways for higher-level APIs to identify this condition and automatically retry. Maybe Metaflow could retry a pod if it failed with exit code None?
In this case it is guaranteed that the pod never started. So restarting it should be safe.
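To make the idea concrete, here is a minimal sketch (not Metaflow's actual logic) of how a higher-level tool could check for this condition using the Kubernetes Python client. The function name `pod_never_started` and its namespace/pod arguments are hypothetical; the idea is that if no container in the pod ever reached a terminated state, no exit code exists and no user code ran:

```python
# Minimal sketch, assuming the official Kubernetes Python client is installed.
# If no container in the pod ever produced a terminated state (and thus no exit
# code), the sandbox likely never came up and a retry should be safe.
from kubernetes import client, config

def pod_never_started(namespace: str, pod_name: str) -> bool:
    """Return True if no container in the pod ever ran to a terminated state."""
    config.load_kube_config()  # use config.load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for s in pod.status.container_statuses or []:
        # A terminated state in either the current or last state means the
        # container did start and exit, so an exit code exists.
        if s.state and s.state.terminated is not None:
            return False
        if s.last_state and s.last_state.terminated is not None:
            return False
    return True
```

If this returns True, re-submitting the pod should be idempotent, since no user code ever executed.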
On one of our clusters, a flow with 59 parallel steps was run `--with kubernetes`. 58 of those steps ran just fine, but one failed without any error message. The logs don't seem to have specific information.
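As a workaround until this condition is detected automatically, Metaflow's built-in `@retry` decorator re-runs a step on any failure, which would also cover a pod that never started. A minimal sketch (the flow and step bodies are placeholders):

```python
# Sketch of using Metaflow's @retry decorator to re-run a step on failure.
from metaflow import FlowSpec, step, retry

class MyFlow(FlowSpec):

    @retry(times=3)  # re-run the step up to 3 times if it fails
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()
```

Run as before with `python my_flow.py run --with kubernetes`. The downside is that this retries on every failure, not just the never-started case described above.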