Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

Task failed silently when running --with kubernetes #1249

Open shrinandj opened 1 year ago

shrinandj commented 1 year ago

On one of our clusters, a flow with 59 parallel steps was run with --with kubernetes. 58 of those ran just fine, but one of the steps failed without any error message.
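For context, the flow was shaped roughly like the sketch below. Only the analyzecreatures step name is from the actual run; the class name, fan-out data, and step bodies are illustrative.

```python
from metaflow import FlowSpec, step


class CreatureFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into many parallel tasks (59 in the run above).
        self.creatures = list(range(59))
        self.next(self.analyzecreatures, foreach="creatures")

    @step
    def analyzecreatures(self):
        self.result = self.input  # placeholder for the real analysis
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [i.result for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    # Launched with: python creature_flow.py run --with kubernetes
    CreatureFlow()
```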

The logs don't seem to have any specific information:

2023-01-26 18:55:59.342 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Downloading code package...
2023-01-26 18:56:00.676 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Code package downloaded.
2023-01-26 18:56:01.014 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Task is starting.
2023-01-26 18:56:20.526 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Task finished with exit code None.
shrinandj commented 1 year ago

Looks like the pod failed because the kubelet errored out with "No sandbox for pod":

2023-01-27T02:55:57.328Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:57.328640 4720 operation_generator.go:714] MountVolume.SetUp succeeded for volume \"kube-api-access-z2zxj\" (UniqueName: \"kubernetes.io/projected/03ec9a24-bfa3-4474-a02a-54abe2a74805-kube-api-access-z2zxj\") pod \"t-n9s26-cnmc4\" (UID: \"03ec9a24-bfa3-4474-a02a-54abe2a74805\")"}

2023-01-27T02:55:57.422Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:57.422174 4720 kuberuntime_manager.go:484] \"No sandbox for pod can be found. Need to start a new one\" pod=\"jobs-default/t-n9s26-cnmc4\""}

2023-01-27T02:55:58.101Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:58.101069 4720 generic.go:296] \"Generic (PLEG): container finished\" podID=32d059fe-7244-4fb4-9b1b-2cf3733f79ce containerID=\"a27b4dc602d3e5a25a0a7caf5d91ef23fbcb5df73400ece3262ecf374fe1f889\" exitCode=0"}

2023-01-27T02:55:58.101Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:58.101141 4720 kubelet.go:2145] \"SyncLoop (PLEG): event for pod\" pod=\"jobs-default/t-nqz8b-xw2zp\" event=&{ID:32d059fe-7244-4fb4-9b1b-2cf3733f79ce Type:ContainerDied Data:a27b4dc602d3e5a25a0a7caf5d91ef23fbcb5df73400ece3262ecf374fe1f889}"}
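For anyone hitting this, one way to confirm the failure mode is to query the pod with the standard Kubernetes Python client. This is only a sketch: it assumes the pod object (namespace jobs-default, pod t-n9s26-cnmc4, both taken from the logs above) still exists in the cluster and that its events haven't aged out.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace, pod_name = "jobs-default", "t-n9s26-cnmc4"

pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
print("phase:", pod.status.phase)

# If the sandbox never came up, there is typically no terminated container
# status carrying an exit code, which is why Metaflow reported "exit code None".
for cs in pod.status.container_statuses or []:
    print(cs.name, cs.state)

# Pod events sometimes carry sandbox-related reasons (e.g. FailedCreatePodSandBox).
events = v1.list_namespaced_event(
    namespace=namespace,
    field_selector=f"involvedObject.name={pod_name}",
)
for ev in events.items:
    print(ev.reason, ev.message)
```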

It's unclear if there are ways for higher-level APIs to identify this condition and automatically retry. Maybe Metaflow could retry a pod if it failed with exit code None?

In this case it is guaranteed that the pod never started, so restarting it should be safe.
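As a stopgap, the retry idea can be approximated at the user level with Metaflow's @retry decorator; the flow below is just an illustrative sketch, and note that @retry also re-runs tasks that fail for real reasons, not only this sandbox condition.

```python
from metaflow import FlowSpec, step, retry


class RetryDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.analyzecreatures)

    # Reschedule the task up to 3 more times if its pod dies before
    # reporting an exit code.
    @retry(times=3)
    @step
    def analyzecreatures(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RetryDemoFlow()
```

The same effect can also be applied to every step at run time with --with retry, e.g. python flow.py run --with retry --with kubernetes.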