Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

Task failed silently when running --with kubernetes #1249

Open shrinandj opened 1 year ago

shrinandj commented 1 year ago

On one of our clusters, a flow with 59 parallel steps was run with --with kubernetes. 58 of those ran just fine, but one of the steps failed without any error message.
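For context, the flow was shaped roughly like the sketch below. Only the analyzecreatures step name is from the actual run; the class name, fan-out data, and step bodies are illustrative.

```python
from metaflow import FlowSpec, step


class CreatureFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into many parallel tasks (59 in the run above).
        self.creatures = list(range(59))
        self.next(self.analyzecreatures, foreach="creatures")

    @step
    def analyzecreatures(self):
        self.result = self.input  # placeholder for the real analysis
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [i.result for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    # Launched with: python creature_flow.py run --with kubernetes
    CreatureFlow()
```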

The logs don't seem to have any specific information:

2023-01-26 18:55:59.342 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Downloading code package...
2023-01-26 18:56:00.676 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Code package downloaded.
2023-01-26 18:56:01.014 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Task is starting.
2023-01-26 18:56:20.526 [345/analyzecreatures/9146 (pid 72172)] [pod t-n9s26-cnmc4] Task finished with exit code None.
shrinandj commented 1 year ago

Looks like the pod failed because the kubelet errored out with "No sandbox for pod":

2023-01-27T02:55:57.328Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:57.328640 4720 operation_generator.go:714] MountVolume.SetUp succeeded for volume \"kube-api-access-z2zxj\" (UniqueName: \"kubernetes.io/projected/03ec9a24-bfa3-4474-a02a-54abe2a74805-kube-api-access-z2zxj\") pod \"t-n9s26-cnmc4\" (UID: \"03ec9a24-bfa3-4474-a02a-54abe2a74805\")"}

2023-01-27T02:55:57.422Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:57.422174 4720 kuberuntime_manager.go:484] \"No sandbox for pod can be found. Need to start a new one\" pod=\"jobs-default/t-n9s26-cnmc4\""}

2023-01-27T02:55:58.101Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:58.101069 4720 generic.go:296] \"Generic (PLEG): container finished\" podID=32d059fe-7244-4fb4-9b1b-2cf3733f79ce containerID=\"a27b4dc602d3e5a25a0a7caf5d91ef23fbcb5df73400ece3262ecf374fe1f889\" exitCode=0"}

2023-01-27T02:55:58.101Z    {"hostname":"ip-10-10-17-70.us-west-2.compute.internal","systemd_unit":"kubelet.service","message":"I0127 02:55:58.101141 4720 kubelet.go:2145] \"SyncLoop (PLEG): event for pod\" pod=\"jobs-default/t-nqz8b-xw2zp\" event=&{ID:32d059fe-7244-4fb4-9b1b-2cf3733f79ce Type:ContainerDied Data:a27b4dc602d3e5a25a0a7caf5d91ef23fbcb5df73400ece3262ecf374fe1f889}"}
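For anyone hitting this, one way to confirm the failure mode is to query the pod with the standard Kubernetes Python client. This is only a sketch: it assumes the pod object (namespace jobs-default, pod t-n9s26-cnmc4, both taken from the logs above) still exists in the cluster and that its events haven't aged out.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace, pod_name = "jobs-default", "t-n9s26-cnmc4"

pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
print("phase:", pod.status.phase)

# If the sandbox never came up, there is typically no terminated container
# status carrying an exit code, which is why Metaflow reported "exit code None".
for cs in pod.status.container_statuses or []:
    print(cs.name, cs.state)

# Pod events sometimes carry sandbox-related reasons (e.g. FailedCreatePodSandBox).
events = v1.list_namespaced_event(
    namespace=namespace,
    field_selector=f"involvedObject.name={pod_name}",
)
for ev in events.items:
    print(ev.reason, ev.message)
```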

It's unclear if there are ways for higher-level APIs to identify this condition and automatically retry. Maybe Metaflow could retry a pod if it failed with exit code None?

In this case it is guaranteed that the pod never started, so restarting it should be safe.
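As a stopgap, the retry idea can be approximated at the user level with Metaflow's @retry decorator; the flow below is just an illustrative sketch, and note that @retry also re-runs tasks that fail for real reasons, not only this sandbox condition.

```python
from metaflow import FlowSpec, step, retry


class RetryDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.analyzecreatures)

    # Reschedule the task up to 3 more times if its pod dies before
    # reporting an exit code.
    @retry(times=3)
    @step
    def analyzecreatures(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RetryDemoFlow()
```

The same effect can also be applied to every step at run time with --with retry, e.g. python flow.py run --with retry --with kubernetes.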