Closed samzong closed 2 months ago
I0906 09:01:16.394403 7 kubeevents.go:60] event distributed-sleep-job-worker-2.17f29c21d6ebc4ac is too old, ignore it
I0906 09:01:16.409373 7 kubeevents.go:60] event distributed-sleep-job-worker-1.17f29c21d7df2f26 is too old, ignore it
I0906 09:01:16.416583 7 kubeevents.go:60] event distributed-sleep-job-worker-3.17f29c21d8655c07 is too old, ignore it
I0906 09:01:46.076922 7 recovery.go:135] recover controller received event: {TargetType:pod Namespace:default Name:distributed-sleep-job-worker-0 EventType:1 Message:container pytorch terminated with error: , exit code: 137}
I0906 09:01:46.100182 7 recovery.go:85] restart job default/distributed-sleep-job successfully
I0906 09:01:47.081255 7 kubeevents.go:60] event distributed-sleep-job-worker-0.17f29c28fb502705 is too old, ignore it
These are the event logs.
~ k get events distributed-sleep-job-worker-1.17f29b696540725b -o yaml
action: Binding
apiVersion: v1
eventTime: "2024-09-06T08:48:04.208911Z"
firstTimestamp: null
involvedObject:
  apiVersion: v1
  kind: Pod
  name: distributed-sleep-job-worker-1
  namespace: default
  resourceVersion: "28354"
  uid: 6f464dcd-4713-4b53-b77d-b8ffe56e0533
kind: Event
lastTimestamp: null
message: Successfully assigned default/distributed-sleep-job-worker-1 to orbstack
metadata:
  creationTimestamp: "2024-09-06T08:48:04Z"
  name: distributed-sleep-job-worker-1.17f29b696540725b
  namespace: default
  resourceVersion: "28360"
  uid: bcea10cd-e1d0-49e9-861c-85319cb58f08
reason: Scheduled
reportingComponent: default-scheduler
reportingInstance: default-scheduler-orbstack
source: {}
type: Normal
Fixed.