BaizeAI / kcover

Apache License 2.0

kcover-controller logs "is too old, ignore it", recovery failed #4

Closed samzong closed 2 months ago

samzong commented 2 months ago
I0906 09:01:16.394403       7 kubeevents.go:60] event distributed-sleep-job-worker-2.17f29c21d6ebc4ac is too old, ignore it
I0906 09:01:16.409373       7 kubeevents.go:60] event distributed-sleep-job-worker-1.17f29c21d7df2f26 is too old, ignore it
I0906 09:01:16.416583       7 kubeevents.go:60] event distributed-sleep-job-worker-3.17f29c21d8655c07 is too old, ignore it
I0906 09:01:46.076922       7 recovery.go:135] recover controller received event: {TargetType:pod Namespace:default Name:distributed-sleep-job-worker-0 EventType:1 Message:container pytorch terminated with error: , exit code: 137}
I0906 09:01:46.100182       7 recovery.go:85] restart job default/distributed-sleep-job successfully
I0906 09:01:47.081255       7 kubeevents.go:60] event distributed-sleep-job-worker-0.17f29c28fb502705 is too old, ignore it

This is an event object:

~ k get events distributed-sleep-job-worker-1.17f29b696540725b -o yaml
action: Binding
apiVersion: v1
eventTime: "2024-09-06T08:48:04.208911Z"
firstTimestamp: null
involvedObject:
  apiVersion: v1
  kind: Pod
  name: distributed-sleep-job-worker-1
  namespace: default
  resourceVersion: "28354"
  uid: 6f464dcd-4713-4b53-b77d-b8ffe56e0533
kind: Event
lastTimestamp: null
message: Successfully assigned default/distributed-sleep-job-worker-1 to orbstack
metadata:
  creationTimestamp: "2024-09-06T08:48:04Z"
  name: distributed-sleep-job-worker-1.17f29b696540725b
  namespace: default
  resourceVersion: "28360"
  uid: bcea10cd-e1d0-49e9-861c-85319cb58f08
reason: Scheduled
reportingComponent: default-scheduler
reportingInstance: default-scheduler-orbstack
source: {}
type: Normal

kebe7jun commented 2 months ago

Fixed.