AmadeusITGroup / workflow-controller

Kubernetes workflow controller
Apache License 2.0

Segmentation fault on stale jobs created by workflow after workflow-controller restart. #54

Closed: jhart99 closed this issue 6 years ago

jhart99 commented 6 years ago

The latest image in the Docker repository (codefresh/workflow-controller:latest, referenced by the k8s deployment) segfaults if there are stale workflow jobs left in Kubernetes. Once it gets into this state, workflow-controller keeps crashing and restarting.

Steps

  1. Submit some workflows.
  2. Kill workflow-controller after the workflows create jobs and pods.
  3. Restart workflow-controller.
  4. Some of the workflows will result in the crash shown in the log below; a rough shell sketch of these steps follows.
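For reference, a sketch of steps 1-3 in kubectl terms. The manifest name, Deployment name, and namespace are assumptions, not taken from this issue:

# 1. Submit a few workflows (manifest name is an assumption)
kubectl apply -f workflow.yaml
# 2. Once the workflows have created jobs and pods, stop the controller
#    (Deployment name and namespace are assumptions)
kubectl -n default scale deployment workflow-controller --replicas=0
# 3. Restart it and watch it crash-loop on the stale jobs
kubectl -n default scale deployment workflow-controller --replicas=1
kubectl -n default get pods -w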

I was able to resolve this by purging all of the jobs that workflow-controller had created, then restarting it. All the jobs were in the same context, so I didn't need to filter them out from any other jobs.

kubectl get jobs | awk '{printf "job/%s ", $1}' | xargs -n 10 -P 8 kubectl delete
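A slightly tighter variant of the same cleanup (still assuming every job in the namespace was created by workflow-controller) lets kubectl emit resource names directly, instead of piping the table output, header row included, through awk:

# Delete all jobs in the current namespace, 10 per kubectl invocation, 8 in parallel
kubectl get jobs -o name | xargs -n 10 -P 8 kubectl delete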

Error from the logs:

E0119 15:02:43.425448       1 controller.go:276] workflow default/8db8b247-05b8-46ca-8791-ecf846da2c7f created with non empty status. Going to be removed
E0119 15:02:43.857118       1 controller.go:186] unable to get Workflow default/8db8b247-05b8-46ca-8791-ecf846da2c7f: workflow.dag.example.com "8db8b247-05b8-46ca-8791-ecf846da2c7f" not found. Maybe deleted
E0119 15:03:40.972611       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:509
/usr/local/go/src/runtime/panic.go:491
/usr/local/go/src/runtime/panic.go:63
/usr/local/go/src/runtime/signal_unix.go:367
/go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:526
/go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:452
/go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:247
/go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:150
/go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:140
/go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:132
/go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:2337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xf02829]

goroutine 83 [running]:
github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x111
panic(0x1013aa0, 0x1953db0)
    /usr/local/go/src/runtime/panic.go:491 +0x283
github.com/sdminonne/workflow-controller/pkg/controller.(*WorkflowController).manageWorkflowJobStep(0xc4200c1220, 0xc4203f3500, 0xc420efb2d0, 0x5, 0xc421348380, 0x0)
    /go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:526 +0x7c9
github.com/sdminonne/workflow-controller/pkg/controller.(*WorkflowController).manageWorkflow(0xc4200c1220, 0xc4203f3500, 0xd8717f3cf, 0x1963d00)
    /go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:452 +0x187
github.com/sdminonne/workflow-controller/pkg/controller.(*WorkflowController).sync(0xc4200c1220, 0xc421192480, 0x2c, 0x0, 0x0)
    /go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:247 +0xc78
github.com/sdminonne/workflow-controller/pkg/controller.(*WorkflowController).processNextItem(0xc4200c1220, 0x7f3000)
    /go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:150 +0xd2
github.com/sdminonne/workflow-controller/pkg/controller.(*WorkflowController).runWorker(0xc4200c1220)
    /go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:140 +0x2b
github.com/sdminonne/workflow-controller/pkg/controller.(*WorkflowController).(github.com/sdminonne/workflow-controller/pkg/controller.runWorker)-fm()
    /go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:132 +0x2a
github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc420296f40)
    /go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5e
github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc420296f40, 0x3b9aca00, 0x0, 0x1, 0xc42019c4e0)
    /go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc420296f40, 0x3b9aca00, 0xc42019c4e0)
    /go/src/github.com/sdminonne/workflow-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/sdminonne/workflow-controller/pkg/controller.(*WorkflowController).Run
    /go/src/github.com/sdminonne/workflow-controller/pkg/controller/controller.go:132 +0x100
sdminonne commented 6 years ago

@jhart99 thanks for reporting this. I'm going to have a look.

sdminonne commented 6 years ago

@jhart99 I'm having a hard time reproducing the problem. Do you have logs for this? For example, from running the controller with --v=6? I'll keep trying anyway. Thanks again
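For reference, one way to capture such logs. This is only a sketch: the Deployment name, namespace, and the presence of an existing args list on the container are all assumptions:

# Append --v=6 to the controller's arguments (assumes the container already defines an args list)
kubectl -n default patch deployment workflow-controller --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--v=6"}]'
# Tail the live logs
kubectl -n default logs deploy/workflow-controller -f
# After a crash, fetch the previous container's logs from the affected pod
kubectl -n default logs <workflow-controller-pod> --previous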

jhart99 commented 6 years ago

I can get this to happen pretty consistently with the codefresh/workflow-controller docker container. I recompiled from the latest sources and cannot reproduce this, so I'll close it.

sdminonne commented 6 years ago

@jhart99 obviously we don't control the codefresh/workflow-controller instance. Thanks