flant / shell-operator

Shell-operator is a tool for running event-driven scripts in a Kubernetes cluster
https://flant.github.io/shell-operator/
Apache License 2.0

Port `waitForSynchronization` feature from addon-operator #212

Status: Open, opened by diafour 3 years ago

diafour commented 3 years ago

The problem:

```yaml
configVersion: v1
kubernetes:
- name: pods-in-separate-queue
  apiVersion: v1
  kind: Pod
  # default behaviour
  executeHookOnSynchronization: true
  executeHookOnEvent: ["Added"]
  queue: pods-queue
```

If this hook fails on Synchronization, "Added" tasks are still queued and executed in "pods-queue", even though the Synchronization task keeps failing and being retried in the "main" queue.

Logs:

```
{"binding":"pods-in-separate-queue","event":"kubernetes","hook":"with-error.sh","level":"error","msg":"Hook failed. Will retry after delay. Failed count is 14. Error: with-error.sh FAILED: exit status 1","queue":"main","task":"HookRun","time":"2020-11-18T09:46:19Z"}
...
{"binding":"schedule","event.id":"af7e05c0-729f-411e-92f9-d9cfe4f1deca","level":"info","msg":"queue task HookRun:pods-in-separate-queue:kubernetes:with-error.sh:pods-queue","operator.component":"handleEvents","queue":"pods-queue","task.id":"663cc695-7407-4210-8467-76d5b99a328a","time":"2020-11-18T09:46:49Z"}
...
{"binding":"pods-in-separate-queue","event":"kubernetes","hook":"with-error.sh","level":"info","msg":"Execute hook","queue":"pods-queue","task":"HookRun","time":"2020-11-18T09:46:50Z"}
{"binding":"pods-in-separate-queue","event":"kubernetes","hook":"with-error.sh","level":"info","msg":"BC: Got Event Added for 'pods-in-separate-queue'","output":"stdout","queue":"pods-queue","task":"HookRun","time":"2020-11-18T09:46:51Z"}
...
{"binding":"pods-in-separate-queue","event":"kubernetes","hook":"with-error.sh","level":"info","msg":"BC: Got Synchronization for 'pods-queue' with 12 objects","output":"stdout","queue":"main","task":"HookRun","time":"2020-11-18T09:46:51Z"}
...
{"binding":"pods-in-separate-queue","event":"kubernetes","hook":"with-error.sh","level":"error","msg":"Hook failed. Will retry after delay. Failed count is 15. Error: with-error.sh FAILED: exit status 1","queue":"main","task":"HookRun","time":"2020-11-18T09:46:51Z"}
```
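The interleaving in these logs can be reproduced with a toy Go model. This is not shell-operator code: the function `runUngated`, the one-task-per-tick scheduling and the fixed failure count are hypothetical simplifications. It only illustrates that when "main" and "pods-queue" are drained independently, nothing stops the "Added" task from running while Synchronization is still being retried.

```go
package main

import "fmt"

// runUngated is a hypothetical, simplified model of the reported behaviour.
// "main" and "pods-queue" are drained in parallel, one task per tick, and
// nothing links the two queues. The Synchronization task fails syncFailures
// times and stays at the head of "main" to be retried, while the "Added"
// event task in "pods-queue" runs regardless.
func runUngated(syncFailures int) []string {
	mainQueue := []string{"Synchronization"}
	podsQueue := []string{"Added"}
	failures := 0
	var trace []string
	for len(mainQueue) > 0 || len(podsQueue) > 0 {
		if len(mainQueue) > 0 {
			if failures < syncFailures {
				// Failed attempt: the task stays at the head of "main" for a retry.
				failures++
				trace = append(trace, fmt.Sprintf("main: %s failed (count %d)", mainQueue[0], failures))
			} else {
				trace = append(trace, "main: "+mainQueue[0]+" ok")
				mainQueue = mainQueue[1:]
			}
		}
		if len(podsQueue) > 0 {
			// No check on the Synchronization state: the event runs immediately.
			trace = append(trace, "pods-queue: "+podsQueue[0]+" ok")
			podsQueue = podsQueue[1:]
		}
	}
	return trace
}

func main() {
	for _, line := range runUngated(2) {
		fmt.Println(line)
	}
}
```

Running it prints the "pods-queue: Added ok" line second, between the two failed Synchronization attempts, which mirrors the order of events in the logs above.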

This problem is already solved in addon-operator by the `waitForSynchronization` option. See https://github.com/flant/addon-operator/issues/111.

diafour commented 3 years ago

#250 brings some changes: "Added" tasks are not queued until Synchronization succeeds.
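The behaviour described for #250 can be pictured as a per-binding gate in front of the named queue. This is a hypothetical Go sketch, not the code from #250: the `gate` type, its `onEvent`/`onSyncResult` methods, and the choice to replay held events in arrival order (rather than discard them, since the Synchronization snapshot already lists all objects) are assumptions made for illustration.

```go
package main

import "fmt"

// gate is a hypothetical per-binding barrier: event tasks that arrive before
// the binding's Synchronization has succeeded are held back instead of being
// handed to the binding's named queue (e.g. "pods-queue").
type gate struct {
	synced  bool
	pending []string // held events, in arrival order
	queued  []string // events actually handed to the named queue
}

// onEvent queues the event immediately once Synchronization has succeeded,
// otherwise it buffers the event.
func (g *gate) onEvent(ev string) {
	if g.synced {
		g.queued = append(g.queued, ev)
		return
	}
	g.pending = append(g.pending, ev)
}

// onSyncResult records the outcome of one Synchronization attempt. A failure
// keeps everything buffered (the task is retried in "main"); the first success
// opens the gate and releases buffered events in their original order.
func (g *gate) onSyncResult(ok bool) {
	if !ok || g.synced {
		return
	}
	g.synced = true
	g.queued = append(g.queued, g.pending...)
	g.pending = nil
}

func main() {
	g := &gate{}
	g.onEvent("Added pod-a")
	g.onSyncResult(false) // Synchronization failed, will be retried
	g.onEvent("Added pod-b")
	fmt.Println(len(g.queued), len(g.pending)) // 0 2
	g.onSyncResult(true)
	g.onEvent("Added pod-c")
	fmt.Println(g.queued) // [Added pod-a Added pod-b Added pod-c]
}
```

Only the first successful Synchronization releases events; later results are ignored because the gate is already open.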

I think we can revisit this in version 1.1.0 to implement Synchronization in parallel queues.