After workflow repaired task is executed two times

astelmashenko commented 1 year ago

Describe the bug We noticed that a task is sometimes executed twice. After enabling debug logs, we found that once WorkflowRepairService re-queued the task, it was for some reason executed two times:

INFO  2022-07-04T07:56:38,583 147034  com.netflix.conductor.core.reconciliation.WorkflowRepairService [sweeper-thread-1]  Task 425d9c94-dc30-441b-b21b-73ccc5118829 in workflow d6e20f06-c884-4c25-81a4-4a7c0eb3827e re-queued for repairs

DEBUG 2022-07-04T07:56:42,994 151445  com.netflix.conductor.contribs.tasks.http.HttpTask  [system-task-worker-1]  Response: 200, {bills={partyAUTHOR={biId=5200737, status=OPEN}, partyUNIVERSITY={biId=5200740, status=OPEN}}}, task:425d9c94-dc30-441b-b21b-73ccc5118829

DEBUG 2022-07-04T07:56:42,994 151445  com.netflix.conductor.contribs.tasks.http.HttpTask  [system-task-worker-0]  Response: 200, {bills={partyAUTHOR={biId=5200738, status=OPEN}, partyUNIVERSITY={biId=5200739, status=OPEN}}}, task:425d9c94-dc30-441b-b21b-73ccc5118829

What does WorkflowRepairService do, and do we need it at all? Why does this happen even though we have a lock service? Thanks.

Details
Conductor version: 3.7.2
Persistence implementation: Postgres
Queue implementation: Postgres
Lock: Redis

To Reproduce This happens from time to time; we have not found steps to reproduce it.

Expected behavior The HTTP task must be executed only once.

The original issue was opened in conductor-community, https://github.com/Netflix/conductor-community/issues/70, but nobody has responded in months.

manan164 commented 1 year ago

Hi @astelmashenko, WorkflowRepairService checks for the taskId before pushing anything into the queue. Are you using locks in your configuration? There is a high chance that workflow execution is not guarded by locks, so the task may be picked up by two different threads.
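
To illustrate the race described here: if two worker threads pick up the same task id and nothing claims it atomically, both will run it. The sketch below is only an illustration, not Conductor's code. `ClaimOnceExecutor` and `run_demo` are hypothetical names, and an in-process `threading.Lock` stands in for a distributed lock such as Redis.

```python
import threading


class ClaimOnceExecutor:
    """Hypothetical helper: runs the action for a task id at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = set()

    def execute(self, task_id, action):
        # Atomically claim the task id; only the first caller gets through.
        with self._lock:
            if task_id in self._claimed:
                return False
            self._claimed.add(task_id)
        # Run the action outside the lock so slow work does not block claims.
        action()
        return True


def run_demo(workers=2):
    """Start `workers` threads that all try to run the same task id."""
    executor = ClaimOnceExecutor()
    calls = []
    # The barrier releases all threads together to provoke the race.
    barrier = threading.Barrier(workers)
    task_id = "425d9c94-dc30-441b-b21b-73ccc5118829"  # id taken from the log above

    def worker():
        barrier.wait()
        executor.execute(task_id, lambda: calls.append(task_id))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)


if __name__ == "__main__":
    print("executions:", run_demo(2))
```

Without the claim step, each thread would call the action and the HTTP request would go out twice, which matches the two `Response: 200` lines for the same task id. A real deployment would need the claim to be shared across server instances, which is what the lock service is meant to provide.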

astelmashenko commented 1 year ago

@manan164, yes, we are using a lock (Redis). What I have in mind is a Conductor upgrade. For example, we fix something in our custom task and redeploy Conductor while thousands of workflows are running. How does it stop? For example, does it stop the decider first, wait for all running tasks to complete, close connections, and then shut down? The question is: is the shutdown process deterministic, and is there evidence that it shuts down gracefully?

Netflix / conductor

After workflow repaired task is executed two times #3618