conductor-oss / conductor

Conductor is an event driven orchestration platform
https://conductor-oss.org
Apache License 2.0

Two Workflow Instances Open on Failure #284

Open lironleizer opened 1 month ago

lironleizer commented 1 month ago

Description: Since upgrading to Conductor 3.16.0, we have seen unexpected behavior in one of our workflows. The workflow is defined as follows: (image: conductor issue)

When the WAIT EVENT receives a message, the workflow proceeds to the TERMINATE TASK. Occasionally, however, two failure workflows are started instead of one. They are nearly identical, but one has ownerApp set to "conductor" while the other has an empty ownerApp.

Main workflow:

{
  "ownerApp": "",
  "createTime": 1728474259443,
  "updateTime": 1728477038331,
  "status": "FAILED",
  "endTime": 1728477038331,
  "workflowId": "179dd481-8e4b-4e9d-905d-8f37f9b7c577",
  "tasks": […]
}

Main workflow output:

{
  "output": "",
  "conductor.failure_workflow": "5974405e-e4b6-4924-b0bf-fbcec3827e2b"
}

Failure workflow 1:

{
  "ownerApp": "conductor",
  "createTime": 1728477038313,
  "updateTime": 1728477039058,
  "status": "COMPLETED",
  "endTime": 1728477039058,
  "workflowId": "5974405e-e4b6-4924-b0bf-fbcec3827e2b",
  "tasks": […]
}

Failure workflow 2:

{
  "ownerApp": "",
  "createTime": 1728477038229,
  "updateTime": 1728477039145,
  "status": "COMPLETED",
  "endTime": 1728477039145,
  "workflowId": "c02b2cb8-d6c4-4aaa-bc1a-3c04a1585d80",
  "tasks": […]
}

According to the main workflow output, the recorded failure workflow ID (conductor.failure_workflow) belongs to the instance whose ownerApp is "conductor". The createTime values show that the two failure workflows were started 84 ms apart (1728477038229 and 1728477038313).
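For anyone checking the timing, the arithmetic can be reproduced with a few lines of Python. This is only a sketch over the epoch-millisecond values copied from the JSON above; the variable names are made up for illustration. It also shows that the instance with the empty ownerApp was created first, even though the main workflow output records the other one.

```python
# Epoch-millisecond values copied from the JSON in this report.
main_end_time = 1728477038331      # main workflow endTime
failure_1_create = 1728477038313   # failure workflow 1, ownerApp "conductor"
failure_2_create = 1728477038229   # failure workflow 2, ownerApp ""

# Gap between the two failure workflow starts.
gap_ms = failure_1_create - failure_2_create

# How long before the main workflow's endTime each failure workflow was created.
lead_1_ms = main_end_time - failure_1_create
lead_2_ms = main_end_time - failure_2_create

print(f"gap between failure workflows: {gap_ms} ms")
print(f"failure workflow 1 created {lead_1_ms} ms before main endTime")
print(f"failure workflow 2 created {lead_2_ms} ms before main endTime")
```

Both failure workflows were created before the main workflow's endTime was written, which seems consistent with two code paths handling the failure at nearly the same moment.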

Here are the relevant logs for further insight:

(image: log screenshot, image (1))

Based on these logs, we suspect a race condition on the workflow's status: the sweeper thread appears to act on the workflow while the event is still being processed by the main flow.

Expected Behavior: Only one failure workflow instance should be opened when the workflow fails.

Potential Cause: The issue appears to be a race condition in the decider queue, specifically around status updates as the workflow moves from the WAIT EVENT to the TERMINATE TASK. The sweeper thread may act on the workflow while the main flow is still processing the event.
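To illustrate the kind of guard we have in mind, here is a minimal Python sketch. It is not Conductor's code: `FailureWorkflowGuard`, `start_once`, and `start_failure_workflow` are hypothetical names, and the two threads merely stand in for the event-processing path and the sweeper. The idea is that whichever path gets there first starts the failure workflow, and the second path reuses the recorded ID instead of starting another one.

```python
import threading
import uuid


class FailureWorkflowGuard:
    """Hypothetical guard: at most one failure workflow per parent workflow."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = {}  # parent workflow id -> failure workflow id

    def start_once(self, parent_id, start_fn):
        # Check-and-start happens under one lock, so two callers cannot
        # both observe "not started yet" for the same parent.
        with self._lock:
            if parent_id not in self._started:
                self._started[parent_id] = start_fn(parent_id)
            return self._started[parent_id]


start_calls = []  # records every real start, to show it happens only once


def start_failure_workflow(parent_id):
    # Stand-in for whatever actually starts the failure workflow.
    failure_id = str(uuid.uuid4())
    start_calls.append(failure_id)
    return failure_id


guard = FailureWorkflowGuard()
parent = "179dd481-8e4b-4e9d-905d-8f37f9b7c577"  # main workflow ID from this report
results = []


def handle_failure():
    results.append(guard.start_once(parent, start_failure_workflow))


# Two threads stand in for the event-processing path and the sweeper.
threads = [threading.Thread(target=handle_failure) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"start calls: {len(start_calls)}, distinct IDs seen: {len(set(results))}")
```

An in-process lock like this only covers a single JVM or process. With several server instances, the same check-and-start would presumably need a distributed lock or a conditional update in the persistence layer.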

lironleizer commented 1 month ago

@dilip-lukose @v1r3n please advise.