Two Workflow Instances Open on Failure

Description: Since upgrading to Conductor 3.16.0, we have seen unusual behavior in one of our workflows. The workflow is defined as follows: [image: conductor issue]

When the WAIT EVENT receives a message, the workflow proceeds to the TERMINATE TASK. Occasionally, however, two failure workflows are started instead of one. They are nearly identical, but one has ownerApp set to "conductor" while the other has an empty ownerApp.

Main workflow:

{
  "ownerApp": "",
  "createTime": 1728474259443,
  "updateTime": 1728477038331,
  "status": "FAILED",
  "endTime": 1728477038331,
  "workflowId": "179dd481-8e4b-4e9d-905d-8f37f9b7c577",
  "tasks": […]
…
}

Main workflow output:

{
  "output": "",
  "conductor.failure_workflow": "5974405e-e4b6-4924-b0bf-fbcec3827e2b"
}

Failure workflow 1:

{
  "ownerApp": "conductor",
  "createTime": 1728477038313,
  "updateTime": 1728477039058,
  "status": "COMPLETED",
  "endTime": 1728477039058,
  "workflowId": "5974405e-e4b6-4924-b0bf-fbcec3827e2b",
  "tasks": […]
…
}

Failure workflow 2:

{
  "ownerApp": "",
  "createTime": 1728477038229,
  "updateTime": 1728477039145,
  "status": "COMPLETED",
  "endTime": 1728477039145,
  "workflowId": "c02b2cb8-d6c4-4aaa-bc1a-3c04a1585d80",
  "tasks": […]
…
}

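The failure workflow ID recorded in the main workflow output is that of failure workflow 1, the one with ownerApp set to "conductor". The createTime values show that the two failure workflows were started 84 ms apart (1728477038229 for the one with the empty ownerApp, 1728477038313 for the other), and both were created slightly before the main workflow's recorded endTime of 1728477038331.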
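The comparison can be reproduced from the excerpts alone. The following is a minimal sketch in Python that uses only the fields quoted in this report; the variable names are illustrative and nothing here calls a Conductor API.

```python
# Values copied from the JSON excerpts in this report.
main_output = {
    "output": "",
    "conductor.failure_workflow": "5974405e-e4b6-4924-b0bf-fbcec3827e2b",
}
failure_workflows = [
    {
        "ownerApp": "conductor",
        "createTime": 1728477038313,
        "workflowId": "5974405e-e4b6-4924-b0bf-fbcec3827e2b",
    },
    {
        "ownerApp": "",
        "createTime": 1728477038229,
        "workflowId": "c02b2cb8-d6c4-4aaa-bc1a-3c04a1585d80",
    },
]

# The failure workflow the main workflow actually points at.
referenced_id = main_output["conductor.failure_workflow"]
referenced = [w for w in failure_workflows if w["workflowId"] == referenced_id]

# Any other failure workflow is an unreferenced duplicate.
orphans = [w for w in failure_workflows if w["workflowId"] != referenced_id]

# Gap between the two createTime values, in milliseconds.
create_times = [w["createTime"] for w in failure_workflows]
gap_ms = max(create_times) - min(create_times)

print("referenced ownerApp:", repr(referenced[0]["ownerApp"]))
print("orphan ownerApp:", repr(orphans[0]["ownerApp"]))
print("gap (ms):", gap_ms)
```

The unreferenced instance is the one with the empty ownerApp, and it is also the one that was created first.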

Here are the relevant logs for further insight:

[image: image (1), log screenshot]

Based on these logs, we suspect a race condition on the workflow's status: the sweeper thread appears to act on the workflow while the event is still being processed by the main flow.

Expected Behavior: Exactly one failure workflow instance should be started when the workflow fails.

Potential Cause: The issue appears to be a race condition in the decider queue, specifically around status updates as the workflow moves from the WAIT EVENT to the TERMINATE TASK. The sweeper thread may be acting on the workflow prematurely, while event processing is still in progress in the main flow.
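To illustrate the suspected pattern, here is a minimal, self-contained Python simulation of a check-then-act race and one way to guard against it. It is not Conductor code: the `Workflow` class, `start_failure_workflow`, and both handlers are hypothetical stand-ins, and the ownerApp values passed to the two threads are only labels. The report does not establish which code path sets which ownerApp.

```python
import threading
import uuid


class Workflow:
    """Hypothetical stand-in for a workflow record (not Conductor's model)."""

    def __init__(self):
        self.failure_workflow_id = None   # what the main workflow output records
        self.started = []                 # every failure workflow actually started
        self.lock = threading.Lock()


def start_failure_workflow(wf, owner_app):
    """Hypothetical helper: record a started failure workflow, return its ID."""
    wf_id = str(uuid.uuid4())
    wf.started.append((owner_app, wf_id))
    return wf_id


def unguarded(wf, owner_app, barrier):
    # Check-then-act without synchronization.
    needs_start = wf.failure_workflow_id is None
    barrier.wait()  # force both threads to finish the check before either acts
    if needs_start:
        wf.failure_workflow_id = start_failure_workflow(wf, owner_app)


def guarded(wf, owner_app, barrier):
    barrier.wait()  # both threads still arrive at the same moment
    with wf.lock:   # the check and the act are now one atomic step
        if wf.failure_workflow_id is None:
            wf.failure_workflow_id = start_failure_workflow(wf, owner_app)


def run(handler):
    """Run two concurrent callers (e.g. event path and sweeper) on one workflow."""
    wf = Workflow()
    barrier = threading.Barrier(2)
    threads = [
        threading.Thread(target=handler, args=(wf, owner, barrier))
        for owner in ("conductor", "")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return wf


print("unguarded:", len(run(unguarded).started), "failure workflows started")
print("guarded:  ", len(run(guarded).started), "failure workflow started")
```

In the unguarded run both callers start a failure workflow, yet only one ID ends up recorded on the main workflow, which matches the symptom described above. In a multi-node deployment an in-process lock would not be sufficient; the guard would have to be a distributed lock or an atomic conditional update in the persistence layer.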

conductor-oss/conductor #284