conductor-oss / conductor

Conductor is an event driven orchestration platform
https://conductor-oss.org
Apache License 2.0

Conductor waits more than an hour to schedule the next task. #163

Closed (adrichem closed this 2 months ago)

adrichem commented 4 months ago

Describe the bug: Tasks with an execution time of several minutes complete successfully, but the next task is not scheduled until approximately an hour and a half later. The worker is running and idle. The task queue has a size of 0 and shows the worker polling every second.

Whenever I execute the workflow below, the tasks in fork all run fine. The first tasks in each parallel path of fork_2 complete, but the no-op-worker-SET2-1-2 and no-op-worker-SET2-2-2 tasks don't execute. The Execution tab in the UI shows them with status 'Not executed'.

Details: Tried Conductor versions 3.18.0 and 3.19.0.

Running based on the "Running Conductor Using Docker" documentation, for evaluation and demo purposes.

On the Orkes playground, this type of workflow runs fine. I checked the documentation, but did not find anything that seemed relevant to configure.

Workflow definition: Here's a workflow that demonstrates the issue.

{
  "updateTime": 1715972548277,
  "accessPolicy": {},
  "name": "why_so_long",
  "version": 1,
  "tasks": [
    {
      "name": "fork",
      "taskReferenceName": "fork",
      "inputParameters": {},
      "type": "FORK_JOIN",
      "forkTasks": [
        [
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET1-1-1",
            "inputParameters": {
              "Time": "00:00:01"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          },
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET1-1-2",
            "inputParameters": {
              "Time": "00:00:01"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          }
        ],
        [
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET1-2-1",
            "inputParameters": {
              "Time": "00:00:01"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          },
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET1-2-2",
            "inputParameters": {
              "Time": "00:00:01"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          }
        ]
      ],
      "startDelay": 0,
      "optional": false,
      "asyncComplete": false,
      "permissive": false
    },
    {
      "name": "join",
      "taskReferenceName": "join",
      "inputParameters": {},
      "type": "JOIN",
      "startDelay": 0,
      "joinOn": [
        "no-op-worker-SET1-1-1",
        "no-op-worker-SET1-1-2",
        "no-op-worker-SET1-2-1",
        "no-op-worker-SET1-2-2"
      ],
      "optional": false,
      "asyncComplete": false,
      "permissive": false
    },
    {
      "name": "fork_2",
      "taskReferenceName": "fork_2",
      "inputParameters": {},
      "type": "FORK_JOIN",
      "forkTasks": [
        [
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET2-1-1",
            "inputParameters": {
              "Time": "00:02:00"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          },
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET2-1-2",
            "inputParameters": {
              "Time": "00:02:00"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          }
        ],
        [
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET2-2-1",
            "inputParameters": {
              "Time": "00:02:00"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          },
          {
            "name": "no-op-worker",
            "taskReferenceName": "no-op-worker-SET2-2-2",
            "inputParameters": {
              "Time": "00:02:00"
            },
            "type": "SIMPLE",
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "permissive": false
          }
        ]
      ],
      "startDelay": 0,
      "optional": false,
      "asyncComplete": false,
      "permissive": false
    },
    {
      "name": "join_2",
      "taskReferenceName": "join_2",
      "inputParameters": {},
      "type": "JOIN",
      "startDelay": 0,
      "joinOn": [
        "no-op-worker-SET2-1-1",
        "no-op-worker-SET2-1-2",
        "no-op-worker-SET2-2-1",
        "no-op-worker-SET2-2-2"
      ],
      "optional": false,
      "asyncComplete": false,
      "permissive": false
    }
  ],
  "inputParameters": [],
  "outputParameters": {},
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": false,
  "ownerEmail": "test@example.com",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {},
  "inputTemplate": {}
}

Task definition:

{
  "createTime": 1715972548160,
  "createdBy": "",
  "accessPolicy": {},
  "name": "no-op-worker",
  "description": "Does nothing",
  "retryCount": 3,
  "timeoutSeconds": 0,
  "inputKeys": [
    "Time"
  ],
  "outputKeys": [],
  "timeoutPolicy": "TIME_OUT_WF",
  "retryLogic": "FIXED",
  "retryDelaySeconds": 60,
  "responseTimeoutSeconds": 3600,
  "inputTemplate": {
    "Time": "00:00:05.00"
  },
  "rateLimitPerFrequency": 0,
  "rateLimitFrequencyInSeconds": 1,
  "ownerEmail": "test@test.com",
  "backoffScaleFactor": 1
}

To Reproduce: The no-op-worker sleeps for the requested Time and returns a completed state. It's instantiated like this: [WorkerTask(taskType: "no-op-worker", batchSize: 100, domain: null, pollIntervalMs: 200, workerId: "deploy-worker")]
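For anyone who wants to reproduce this without the original worker, the behavior described above can be approximated with a minimal Python sketch. This is not the worker used in this report: the function name `no_op_worker`, the injectable `sleep` parameter, and the shape of the returned dict are assumptions for illustration only, and the fallback value simply mirrors the `inputTemplate` in the task definition.

```python
import time
from datetime import timedelta


def parse_time(value: str) -> timedelta:
    """Parse an "hh:mm:ss" or "hh:mm:ss.ff" string into a timedelta."""
    hours, minutes, seconds = value.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def no_op_worker(input_data: dict, sleep=time.sleep) -> dict:
    """Sleep for the requested Time, then report completion.

    The result dict is a hypothetical stand-in for whatever the real
    SDK's task result type looks like.
    """
    # Fall back to the default from the task definition's inputTemplate.
    delay = parse_time(input_data.get("Time", "00:00:05.00"))
    sleep(delay.total_seconds())
    return {"status": "COMPLETED", "outputData": {}}


if __name__ == "__main__":
    # Use a fake sleep so the demo returns immediately.
    recorded = []
    print(no_op_worker({"Time": "00:02:00"}, sleep=recorded.append), recorded)
```

Passing a fake `sleep` makes it possible to check the parsed delay without actually waiting.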

Expected behavior: I expect the next task to be scheduled for execution when its predecessor completes.
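To put a rough number on that expectation, here is a sketch that estimates the workflow's minimum runtime from its definition: sequential tasks add up, and a FORK_JOIN costs as much as its slowest branch. It assumes zero scheduling overhead and that each SIMPLE task takes exactly its `Time` input. The helper names (`parse_time`, `expected_duration`, `simple`, `fork`) are made up for this example, and the task list is a trimmed copy of the workflow above.

```python
from datetime import timedelta


def parse_time(value: str) -> timedelta:
    """Parse an "hh:mm:ss" or "hh:mm:ss.ff" string into a timedelta."""
    hours, minutes, seconds = value.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def expected_duration(tasks: list) -> timedelta:
    """Sum sequential tasks; a FORK_JOIN counts as its slowest branch."""
    total = timedelta()
    for task in tasks:
        if task["type"] == "FORK_JOIN":
            branches = (expected_duration(branch) for branch in task["forkTasks"])
            total += max(branches, default=timedelta())
        elif task["type"] == "SIMPLE":
            total += parse_time(task["inputParameters"]["Time"])
        # JOIN (and any other system task) is assumed to take no time.
    return total


def simple(ref: str, duration: str) -> dict:
    """Build a trimmed SIMPLE task entry."""
    return {"name": "no-op-worker", "taskReferenceName": ref,
            "type": "SIMPLE", "inputParameters": {"Time": duration}}


def fork(ref: str, prefix: str, duration: str) -> dict:
    """Build a FORK_JOIN with two branches of two sequential tasks each."""
    return {"name": ref, "taskReferenceName": ref, "type": "FORK_JOIN",
            "forkTasks": [[simple(f"no-op-worker-{prefix}-{b}-{i}", duration)
                           for i in (1, 2)] for b in (1, 2)]}


WORKFLOW_TASKS = [
    fork("fork", "SET1", "00:00:01"),
    {"name": "join", "taskReferenceName": "join", "type": "JOIN"},
    fork("fork_2", "SET2", "00:02:00"),
    {"name": "join_2", "taskReferenceName": "join_2", "type": "JOIN"},
]

if __name__ == "__main__":
    print(expected_duration(WORKFLOW_TASKS))  # prints 0:04:02
```

Under these assumptions the whole workflow should finish in roughly four minutes, which is the baseline against which the hour and a half delay stands out.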

v1r3n commented 4 months ago

Can you share how many running workflows you have in the system? Do you use Redis or Postgres for the setup?

adrichem commented 4 months ago

Hi @v1r3n, thanks for the response. There are no other running workflows. It's using Redis with the config of the default docker-compose.yaml from the repo.

anzerr commented 3 months ago

Had the same problem with "JOIN" taking hours to finish. Changing isAsync on the task back to "true" fixed the long delay: https://github.com/conductor-oss/conductor/pull/120#issuecomment-2063583089

v1r3n commented 3 months ago

Hi @anzerr, we are reverting this back to async: https://github.com/conductor-oss/conductor/pull/194/files#diff-ebc515b9038973f4691b79e4dbbfec232800c5a87e95cc2ddbd12033b0fd2926R129

adrichem commented 2 months ago

I see this is fixed on latest main.