Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

WAIT Task inside DO_WHILE causing infinite task creation which are already completed #3876

Open appunni-old opened 12 months ago

appunni-old commented 12 months ago

Describe the bug: Running the workflow below sends it into an infinite loop.

Details
Conductor version: 3.15.0
Persistence implementation: Postgres and MySQL
Queue implementation: MySQL and Postgres
Lock: Redis
Workflow definition:

{
  "createTime": 1701489520469,
  "createdBy": "owner@email.com",
  "updatedBy": "owner@email.com",
  "accessPolicy": {},
  "name": "test_do_while",
  "description": "Workflow details",
  "version": 1,
  "tasks": [
    {
      "name": "default__do_while",
      "taskReferenceName": "task_1__loop_databricks",
      "inputParameters": {},
      "type": "DO_WHILE",
      "startDelay": 0,
      "optional": false,
      "asyncComplete": false,
      "loopCondition": "if ($.task_1__loop_databricks['iteration'] < 200) { true; } else { false; }",
      "loopOver": [
        {
          "name": "default__sleep",
          "taskReferenceName": "task_1__wait_databricks",
          "inputParameters": {
            "duration": "20 seconds",
            "tenantId": "csit"
          },
          "type": "WAIT",
          "startDelay": 0,
          "optional": false,
          "asyncComplete": false
        }
      ]
    }
  ],
  "inputParameters": [],
  "outputParameters": {},
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": false,
  "ownerEmail": "owner@email.com",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {},
  "inputTemplate": {}
}

Error in conductor server

conductor-server          | 2023-12-02 04:56:48.224 ERROR 13 --- [m-task-worker-8] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 94c1d30a-aef6-4861-be18-4fbfcd03743c could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.227 ERROR 13 --- [-task-worker-11] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 8c683b28-0c10-42dd-894b-2aebead3e3e8 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.234 ERROR 13 --- [m-task-worker-9] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 976eb165-97af-4451-800b-b506341bd938 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.250 ERROR 13 --- [-task-worker-10] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 86de75ad-195b-4d18-86f6-7b1280702751 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.293 ERROR 13 --- [-task-worker-12] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 822de4fa-2291-4516-82b9-bd6ef7f8b0ac could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.409 ERROR 13 --- [-task-worker-13] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 7ed85099-e58f-45e1-845a-c44e141113e5 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.431 ERROR 13 --- [-task-worker-14] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 784e801d-fcb6-488a-b575-6d476c86a6aa could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.466 ERROR 13 --- [-task-worker-15] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 75e2b4ac-225e-447b-b8ff-b9ab5164c642 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.744 ERROR 13 --- [-task-worker-16] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: a2a4b709-7c51-4b24-a26f-f49cffbcf877 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:48.880 ERROR 13 --- [-task-worker-17] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: 73526e0e-0ede-4bda-8e4a-9d77d250e947 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.103 ERROR 13 --- [-task-worker-18] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: a47dded2-3d7c-4d25-82eb-021cdd19f288 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.120 ERROR 13 --- [-task-worker-20] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: a725f7d5-da1c-4d42-a453-0550592f8b06 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.121 ERROR 13 --- [-task-worker-21] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: ab0319c0-cb0f-49f5-8a7e-99c04fef1809 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.123 ERROR 13 --- [-task-worker-22] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: abdb849e-6fd9-43cd-b924-5415a933e6bb could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.189 ERROR 13 --- [-task-worker-23] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b2a69eb2-bea1-4764-b986-ad75bb82e9dc could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.191 ERROR 13 --- [-task-worker-24] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c9410bb4-d96e-4ca9-a6be-bd45e6f0ea53 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.192 ERROR 13 --- [m-task-worker-1] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b3bfbe2f-9c32-41aa-89aa-08bed04c47ce could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.194 ERROR 13 --- [m-task-worker-2] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b4b10f44-1530-4a93-971d-6025762d837b could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.196 ERROR 13 --- [m-task-worker-3] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c5496ec6-2730-43eb-a265-86b06ae35807 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.197 ERROR 13 --- [-task-worker-23] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c3369393-abb7-4c1f-907d-4d02eda5e9a4 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.207 ERROR 13 --- [m-task-worker-5] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: c26ffe62-4c7f-4d64-a1d5-ef203afc4272 could not be found while executing WAIT
conductor-server          | 2023-12-02 04:56:49.210 ERROR 13 --- [m-task-worker-6] c.n.c.c.e.AsyncSystemTaskExecutor        : TaskId: b6b9aef8-e941-48e2-a6

To Reproduce
1. Go to the UI at http://localhost:5000
2. Create the workflow definition above
3. Go to the workbench
4. Trigger this workflow

WARNING: This creates an infinite loop. Only use it with a local Conductor setup that can be deleted.

Expected behavior: The loop runs and waits 20 seconds between iterations.

Screenshots: The workflow is stuck and not moving forward.


appunni-old commented 12 months ago

Not able to replicate on the Orkes platform.

appunni-old commented 12 months ago

I debugged it by stepping through line by line. Attaching the first log lines as well.

595060 [sweeper-thread-24] INFO  com.netflix.conductor.core.reconciliation.WorkflowRepairService [] - Task 46abe269-5daf-403a-9b15-cbd7878b8bed in workflow 7d137e5b-304e-449c-9607-6413bfee8fd0 re-queued for repairs
667288 [HikariPool-1 housekeeper] WARN  com.zaxxer.hikari.pool.HikariPool [] - HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m16s793ms).
686827 [system-task-worker-2] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 1445ba4c-0bd5-4826-a359-984fd4da86a5 could not be found while executing WAIT
692015 [system-task-worker-3] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 05cbf978-86e3-48ef-b5cf-52b481edd5f5 could not be found while executing WAIT
699409 [system-task-worker-4] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 95ffee82-0cc9-468a-8ce8-af7b1d8438c1 could not be found while executing WAIT
700895 [system-task-worker-5] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: d7f9d0a7-3525-4eff-a07a-179bc57ab349 could not be found while executing WAIT
701862 [system-task-worker-7] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 75811fa6-ec79-40c3-9136-88b33a3a53f3 could not be found while executing WAIT
702397 [system-task-worker-6] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 80b2e11a-4f28-4f28-8737-26d1d7abd010 could not be found while executing WAIT
702762 [system-task-worker-9] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 58055094-5e0d-4613-beb6-078f940994fa could not be found while executing WAIT

Oh sorry, this is broken. I ran it on the Orkes platform and it went into the same loop. I regret it now; I should have been more careful. Can someone help?

appunni-old commented 12 months ago

And I definitely think it has something to do with the config, because I created the same workflow via the UI and it worked completely fine. On Orkes the default cluster task limit was 1000, but this workflow created 7552 tasks. I terminated the workflow; otherwise it would have kept running.

appunni-old commented 12 months ago

Issue identified: This happens when a task reference name contains a double underscore. In that case the has(...) check in the loop below evaluates to false. We should validate taskReferenceName values, either when the workflow definition is accepted or in the Start Workflow API, and reject names that contain a double underscore.

        for (TaskModel t : workflow.getTasks()) {
            if (doWhileTaskModel
                            .getWorkflowTask()
                            .has(TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName()))
                    && !doWhileTaskModel.getReferenceTaskName().equals(t.getReferenceTaskName())
                    && doWhileTaskModel.getIteration() == t.getIteration()) {
                relevantTask = relevantTasks.get(t.getReferenceTaskName());
                if (relevantTask == null || t.getRetryCount() > relevantTask.getRetryCount()) {
                    relevantTasks.put(t.getReferenceTaskName(), t);
                }
            }
        }

TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName())

is the culprit, as it tries to recover the original task reference name by splitting on the delimiter, i.e. "__", and keeping only the first token.

    public static String removeIterationFromTaskRefName(String referenceTaskName) {
        String[] tokens = referenceTaskName.split(TaskUtils.LOOP_TASK_DELIMITER);
        return tokens.length > 0 ? tokens[0] : referenceTaskName;
    }
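To make the effect concrete, here is a minimal, standalone sketch of what that split does to the reference name from this workflow. It assumes the iteration is appended to the reference name as "__" plus the iteration number (the delimiter is the one named in this thread). The class RefNameSplitDemo and the method removeTrailingIteration are hypothetical names for illustration, not part of Conductor, and the second method is only one possible alternative, not a proposed patch.

```java
final class RefNameSplitDemo {
    // Delimiter named in this thread; assumed to also separate the iteration suffix.
    static final String LOOP_TASK_DELIMITER = "__";

    // Same logic as the method quoted above: split on the delimiter, keep the first token.
    static String removeIterationFirstToken(String referenceTaskName) {
        String[] tokens = referenceTaskName.split(LOOP_TASK_DELIMITER);
        return tokens.length > 0 ? tokens[0] : referenceTaskName;
    }

    // Hypothetical alternative: strip only a trailing "__<digits>" suffix, if present.
    static String removeTrailingIteration(String referenceTaskName) {
        int idx = referenceTaskName.lastIndexOf(LOOP_TASK_DELIMITER);
        if (idx < 0) {
            return referenceTaskName;
        }
        String suffix = referenceTaskName.substring(idx + LOOP_TASK_DELIMITER.length());
        boolean numeric = !suffix.isEmpty() && suffix.chars().allMatch(Character::isDigit);
        return numeric ? referenceTaskName.substring(0, idx) : referenceTaskName;
    }

    public static void main(String[] args) {
        // Reference name of the WAIT task in iteration 1, under the assumption above.
        String iterated = "task_1__wait_databricks" + LOOP_TASK_DELIMITER + 1;
        System.out.println(removeIterationFirstToken(iterated)); // task_1
        System.out.println(removeTrailingIteration(iterated));   // task_1__wait_databricks
    }
}
```

With the first-token logic, the lookup is done with "task_1", which is not the reference name of any task inside loopOver. The alternative is still ambiguous for a user-defined name that itself ends in "__" followed by digits, so it would not remove the need for validation.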

Because that check evaluates to false, the workflow goes into an infinite loop and keeps creating tasks.
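The validation suggested earlier in this thread could look roughly like the sketch below. It is a standalone illustration only: TaskRefNameValidator and findReservedDelimiter are hypothetical names, and where such a check would hook into Conductor's workflow definition validation or the Start Workflow API is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class TaskRefNameValidator {
    // Delimiter named in this thread; treated here as reserved in reference names.
    static final String LOOP_TASK_DELIMITER = "__";

    // Returns every reference name that contains the reserved delimiter.
    static List<String> findReservedDelimiter(List<String> taskReferenceNames) {
        List<String> offending = new ArrayList<>();
        for (String name : taskReferenceNames) {
            if (name != null && name.contains(LOOP_TASK_DELIMITER)) {
                offending.add(name);
            }
        }
        return offending;
    }

    public static void main(String[] args) {
        // Both reference names from the workflow definition in this issue.
        List<String> names = List.of("task_1__loop_databricks", "task_1__wait_databricks");
        List<String> offending = findReservedDelimiter(names);
        if (!offending.isEmpty()) {
            System.out.println("Rejected taskReferenceName values: " + offending);
        }
    }
}
```

Both reference names in the workflow above would be rejected by this check. Renaming them with single underscores would avoid the delimiter entirely.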