Open appunni-old opened 12 months ago
I was not able to replicate this on the Orkes platform.
I debugged it by stepping through line by line; the first log lines are attached as well.
595060 [sweeper-thread-24] INFO com.netflix.conductor.core.reconciliation.WorkflowRepairService [] - Task 46abe269-5daf-403a-9b15-cbd7878b8bed in workflow 7d137e5b-304e-449c-9607-6413bfee8fd0 re-queued for repairs
667288 [HikariPool-1 housekeeper] WARN com.zaxxer.hikari.pool.HikariPool [] - HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m16s793ms).
686827 [system-task-worker-2] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 1445ba4c-0bd5-4826-a359-984fd4da86a5 could not be found while executing WAIT
692015 [system-task-worker-3] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 05cbf978-86e3-48ef-b5cf-52b481edd5f5 could not be found while executing WAIT
699409 [system-task-worker-4] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 95ffee82-0cc9-468a-8ce8-af7b1d8438c1 could not be found while executing WAIT
700895 [system-task-worker-5] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: d7f9d0a7-3525-4eff-a07a-179bc57ab349 could not be found while executing WAIT
701862 [system-task-worker-7] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 75811fa6-ec79-40c3-9136-88b33a3a53f3 could not be found while executing WAIT
702397 [system-task-worker-6] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 80b2e11a-4f28-4f28-8737-26d1d7abd010 could not be found while executing WAIT
702762 [system-task-worker-9] ERROR com.netflix.conductor.core.execution.AsyncSystemTaskExecutor [] - TaskId: 58055094-5e0d-4613-beb6-078f940994fa could not be found while executing WAIT
Oh sorry, this is broken. I ran it on the Orkes platform and it went into the same loop. I regret that now; I should have been more careful. Can someone help?
I definitely think it has something to do with the config, because I created the same workflow via the UI and it worked completely fine. In the Orkes default cluster the task limit was 1000, but this workflow created 7552 tasks. I terminated the workflow; otherwise it would have kept running.
Issue identified: this happens when a task reference name contains a double underscore, which means the check in the code below evaluates to false. We should add validation, either on the workflow definition or in the Start Workflow API, that rejects task reference names containing a double underscore.
for (TaskModel t : workflow.getTasks()) {
    if (doWhileTaskModel
                    .getWorkflowTask()
                    .has(TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName()))
            && !doWhileTaskModel.getReferenceTaskName().equals(t.getReferenceTaskName())
            && doWhileTaskModel.getIteration() == t.getIteration()) {
        relevantTask = relevantTasks.get(t.getReferenceTaskName());
        if (relevantTask == null || t.getRetryCount() > relevantTask.getRetryCount()) {
            relevantTasks.put(t.getReferenceTaskName(), t);
        }
    }
}
TaskUtils.removeIterationFromTaskRefName(t.getReferenceTaskName()) is the culprit: it tries to recover the original task reference name by splitting on the loop delimiter, i.e. "__", and keeping only the first token.
public static String removeIterationFromTaskRefName(String referenceTaskName) {
    String[] tokens = referenceTaskName.split(TaskUtils.LOOP_TASK_DELIMITER);
    return tokens.length > 0 ? tokens[0] : referenceTaskName;
}
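To make the failure concrete, here is a minimal standalone sketch of the split behaviour. It does not use Conductor classes: LoopRefNameDemo, appendIteration and removeLastIteration are hypothetical names for illustration, and the "<refName>__<iteration>" naming scheme is an assumption based on the delimiter described above. removeLastIteration is only one possible alternative, not the project's fix.

```java
public class LoopRefNameDemo {
    // Assumed to match the "__" loop delimiter mentioned above.
    static final String LOOP_TASK_DELIMITER = "__";

    // Assumption: loop iterations are encoded as "<refName>__<iteration>".
    static String appendIteration(String refName, int iteration) {
        return refName + LOOP_TASK_DELIMITER + iteration;
    }

    // Same logic as the method quoted above: keep only the text before the first "__".
    static String removeIterationFromTaskRefName(String referenceTaskName) {
        String[] tokens = referenceTaskName.split(LOOP_TASK_DELIMITER);
        return tokens.length > 0 ? tokens[0] : referenceTaskName;
    }

    // Hypothetical alternative: strip only a trailing "__<digits>" suffix.
    static String removeLastIteration(String referenceTaskName) {
        int idx = referenceTaskName.lastIndexOf(LOOP_TASK_DELIMITER);
        if (idx <= 0) {
            return referenceTaskName;
        }
        String suffix = referenceTaskName.substring(idx + LOOP_TASK_DELIMITER.length());
        boolean numeric = !suffix.isEmpty() && suffix.chars().allMatch(Character::isDigit);
        return numeric ? referenceTaskName.substring(0, idx) : referenceTaskName;
    }

    public static void main(String[] args) {
        String ok = appendIteration("wait_task", 1);   // "wait_task__1"
        String bad = appendIteration("wait__task", 1); // "wait__task__1"

        System.out.println(removeIterationFromTaskRefName(ok));  // wait_task
        System.out.println(removeIterationFromTaskRefName(bad)); // wait (original name is lost)
        System.out.println(removeLastIteration(bad));            // wait__task
    }
}
```

With a single underscore the original reference name is recovered, but with "wait__task" the lookup key becomes "wait", which matches no task in the loop. Note that the alternative still misreads a name that legitimately ends in "__<digits>" when no iteration suffix has been appended, so validation of reference names is still worth having.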
This leads to an infinite loop that keeps creating new tasks.
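The validation suggested earlier could look roughly like the sketch below. It is standalone Java, not Conductor's actual validation code: TaskRefNameValidator and findInvalidRefNames are hypothetical names, and where such a check would be wired in (workflow definition registration or the Start Workflow API) is left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaskRefNameValidator {
    // Assumed to match the "__" loop delimiter mentioned above.
    static final String LOOP_TASK_DELIMITER = "__";

    // Returns every reference name that contains the reserved delimiter.
    static List<String> findInvalidRefNames(List<String> taskReferenceNames) {
        List<String> invalid = new ArrayList<>();
        for (String name : taskReferenceNames) {
            if (name != null && name.contains(LOOP_TASK_DELIMITER)) {
                invalid.add(name);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        List<String> refNames = Arrays.asList("loop_task", "wait__task", "http_call");
        List<String> invalid = findInvalidRefNames(refNames);
        if (!invalid.isEmpty()) {
            // A real server would reject the definition here instead of printing.
            System.out.println("Task reference names must not contain \"__\": " + invalid);
        }
    }
}
```

Rejecting such names up front means the workflow never starts, rather than looping until someone terminates it.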
Describe the bug
While running the workflow below, it goes into an infinite loop.
Details
Conductor version: 3.15.0
Persistence implementation: Postgres and MySQL
Queue implementation: MySQL and Postgres
Lock: Redis
Workflow definition:
Error in conductor server
To Reproduce
1. Go to the UI at http://localhost:5000
2. Create the above task definition
3. Go to the workbench
4. Trigger this workflow
WARNING: This creates an infinite loop. Only use it with a local Conductor setup that can be deleted.
Expected behavior
The loop runs and waits 20 seconds between iterations.
Screenshots
The workflow is stuck and not moving forward.