Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

DO_WHILE loop does not restart after WAIT task update #3738

Open dpozinen opened 1 year ago

dpozinen commented 1 year ago

Describe the bug

After a task inside a DO_WHILE loop is updated, it takes a long time (2+ minutes) to start the next loop iteration.

Details

Conductor version:
Persistence implementation: MySQL
Queue implementation: MySQL

Detailed issue description

We have the following setup:

Main WF
  task A
  nested WF
    Loop
      some task A
      some task B
      WAIT task

This WAIT task is updated via the API and completes almost instantly, but some task A of the next iteration doesn't start for a while. You can see there's about a 2 minute delay, sometimes more.

Screenshot 2023-08-17 at 19 11 55

During debugging, I noticed that the loop doesn't restart until the next sweep cycle. Since we have only 2 sweeper threads, the MySQL queue implementation limits itself to selecting only 2 messages per query, so this takes some time.
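To illustrate why a small batch size can add up to minutes, here is a rough back-of-envelope model. It is not Conductor's actual sweeper code: it simply assumes a queue drained in fixed-size batches at a fixed polling interval, and the function names, queue position, and interval are all hypothetical.

```python
import math


def polls_until_picked(position: int, batch_size: int) -> int:
    """Polls needed before the message at 1-based `position` is selected,
    assuming each poll takes exactly `batch_size` messages from the front."""
    if position < 1 or batch_size < 1:
        raise ValueError("position and batch_size must be >= 1")
    return math.ceil(position / batch_size)


def estimated_wait_seconds(position: int, batch_size: int, poll_interval_s: float) -> float:
    """Estimated wait, assuming one poll every `poll_interval_s` seconds."""
    return polls_until_picked(position, batch_size) * poll_interval_s


# Hypothetical numbers: the workflow is 240th in the queue, 2 messages are
# selected per query, and one query runs per second.
print(estimated_wait_seconds(240, 2, 1.0))  # 120.0
```

Under these assumed numbers the workflow would wait about two minutes for the sweeper to reach it, which is the order of magnitude observed here.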

Now, my question is - judging by the code, am I correct to assume that the workflow should actually restart immediately?

My task is indeed a loop task

Task: TaskModel{taskType='WAIT', status=IN_PROGRESS, inputData={until=2023-08-18 20:59}, 
referenceTaskName='delay_wait_task__16', retryCount=0, seq=117 ... **iteration=16**, subWorkflowId='null', 
subworkflowChanged=false} belonging to Workflow generic_workflow.1/88001d3b-6398-424f-b556-6dae8849919d.RUNNING 
being updated [spanId=c256afeb36cb6fe0, traceId=c256afeb36cb6fe0]

Am I missing something here?

This should really be marked as help_wanted instead of bug

dpozinen commented 1 year ago

I was able to figure it out. It turns out Conductor will update/complete the task regardless of which workflowInstanceId you pass as a parameter, as long as that workflow exists. My mistake was passing the Main WF instance id instead of the nested WF instance id. If I pass the nested one, everything executes as expected.

I don't know if there is any kind of use case where this would be necessary, but this seems like a missing validation issue?