Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

Join Takes Too Long Although Tasks Are Complete #3472

Open john-larson opened 1 year ago

john-larson commented 1 year ago

Joins after forks take about 30-40 seconds even though the tasks they are waiting on are all complete. And this happens consistently every time the workflow runs.

Here is a screenshot of the timeline. The task that took 32 seconds is the join task.

[Screenshot: workflow timeline in which the join task takes 32 seconds]

What can the reason for this be?

Details
Conductor version: 3.13.2
Persistence implementation: Postgres
Queue implementation: Postgres
Lock: None

Task definition:

{
  "name": "join_after_queries",
  "taskReferenceName": "join_after_queries",
  "inputParameters": {},
  "type": "JOIN",
  "startDelay": 0,
  "joinOn": [
    "get_a",
    "get_b",
    "get_c",
    "get_d",
    "get_e",
    "get_f"
  ],
  "optional": false,
  "asyncComplete": false
}

Expected behavior
The join task should complete as soon as the tasks it is waiting on are complete.

v1r3n commented 1 year ago

Hi @john-larson do you have locks enabled?

conductor.app.workflowExecutionLockEnabled=true

john-larson commented 1 year ago

Hi @v1r3n, we do not have that setting in our config file.

john-larson commented 1 year ago

I have also come across the following discussion on this topic: https://github.com/Netflix/conductor/discussions/3436

It looks like this might be a recently introduced bug.

Dyson-Ido commented 1 year ago

I also have this issue. If the JOIN joins on zero tasks, the JOIN task itself executes immediately. However, if it joins on at least one task, it takes about 30 seconds even though the awaited task completed long ago. Is anyone looking into this issue? It's definitely a bug. @manan164, could you help investigate this issue?

manan164 commented 1 year ago

Hi @Dyson-Ido, can you please try setting the conductor.app.systemTaskWorkerCallbackDuration property to 1s?
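
As a minimal sketch, assuming the server reads a Spring-style properties file (the exact file name and location depend on the deployment and are not specified in this thread), the setting might look like this:

```properties
# Assumption: this line goes in the properties file the Conductor server actually loads.
# Sets the system task worker callback duration to 1 second, as suggested above.
conductor.app.systemTaskWorkerCallbackDuration=1s
```

The server typically needs to be restarted for a changed property to take effect. When running in a container, make sure the file inside the container is the one that was updated.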

Dyson-Ido commented 1 year ago

> Hi @Dyson-Ido , Can you please try setting conductor.app.systemTaskWorkerCallbackDuration property to 1s

I added this property to the server configuration file, but the JOIN task still took an extra 30 seconds to complete. @manan164

Dyson-Ido commented 1 year ago

> Hi @Dyson-Ido , Can you please try setting conductor.app.systemTaskWorkerCallbackDuration property to 1s

@manan164 Sorry for the false report. Setting the above property to 1s actually works. My first attempt failed because I hadn't updated the configuration file inside the Docker container. After updating it, the JOIN completes without the delay.

Could you please tell me which documentation I can consult for information on all properties such as conductor.app.systemTaskWorkerCallbackDuration? Otherwise, as a developer, I don't know which properties to tune or what they mean. Please also add this property change to the documentation site in case other developers encounter this issue. I also don't know whether this behavior is by design or whether we are using this property to work around the issue.

manan164 commented 1 year ago

Hi @Dyson-Ido, that is good feedback. Here is the information about this property. Ideally, this class holds all property information for the Conductor platform. Let me know if this helps. We will create a new page for all property information.