conductor-oss / conductor

Conductor is an event driven orchestration platform
https://conductor-oss.org
Apache License 2.0
17.45k stars, 448 forks

Workflow executions are getting stuck #209

Open arorashivam opened 2 months ago

arorashivam commented 2 months ago

Describe the bug

Workflow executions are getting stuck due to tasks taking too long to schedule.

Further debugging details:

  1. In the sweeper flow, if a task is in the SCHEDULED state and its taskDefinition is not present, the un-ack timeout is set to workflowTimeout. In other words, the sweeper will not sweep this workflow again until workflowTimeout has elapsed.
  2. Note: I am not sure whether the un-ack timeout is reset once the task moves from SCHEDULED to IN_PROGRESS.
  3. As a result, whenever a workflow execution reaches a state where it depends on the sweeper to trigger the decide, it remains stuck.
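To make the reported behavior concrete, here is a minimal Python model of the fallback described in the list above. It is not Conductor's actual code: the function name, its parameters, and the "shorter default delay" branch are assumptions made for illustration only.

```python
def sweep_postpone_seconds(task_status, has_task_definition,
                           workflow_timeout_seconds, default_postpone_seconds):
    """Return how long the sweeper waits before revisiting the workflow.

    Hypothetical model of the behavior reported in this issue.
    """
    if task_status == "SCHEDULED" and not has_task_definition:
        # Reported behavior: fall back to the whole workflow timeout.
        return workflow_timeout_seconds
    # Assumption: every other case uses some shorter default delay.
    return default_postpone_seconds


# A workflow with a 24-hour timeout and a SCHEDULED task that has no task
# definition would not be swept again until the full 24 hours have passed.
stuck_delay = sweep_postpone_seconds("SCHEDULED", False, 24 * 60 * 60, 30)
```

With a long workflow timeout, the sweep delay for such a task equals the entire workflow timeout, which matches the "stuck" symptom described here.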

Details

Conductor version: 3.20.0
Persistence implementation: Postgres
Queue implementation: Dynoqueues
Lock: Redis
Workflow definition: N/A
Task definition: N/A
Event handler definition: N/A

Expected behavior

The sweeper should continue sweeping a workflow once a task moves from SCHEDULED to IN_PROGRESS.


lbestatlas commented 2 weeks ago

I'd like to add some additional context to this issue.

As noted above:

> In sweeper flow, If a task is in SCHEDULED state, the un-ack time is set as workflowTimeout if taskDefinition is not present. In other words the sweeper will now only sweep this workflow after workflowTimeout.

This issue has been observed for async system tasks, but it could also occur for SIMPLE tasks if no timeouts are set on the TaskDefinition while a timeout is set on the workflow. These types of tasks do not transition from SCHEDULED to IN_PROGRESS within a "decide", so the sweep can pick them up while they are still in the SCHEDULED state.

A timely workflow sweep is critical in cases where an execution lock cannot be obtained for some reason, because the decide is deliberately deferred to the sweep in that case. Furthermore, we have seen issues with the JOIN task when it was set to synchronous, as it does not trigger a decide when it completes (this was resolved when it was reverted to async).
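As a rough illustration of why the sweep matters when the lock is unavailable, the pattern can be sketched in Python as below. This is a simplified model and not Conductor's implementation: `decide_or_defer` and its parameters are hypothetical, a local `threading.Lock` stands in for the Redis lock, and the workflow is assumed to already sit in the sweeper's queue.

```python
import threading


def decide_or_defer(execution_lock, workflow_id, decide):
    """Run decide() only if the execution lock is free.

    Returns True if decide ran, False if it was left to the next sweep.
    All names here are hypothetical.
    """
    if execution_lock.acquire(blocking=False):
        try:
            decide(workflow_id)
            return True
        finally:
            execution_lock.release()
    # Lock not obtained: nothing is evaluated now. Progress depends on the
    # sweeper revisiting the workflow, so a long un-ack timeout delays it.
    return False
```

In this model, a `False` return means the workflow makes no progress until the sweeper comes back to it, so the un-ack timeout directly bounds how long the execution can appear stuck.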

It seems like there should be another setting, "maxSweepDelay", to use as the fallback un-ack time, set at the workflow level, the system level, or both.
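One possible shape for that fallback, sketched in Python: "maxSweepDelay" does not exist in Conductor today, and the function name, parameter names, and the precedence (workflow level over system level) are all assumptions for the sake of discussion.

```python
def fallback_unack_seconds(workflow_timeout_seconds,
                           system_max_sweep_delay=None,
                           workflow_max_sweep_delay=None):
    """Pick the fallback un-ack time when no task-level timeout applies.

    Hypothetical precedence: workflow-level maxSweepDelay, then the
    system-level value, then the workflow timeout (the behavior
    reported in this issue).
    """
    if workflow_max_sweep_delay is not None:
        return workflow_max_sweep_delay
    if system_max_sweep_delay is not None:
        return system_max_sweep_delay
    return workflow_timeout_seconds
```

For example, `fallback_unack_seconds(86400, system_max_sweep_delay=60)` would have the sweeper revisit the workflow after 60 seconds instead of 24 hours, while leaving both settings unset keeps the current behavior.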