Open arorashivam opened 2 months ago
I'd like to add some additional context to this issue.
As noted above, in the sweeper flow, if a task is in the SCHEDULED state and no taskDefinition is present, the unack time is set to workflowTimeout. In other words, the sweeper will not sweep this workflow again until workflowTimeout has elapsed.
This issue has been observed for async system tasks, but it could also occur for SIMPLE tasks if no timeouts are set on the TaskDefinition but a timeout is set on the workflow. These types of tasks do not transition from SCHEDULED to IN_PROGRESS within a "decide", so the sweeper can pick them up while they are still in the SCHEDULED state.
A timely workflow sweep is critical in cases where an execution lock cannot be obtained for some reason, because the decide is deliberately deferred to the sweep in that case. Furthermore, we have seen issues with the JOIN task when it was set to synchronous, as it does not trigger a decide when it completes (this was resolved when it was reverted to async).
It seems like there should be another setting "maxSweepDelay" to use as the fallback unack time, set either at the workflow level, system level or both.
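To make the proposal concrete, here is a rough sketch in Python of how such a cap might behave. This is not Conductor's actual code: the function name, parameter names, and the simplified decision logic are all assumptions made for illustration, and "maxSweepDelay" is only the setting suggested above, not an existing option.

```python
from typing import Optional


def sweep_delay_seconds(
    task_status: str,
    task_timeout_seconds: Optional[int],
    workflow_timeout_seconds: int,
    default_delay_seconds: int,
    max_sweep_delay_seconds: Optional[int] = None,
) -> int:
    """Return how long to postpone the next sweep of a workflow.

    Hypothetical model of the behaviour described in this issue,
    not the real sweeper implementation.
    """
    if task_status == "SCHEDULED" and task_timeout_seconds:
        # A task-level timeout is available: use it.
        delay = task_timeout_seconds
    elif task_status == "SCHEDULED" and workflow_timeout_seconds:
        # No task-level timeout: fall back to the workflow timeout.
        # This is the case that can leave a workflow unswept for a long time.
        delay = workflow_timeout_seconds
    else:
        # Anything else in this simplified model uses the default delay.
        delay = default_delay_seconds

    if max_sweep_delay_seconds is not None:
        # Proposed "maxSweepDelay": never postpone the sweep longer than this.
        delay = min(delay, max_sweep_delay_seconds)
    return delay


# Without the cap, a one-day workflow timeout postpones the sweep for a day.
uncapped = sweep_delay_seconds("SCHEDULED", None, 86400, 30)
# With a 60-second cap, the same workflow is swept again after at most 60 seconds.
capped = sweep_delay_seconds("SCHEDULED", None, 86400, 30, max_sweep_delay_seconds=60)
```

Applying the cap as a final `min` keeps shorter task-level timeouts untouched and only limits the long fallback, which is why it could be set at the system level, the workflow level, or both.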
Describe the bug
Workflow executions are getting stuck due to tasks taking too long to schedule.
Further debugging details: SCHEDULED to IN_PROGRESS
Details
Conductor version: 3.20.0
Persistence implementation: Postgres
Queue implementation: Dynoqueues
Lock: Redis
Workflow definition: N/A
Task definition: N/A
Event handler definition: N/A
Expected behavior
Sweeper to continue sweeping a workflow once a task moves from SCHEDULED to IN_PROGRESS.