conductor-oss / conductor

Conductor is an event driven orchestration platform
https://conductor-oss.org
Apache License 2.0
Bug: Race Condition in WorkflowSweeper Leading to Inconsistent Workflow States #213

Open rq-dbrady opened 1 month ago

rq-dbrady commented 1 month ago

Describe the bug
We have identified a race condition in the WorkflowSweeper class that can leave a workflow in inconsistent states across different threads. The issue is critical because it affects the reliability and correctness of workflow execution and completion checks.

Details
Conductor version: 3.17
Persistence implementation: Postgres, OpenSearch
Queue implementation: RedisCluster
Lock: Redis

Steps to Reproduce:
1. Deploy the application with at least 30 replicas in a Kubernetes environment.
2. Use a high sweeper rate of about 25 ms and a high thread count.
3. Use a Redis cluster with Redis lock for workflow execution.
4. Execute workflows at a rate of approximately 75-90 workflows per second.
5. Monitor the state of workflows and watch for inconsistencies.

Observed Behavior
1. Workflows are fetched from executionDaoFacade before the lock is acquired.
2. The verifyAndRepair method mutates the workflow state without proper synchronization.
3. The workflow lock is released before the workflow is removed from the queue.

Together, these conditions create a time window of roughly 50 to 100 µs in which a workflow can be in two different states concurrently on different threads. As a result, workflow listeners or completion checks may fail, with workflows erroneously marked as "Running" even after completion has been triggered.
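
To make the first observation concrete, here is a minimal sketch of how a fetch that happens before the lock can end up overwriting a newer state. It is a simplified model, not Conductor code: the class `StaleSweepSketch`, the in-memory `store`, and the workflow id `wf-1` are all hypothetical, and the two "threads" are replayed step by step on a single thread so the interleaving is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the reported interleaving; none of these names are Conductor APIs.
class StaleSweepSketch {
    // Hypothetical stand-in for the persisted workflow status, keyed by workflow id.
    static final Map<String, String> store = new HashMap<>();

    // Replays the two threads' steps in the problematic order on one thread.
    static String replayInterleaving() {
        store.put("wf-1", "RUNNING");

        // Sweeper thread: reads the workflow BEFORE taking the lock.
        String sweeperSnapshot = store.get("wf-1");

        // Worker thread: takes the lock, finishes the workflow, releases the lock.
        store.put("wf-1", "COMPLETED");

        // Sweeper thread: now takes the lock and writes back its stale copy.
        store.put("wf-1", sweeperSnapshot);

        return store.get("wf-1");
    }

    public static void main(String[] args) {
        // Prints "RUNNING": the completed state was overwritten by the stale snapshot.
        System.out.println(replayInterleaving());
    }
}
```

In this model the lock itself is never violated; the stale result comes purely from reading the workflow outside the locked section.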

Expected Behavior
1. Workflows should maintain consistent states across all threads.
2. Proper locking should be enforced to prevent state mutations without synchronization.
3. Workflow locks should only be released after the workflow is securely removed from the queue.
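
The expected ordering could be sketched roughly as follows: acquire the lock, fetch the workflow, repair it, remove it from the queue if it is finished, and only then release the lock. This is an illustration under assumptions, not a proposed patch: `OrderedSweepSketch`, `store`, `sweepQueue`, and the single in-process `ReentrantLock` are hypothetical stand-ins for the persistence layer, the sweeper queue, and the per-workflow Redis lock.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a sweep that keeps every step inside the locked section.
class OrderedSweepSketch {
    static final Map<String, String> store = new HashMap<>();   // stand-in for persistence
    static final Set<String> sweepQueue = new HashSet<>();      // stand-in for the sweeper queue
    static final ReentrantLock workflowLock = new ReentrantLock(); // stand-in for the workflow lock

    // Returns the steps in the order they ran, so the ordering can be inspected.
    static List<String> sweep(String workflowId) {
        List<String> steps = new ArrayList<>();
        workflowLock.lock();
        steps.add("lock");
        try {
            // Fetch only after the lock is held, so the copy cannot be stale.
            String status = store.get(workflowId);
            steps.add("fetch");

            // Placeholder for a verify-and-repair step; it runs under the lock.
            steps.add("repair");

            // Remove a finished workflow from the queue while still holding the lock.
            if ("COMPLETED".equals(status)) {
                sweepQueue.remove(workflowId);
                steps.add("dequeue");
            }
        } finally {
            // Release last, after the queue removal.
            steps.add("unlock");
            workflowLock.unlock();
        }
        return steps;
    }

    public static void main(String[] args) {
        store.put("wf-1", "COMPLETED");
        sweepQueue.add("wf-1");
        System.out.println(sweep("wf-1"));
    }
}
```

With a distributed lock the same ordering would also need lease expiry to be considered, which this in-process sketch does not model.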

Screenshots

[Image: Screenshot 2024-07-18 at 10 58 28]

v1r3n commented 1 month ago

Hi @rq-dbrady, we are investigating.