conductor-oss / conductor

Conductor is an event driven orchestration platform
https://conductor-oss.org
Apache License 2.0
17.84k stars 457 forks

Performance degradation caused by WAIT task change to Async #144

Open BatyaPinski opened 5 months ago

BatyaPinski commented 5 months ago

Describe the bug

After upgrading Conductor from version 3.11.3 to version 3.16.0, we encountered significant performance degradation across our system, visible as increased CPU usage, memory consumption, and network traffic. We traced the root cause to the change that made the "WAIT" task asynchronous, which was introduced in version 3.14.0.

When we reverted this change, the performance of our system returned to normal levels.

Details

- Conductor version: 3.16.0
- Persistence implementation: Redis
- Queue implementation: Orkes Queue
- Lock: Redis

Expected behavior

The performance of the system should remain stable after the Conductor upgrade, without significant degradation.

Screenshots

[image] [image]

Suggested Solution

Revert the change that made the "WAIT" task asynchronous to restore optimal system performance.

v1r3n commented 4 months ago

@BatyaPinski we are investigating. Previously, the WAIT task relied on the sweeper to complete, which means the completion guarantee for a WAIT task was at least 30 seconds (or the frequency at which the decider runs). This meant you could not wait for, say, 30 seconds or less, and scaling a system with a lot of WAIT tasks was tightly coupled to the performance of the sweeper.

Making WAIT asynchronous solves that issue and allows you to have WAITs as short as a few seconds.
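To make the timing argument concrete, here is a rough model of the two behaviours described above. It is only a sketch and not Conductor's code: the function names, the fixed 30-second sweep interval, and the assumption that the first sweep happens one full interval after the task starts are all hypothetical simplifications. It also ignores queue and polling latency on the asynchronous path.

```python
import math


def sweeper_completion(wait_s: float, sweep_interval_s: float = 30.0) -> float:
    """Seconds until a WAIT completes if it is only checked on sweeper ticks.

    Assumption (hypothetical): ticks occur at sweep_interval_s, 2 * sweep_interval_s, ...
    measured from the moment the WAIT task starts.
    """
    if wait_s <= sweep_interval_s:
        # Even a very short wait is not noticed before the first tick.
        return sweep_interval_s
    # Round the requested wait up to the next tick boundary.
    ticks = math.ceil(wait_s / sweep_interval_s)
    return ticks * sweep_interval_s


def async_completion(wait_s: float) -> float:
    """Seconds until a WAIT completes if it is re-evaluated when its timer expires.

    Assumption (hypothetical): scheduling overhead is treated as zero.
    """
    return wait_s


if __name__ == "__main__":
    for requested in (5, 30, 31, 90):
        print(
            f"requested={requested}s "
            f"sweeper={sweeper_completion(requested)}s "
            f"async={async_completion(requested)}s"
        )
```

Under these assumptions, a requested 5-second wait completes after 30 seconds on the sweeper path but after 5 seconds on the asynchronous path. The sketch says nothing about the CPU, memory, or network cost of either path, which is what this issue reports.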