danielgerlag / workflow-core

Lightweight workflow engine for .NET Standard
MIT License

Question: Performance and Queuing Mechanism #595

Open keelerm84 opened 4 years ago

keelerm84 commented 4 years ago

I have a question regarding workflow throughput and how it relates to the queuing mechanism.

Workflow Overview

  1. Download a document.
  2. Send the document to a preprocessing service via an API call.
  3. Wait for an event (emitted when the preprocessing service has completed the request asynchronously).
  4. Send the preprocessed data to a machine learning inference service via an API call.
  5. Wait for an event (emitted when the ML service has completed the request asynchronously).
  6. Hit a callback API with the results of the workflow. (A sketch of the workflow definition follows below.)
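
For reference, a workflow with this shape would be defined with the fluent builder API roughly as follows. This is a minimal sketch: the step classes (assumed to be `StepBody` implementations), the `DocumentData` class, and the event names are hypothetical stand-ins, not my actual implementation.

```csharp
using WorkflowCore.Interface;

public class DocumentData
{
    public string DocumentId { get; set; }
}

public class DocumentWorkflow : IWorkflow<DocumentData>
{
    public string Id => "document-workflow";
    public int Version => 1;

    public void Build(IWorkflowBuilder<DocumentData> builder)
    {
        builder
            .StartWith<DownloadDocument>()                              // step 1
            .Then<SendToPreprocessor>()                                 // step 2: async API call
            .WaitFor("PreprocessingComplete", data => data.DocumentId)  // step 3: suspend until event
            .Then<SendToInference>()                                    // step 4: async API call
            .WaitFor("InferenceComplete", data => data.DocumentId)      // step 5: suspend until event
            .Then<SendCallback>();                                      // step 6
    }
}
```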

I am using the Redis provider for queuing and locking, though it should be noted that only a single node is running for this test.
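
The registration looks roughly like this (connection string and prefix are placeholders, and exact extension method signatures may vary by package version):

```csharp
using Microsoft.Extensions.DependencyInjection;
using WorkflowCore.Interface;

services.AddWorkflow(cfg =>
{
    // WorkflowCore.Providers.Redis: distributed queueing and locking
    cfg.UseRedisQueues("localhost:6379", "myapp");
    cfg.UseRedisLocking("localhost:6379", "myapp");
});
```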

Performance Results

If I execute a single request, the entire process takes about 3.6 seconds, which is great. I then sent more and more requests through the application to see how response times behave at scale. The time for the first request to complete scaled essentially linearly with the number of requests: with 10 requests, the first one doesn't complete for 4.7 seconds; at 100 requests, 25.12 seconds; at 500, 107.35 seconds.
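
For illustration, the load was generated by starting batches of workflows through the host, along these lines (a hypothetical driver, reusing the names from the workflow sketch above):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using WorkflowCore.Interface;

// Hypothetical load driver: queue up N instances of the workflow above
// and report how long it took to hand them all to the engine.
async Task RunBatch(IWorkflowHost host, int n)
{
    var sw = Stopwatch.StartNew();
    for (var i = 0; i < n; i++)
    {
        await host.StartWorkflow("document-workflow",
            new DocumentData { DocumentId = Guid.NewGuid().ToString() });
    }
    Console.WriteLine($"Queued {n} workflows in {sw.Elapsed}");
}
```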

I have included a graph of the min, max, and avg response times as I've increased the number of jobs below.

[Graph: min/max/avg response time vs. number of jobs]

What this seems to suggest is that steps are more or less being processed sequentially across all workflows: for request 1, step 3 isn't processed until step 2 has completed for every request. I can see why that might be the case, particularly since the Redis queue provider is a simple list structure.

Given that steps 2 and 4 are asynchronous calls to external APIs, we could see a massive improvement in throughput if the engine could take advantage of that. Do you have any recommendations on how to achieve something like this?
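
For completeness, the external services complete steps 3 and 5 by publishing events back to the host, roughly like this (event name and key match the hypothetical workflow sketch above):

```csharp
// Called from the callback endpoint the preprocessing service hits when done;
// the event key (documentId) matches the key used in WaitFor.
await host.PublishEvent("PreprocessingComplete", documentId, preprocessedPayload);
```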

Please let me know if there is any more information I can provide for you.

keelerm84 commented 4 years ago

Adding a little more information about this problem.

Perhaps the global locking mechanism is stepping on its own toes and artificially adding a delay to the processing time?

danielgerlag commented 4 years ago

Interesting observations! I think we'd need to do some testing around how spikes of new workflows affect performance; the locking mechanism sounds like a prime suspect.

keelerm84 commented 4 years ago

Let me know if there are particular tests you'd like me to run, and I'll do my best to get you that information.

ccavin commented 4 years ago

I have noticed this as well, but I'm working with SQL Server rather than a Redis queue. Specifically, it seems that when a workflow step completes, the workflow is put at the back of the queue of workflow IDs to process. The result looks like a round-robin schedule: process step 1 on all workflows, then process step 2, and so on.
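
To make that concrete, the effective loop looks something like the sketch below. This is a simplification for illustration (distributed locking and error handling elided, names illustrative), not the actual workflow-core source:

```csharp
// One worker's processing loop (simplified). Each pass executes only the
// currently runnable pointers, then the workflow goes to the BACK of the
// shared queue -- producing the round-robin effect across workflows.
while (!token.IsCancellationRequested)
{
    var workflowId = await queueProvider.DequeueWork(QueueType.Workflow, token);
    if (workflowId == null)
        continue;

    var workflow = await persistenceStore.GetWorkflowInstance(workflowId);
    await executor.Execute(workflow);             // single pass over runnable pointers
    await persistenceStore.PersistWorkflow(workflow);

    if (workflow.Status == WorkflowStatus.Runnable)
        await queueProvider.QueueWork(workflowId, QueueType.Workflow);  // back of the line
}
```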

This is especially noticeable on startup if you have a bunch of workflows in a runnable state (say, recovering from a node failure or a node restart). They all get dumped into the queue and are processed one step at a time in lock step.

I would be keen to see some configurable control over the scheduling mechanism: definitely the option to specify FIFO processing (each workflow, ordered by submission time, executes as many steps as possible until an intentional delay() or error), maybe scheduling with a priority indicator, and possibly an option for multiple queues (say, one for in-memory, quick, non-persistent work and one for persisted, long-running work - <edit> never mind - I just found SyncWorkflowRunner).

Anyway, I'm just thinking as I type at the moment, but I have experienced this behavior as well so thought I'd chime in.

ccavin commented 4 years ago

Specific to the behavior referenced, I think the issue is with the WorkflowExecutor::Execute() method. It retrieves all of the currently executable ExecutionPointers, loops through them, and then places the workflow back on the queue for future execution. It doesn't take into account the fact that executing an ExecutionPointer may generate another executable ExecutionPointer. So, I think it should be in a while loop:

```csharp
var exePointers = new List<ExecutionPointer>(workflow.ExecutionPointers
    .Where(x => x.Active && (!x.SleepUntil.HasValue || x.SleepUntil < _datetimeProvider.UtcNow)));

while (exePointers.Count > 0)
{
    foreach (var pointer in exePointers)
    {
        ...
    }

    // Re-query: executing a pointer may have activated new pointers.
    exePointers = new List<ExecutionPointer>(workflow.ExecutionPointers
        .Where(x => x.Active && (!x.SleepUntil.HasValue || x.SleepUntil < _datetimeProvider.UtcNow)));
}
```

VishnuReghuMDSL commented 12 months ago

@danielgerlag We are experiencing this issue as well. Are there any plans to resolve this in the near future? Thanks.

winchelll commented 5 months ago

Is there any update or resolution on this issue?