Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Prioritize activities queue #1067

Open anthonychu opened 4 years ago

anthonychu commented 4 years ago

Transferred from UserVoice for discussion.

Original suggestion from Oleksandr:

On my project we tried using durable functions with a queue trigger to process resource-consuming requests at a steady pace instead of being overwhelmed by hundreds of messages arriving at the same time. A fairly basic scenario for queues, isn't it? But after a few weeks of conversations with the Azure Support team, we came to the conclusion that it isn't possible to achieve, and had to move the project to another technology.

I should mention that we had to configure the maxConcurrentActivityFunctions setting to limit the load on our SQL DB, so the "processing power" available to orchestrators was limited.

This resulted in a situation where hundreds of messages arrived in the queue and hundreds of orchestrators started in parallel. Each of them called dozens of activities one by one, and since all activities were just put into a queue, each orchestrator received very little processing time. If we consider the graph of all activities, it is processed according to a BFS algorithm; I would much prefer DFS. If an orchestrator needs 1 minute to complete and we have 100 of them, I would expect them to be processed one by one, so after 1 minute 1 task is completed and 99 are not even started, after 2 minutes 2 tasks are completed, etc.; but the current behavior is that it starts working on 100 different tasks in parallel and cannot complete a single one for an hour. That's very important if there is a timeout associated with the workflow: it's better to have 60 tasks complete within an hour timeout and the remaining 40 fail without even starting than to fail all 100 due to the timeout.

It's tricky to find an ideal activity planning strategy, but ideally we should aim to give more processing time to orchestrators that are closest to completion and leave longer-running tasks for later. Estimation of each orchestrator's remaining time can be done either manually, or based on AppInsights and some AI. If that is too complex, following the 1990s approach of assigning an integer priority to an orchestrator (like the Windows thread scheduler does) may be a good enough idea too. Among same-priority orchestrators, choose the one that was started earliest.

olitomlinson commented 4 years ago

If orchestrator needs 1 minute to complete and we have 100 of them, I would expect them to be processed one by one, so in 1 minute 1 task completed and 99 not even started, in 2 minutes - 2 tasks completed etc

If I understand this correctly, the user wishes to allow just one orchestration to be in a state of running at any one point in time.

Full disclosure, I've never tried this, but I think this can be achieved by setting "partitionCount" : 1 which will ensure all orchestrations will get processed by the same VM, thereby guaranteeing the effectiveness of setting "maxConcurrentOrchestratorFunctions" : 1 so that the next orchestration can't start until the previous one has reached a terminal status.

Additionally, the user could then scale concurrency up to 16 concurrent orchestrations, if he or she wished, by increasing the partitionCount property.
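
A minimal host.json sketch of that suggestion (untested; assumes the default Azure Storage provider and the Durable Functions v2.x host.json layout):

```json
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "partitionCount": 1
      },
      "maxConcurrentOrchestratorFunctions": 1
    }
  }
}
```

With a single partition, all orchestrations are handled by one worker, which is what makes the per-worker maxConcurrentOrchestratorFunctions limit act as a global limit.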

cgillum commented 4 years ago

@anthonychu What do you think about @olitomlinson's suggestion? It seems reasonable to me. Perhaps we need another section in our Performance and Scale documentation which describes how to limit work like this?

Boaz101 commented 4 years ago

I am also seeing something similar if I set partitionCount to 1, maxConcurrentActivityFunctions to 1, and maxConcurrentOrchestratorFunctions to 1. I take the durable function template that says hello to different cities and place a Task.Delay of 30s in the SayHello activity function (changing it to async). If I kick off 15 instances of the orchestrator, the first instance takes 15 minutes to complete, because the runtime appears to execute one activity function at a time, chosen from across all running instances. It doesn't appear to favor activity functions from earlier instances. Favoring them would be very useful, so that long-running multi-part workflows complete in a reasonable amount of time and later requests wait.

ristaloff commented 4 years ago

@cgillum I'm having the same issue as Oleksandr. Setting maxConcurrentOrchestratorFunctions = 1 and/or partitionCount = 1 does not help. I have an Azure function with a ServiceBusTrigger that starts a new durable function whenever it receives a new message. If I send 10 messages to the queue, the ServiceBusTrigger function will start 10 durable functions. All 10 orchestrations will then take turns processing their activities and sub-orchestrations, and they will all finish at approximately the same time. This is OK when receiving 10 messages a minute, but I sometimes get spikes of 500 messages in a minute. One orchestration takes about 20-45s to run by itself, which means I have to wait 500 * 30s ~ 4h to process the first message received.

So maxConcurrentOrchestratorFunctions = 1 does not keep me from having more than one orchestration in the "running" state at a time. Another note: activities and sub-orchestrations from the first orchestration should have priority over the others until it finishes, right?

olitomlinson commented 4 years ago

Hmmm, in that case I'm really not sure what the purpose of the 'maxConcurrentOrchestratorFunctions' configuration is, if it doesn't limit how many are running?

ristaloff commented 4 years ago

Found this: https://github.com/Azure/azure-functions-durable-extension/issues/730 So it works as designed: it only limits the number of orchestrations held in memory. But how can I limit the number of orchestrations running? I guess I could try enabling extendedSessions and setting the timeout to 1 minute. Then the orchestrations will count towards maxConcurrentOrchestratorFunctions, and most will complete within 1 minute. It would be nice if the extended session timeout could be reset for each activity that completes within the orchestration, to allow the whole orchestration to complete before time runs out.

olitomlinson commented 4 years ago

@ristaloff good find!

@cgillum I think the docs might need some clarification on this setting? The biggest thing for me is that I didn't realise that when an orchestration was awaiting, it wouldn't count towards the limit.

Going back to the issue, I actually think there might be a user-code solution to this problem, by using an eternal orchestration as a coordinator of when to start other orchestrations which need to be processed serially, but I’d need to try it out.

cgillum commented 4 years ago

Correct, the maxConcurrentOrchestratorFunctions setting is meant to control the number of orchestrations that are active in memory at once. It cannot be used to serialize the execution of running orchestrations. I'll look into clarifying this in the documentation.

But yes, the way to accomplish this would be to use another orchestration to do the global enforcement. I haven't tested it, but something like this might work (C#):

[FunctionName("OrchestrationCoordinator")]
public static async Task CallOrchestrator([OrchestrationTrigger] IDurableOrchestrationContext ctx)
{
    // Wait for a client to signal that a new orchestration should be run.
    var startArgs = await ctx.WaitForExternalEvent<StartOrchestrationArgs>("StartOrchestration");

    // Run the requested orchestration to completion before accepting the next one.
    await ctx.CallSubOrchestratorAsync<object>(
        startArgs.FunctionName,
        startArgs.InstanceId,
        startArgs.Input);

    // Restart as an eternal orchestration; preserveUnprocessedEvents carries over
    // any "StartOrchestration" events that arrived while the sub-orchestration ran.
    ctx.ContinueAsNew(null, preserveUnprocessedEvents: true);
}
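
For completeness, the StartOrchestrationArgs type above is not defined in the snippet; here is one hypothetical shape for it, along with an untested sketch of how a client function could signal the coordinator (the queue name and the fixed "coordinator-singleton" instance ID are illustrative assumptions, and the coordinator instance must already have been started, e.g. once at deployment):

```csharp
public class StartOrchestrationArgs
{
    public string FunctionName { get; set; }
    public string InstanceId { get; set; }
    public object Input { get; set; }
}

// Hypothetical trigger: forward each incoming message to the coordinator as a
// "StartOrchestration" event. Using a fixed coordinator instance ID funnels
// all requests through a single eternal orchestration, serializing them.
[FunctionName("EnqueueOrchestration")]
public static async Task Run(
    [ServiceBusTrigger("requests")] StartOrchestrationArgs args,
    [DurableClient] IDurableOrchestrationClient client)
{
    await client.RaiseEventAsync("coordinator-singleton", "StartOrchestration", args);
}
```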

ristaloff commented 4 years ago

I managed to run one orchestration at a time with the settings below, but it would also wait 90s after each orchestration completed until the next one started.

"extensions": {
    "durableTask": {
      "extendedSessionsEnabled": true,
      "extendedSessionIdleTimeoutInSeconds": 90,
      "maxConcurrentActivityFunctions": 10,
      "storageProvider": {
        "partitionCount": 1
      },
      "maxConcurrentOrchestratorFunctions": 1
    }
  },

Also, do sub-orchestrations count towards maxConcurrentOrchestratorFunctions?

cgillum commented 4 years ago

Yes, sub-orchestrations do count against this limit.

ristaloff commented 4 years ago

Hi, I tried to use an OrchestrationCoordinator function as proposed. When messages are received, I raise an event on the OrchestrationCoordinator instance, which then processes the messages one by one. But the raised events are only held in memory, so by restarting the app we risk losing all messages currently held in memory.

I think my only viable option now is to save messages to storage when they are received, and then have an eternal orchestrator or timer trigger read unprocessed messages from storage and process them.

Seems to me that this could be a feature of the durable task framework.
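
A rough, untested sketch of that fallback (all names are hypothetical; it assumes an Azure Storage queue as the durable buffer, with queues.batchSize set to 1 in host.json so messages drain one at a time):

```csharp
// Persist each incoming Service Bus message to an Azure Storage queue so it
// survives app restarts instead of living only in coordinator memory.
[FunctionName("PersistMessage")]
public static void PersistMessage(
    [ServiceBusTrigger("requests")] string message,
    [Queue("pending-orchestrations")] out string queued)
{
    queued = message;
}

// Drain the storage queue one message at a time. Polling the status keeps the
// queue message locked until the orchestration completes, so the next message
// is not dequeued until then (watch the function timeout with this pattern).
[FunctionName("ProcessNext")]
public static async Task ProcessNext(
    [QueueTrigger("pending-orchestrations")] string message,
    [DurableClient] IDurableOrchestrationClient starter)
{
    string instanceId = await starter.StartNewAsync("ProcessMessage", (object)message);

    DurableOrchestrationStatus status;
    do
    {
        await Task.Delay(TimeSpan.FromSeconds(5));
        status = await starter.GetStatusAsync(instanceId);
    } while (status.RuntimeStatus == OrchestrationRuntimeStatus.Running ||
             status.RuntimeStatus == OrchestrationRuntimeStatus.Pending);
}
```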

mpaul31 commented 4 years ago

@cgillum is the above true, regarding unprocessed events being stored in-memory rather than durably stored along with the orchestration state?

cgillum commented 4 years ago

No, I don’t believe so. An app restart should never result in any data loss. Messages that are buffered in-memory are still backed by durable storage. The only case where I might expect messages to get dropped is if you call ContinueAsNew but don’t specify preserveUnprocessedEvents: true.

shyamal890 commented 1 year ago

Any long term solution without the workaround suggested by @cgillum ?

Also, context.ContinueAsNew(null, preserveUnprocessedEvents: true) is not available in the JavaScript v4 programming model, it seems.

davidmrdavid commented 1 year ago

@shyamal890: can you please open an issue on the JS repo about this missing feature -> https://github.com/Azure/azure-functions-durable-js/? Thanks