Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

1-2hr delay between activity task being scheduled and being run #1687

Closed - dwdarkstorm1308 closed this issue 3 years ago

dwdarkstorm1308 commented 3 years ago

Description

We have a durable function which is tasked with extracting data from two Excel files and producing a list of differences between the data in the files. In an effort to minimise the time taken to extract the data, the orchestration creates two async activities, one per file, so the files are processed in parallel. The activity function receives a file location as an argument, extracts the relevant data, writes the extracted data to blob storage and returns a file location for the extracted data. It usually takes 3-4 minutes for each activity to complete. On inspecting the history table it is apparent that there is a long delay (1-2 hrs) between these activity tasks being scheduled and actually being run.

Expected behavior

The activity tasks should start within an acceptable time frame (a few seconds) of being scheduled.

Actual behavior

The activity tasks take 1 to 2 hours before being run.

Relevant source code snippets

We are creating the activity tasks in the orchestrator as follows:

// Fan out: one activity per file, then wait for both to complete.
var tasks = new Task<string>[2];
tasks[0] = context.CallActivityAsync<string>("ExtractPriceFileDocumentData",
    new DocumentExtractionRequest { PriceFileComparisonRequest = requestMessage, ExtractCurrentFile = true });
tasks[1] = context.CallActivityAsync<string>("ExtractPriceFileDocumentData",
    new DocumentExtractionRequest { PriceFileComparisonRequest = requestMessage, ExtractCurrentFile = false });
await Task.WhenAll(tasks);

The activity function looks like:

[FunctionName("ExtractPriceFileDocumentData")]
public async Task<string> ExtractPriceFileDocumentData([ActivityTrigger] DocumentExtractionRequest extractionRequest, ILogger log)
{
    var extractedFileLocation = string.Empty;

    var sourceFileLocation = extractionRequest.PriceFileComparisonRequest.CurrentSubmissionFileLocation;
    var sourceFileSubmissionId = extractionRequest.PriceFileComparisonRequest.CurrentSubmissionId;

    var sourceFileBytes = await StorageHelper.GetFileFromStorage(sourceFileLocation);

    var documentData = ConstructDocumentData(sourceFileBytes, extractionRequest.PriceFileComparisonRequest.PriceUploadRequest);

    var serializedData = JsonConvert.SerializeObject(documentData);
    var dataBytes = Encoding.UTF8.GetBytes(serializedData);
    using (var dataStream = new MemoryStream(dataBytes))
    {
        var fileLocationOfData = $"{Configuration[Constants.ConfigurationSetting.FileStorageDirectory]}/extracted_{sourceFileSubmissionId}.json";
        if (await StorageHelper.WriteFileToStorage(dataStream, fileLocationOfData))
        {
            extractedFileLocation = fileLocationOfData;
        }
    }

    return extractedFileLocation;
}

Known workarounds

None found at this time.

App Details

If deployed to Azure

olitomlinson commented 3 years ago

@dwdarkstorm1308

Nothing in your code jumps out at me as obviously wrong.

Please can you share your host.json configuration?

It’s worth considering that if you’re running on the Consumption plan there is a 5-minute execution limit per function invocation by default, which you may be encroaching on...

Have you tried running this code locally to see if the same thing occurs? (You can run against the local storage emulator, or a real Azure Storage Account, but I would try both)
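For reference, switching between the emulator and a real account locally is usually just a matter of the AzureWebJobsStorage value in local.settings.json (a sketch only; the runtime value shown is the emulator placeholder):

{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "dotnet"
  }
}

Swap UseDevelopmentStorage=true for the real storage account's connection string to run the same code locally against Azure Storage.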

dwdarkstorm1308 commented 3 years ago

@olitomlinson

Thanks for your response. Our host.json file doesn't really contain anything to speak of; it is as follows:

{
  "version": "2.0",
  "functionTimeout": "00:08:00"
}

Our function is indeed running on a Consumption plan, and as you can see I have tried to address the 5-minute limit, although so far I have not seen any of the activity functions taking as long as 5 minutes when they eventually do run. It's the time from when they are scheduled to when they begin to execute where I am seeing 1-2 hr delays in production.

When running locally using the local storage emulator, the entire process completes in under 10 minutes from the initial trigger of the orchestrator to all of the tasks being completed. I see no delays at all; I am only seeing this issue once deployed into production in Azure. Perhaps I could run locally against a real storage account, but if that reproduced the issue I'm not really sure what it would tell me beyond what I can already see in production.

olitomlinson commented 3 years ago

@dwdarkstorm1308 thanks!

Sometimes there are scale controller issues with Azure Functions on the Consumption plan which prevent orchestrations and activities from being processed as expected. It's not unheard of for the scale controller to fail to acknowledge that there is work to be done, so your App scales to nothing until something kicks the scale controller and your workloads get picked up.

Running locally against a real Azure Storage Account eliminates the scale controller as a possible cause. So it’s just another data point to help diagnose where the problem might lie.

If running locally against a real storage account does cause the same problem, then the only common element in all failure scenarios is the Storage Account. In that case, the next step would be to provision a new storage account, test against that instead, and see if the result changes.

Another thought... Do you have the same storage account being utilised for multiple DF Apps in prod? Or is it just the one DF app? I ask because by default Apps use the same TaskHub name, so you can get in a mess here if two Apps are pointing to the same Storage Account with the same default TaskHub name. Not sure if this is the issue but it’s worth asking.
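If it helps, a common way to keep slots apart (a sketch; the setting name is just an example) is to set the hub name explicitly in host.json and resolve it from a slot-sticky app setting:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "%TaskHubName%"
    }
  }
}

With TaskHubName configured as a deployment-slot (sticky) setting, with one value for production and another for staging, the two slots stop sharing the same queues and history table.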

dwdarkstorm1308 commented 3 years ago

@olitomlinson

Thanks for the suggestions above. I am in the process of making changes to our Function App, including ensuring we have set a TaskHub name. Whilst we only have one function app, it is deployed to a staging and a production slot, so I assume this is pretty much the same as having two different apps pointing at the same storage. I will configure the instances so that each slot has a different TaskHub name set in the configuration.

I will let you know how we progress once the changes are made and we have completed some testing.

olitomlinson commented 3 years ago

@dwdarkstorm1308

You’re correct - multiple slots running against the same Storage Account is, for all intents and purposes, the same as two different Function Apps targeting the same Storage Account. It’s likely that this is the problem you’re experiencing.

I agree - having each slot target a different TaskHub name should resolve this.

@cgillum

In this instance, is the app (or apps) attempting to run in a geo-redundant configuration?

Is this a symptom of the default TaskHub name being provided automatically? I’m curious whether the TaskHub name should be moved to an explicit property now that we have geo-redundancy adding to the complexity.

I get the desire to shield customers from as much configuration that exists behind the TaskHub curtain as possible, but something doesn’t quite feel right with this scenario.

ConnorMcMahon commented 3 years ago

So if the app is running on a non-production slot, it should require a taskhub to be specified, specifically to prevent issues with conflicting task hub names (the default task hub name is derived from the app name in Durable 2.0 and greater).

I'm going to take a look at the logs in Azure to see if I can diagnose what's happening here.

ConnorMcMahon commented 3 years ago

@dwdarkstorm1308,

Root cause: It appears you are hitting the storage deadlock (or extreme storage latency) repeatedly. When we see a storage request take longer than 2 minutes, we assume it is tied to a deadlock in the storage SDK (mainly due to past experience). If we detect this deadlock 5 times on a single worker, we kill the app, as we are afraid the worker is in a bad state and we don't want to keep losing threads.

This is actually a recent improvement; pre-2.4.1 we would restart the app the first time we detected this deadlock.

I see your app hit this restart condition every so often. When it does, the in-flight activities are killed and are retried when the visibility timeout on the activity queue message expires (5 minutes by default).

Because your app is hitting this deadlock repeatedly, it keeps crashing while trying to execute ExtractPriceFileDocumentData. After 32 tries, it finally succeeded without crashing and was able to persist that it completed.

As to how to prevent your app from hitting this, we have observed that the storage SDK deadlocks more frequently at high CPU consumption. This checks out with the behavior I see for your app.

CPU usage: [chart image]

App crashes: [chart image]

The correlation is striking.

Mitigation: The goal is to prevent your app from maxing out CPU usage. You can try throttling the number of concurrent activity functions, or try to reduce CPU usage in your functions (you may have a really tight computation/event loop that you could slow down).
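For the throttling route, the relevant knob is maxConcurrentActivityFunctions in host.json (the value below is only a starting point to experiment with, not a recommendation for your workload):

{
  "version": "2.0",
  "functionTimeout": "00:08:00",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 2
    }
  }
}

This caps how many activity invocations a single worker runs in parallel, which leaves more CPU headroom for the storage requests Durable Functions needs to make.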

Long term fix:

Obviously this is a huge failure case at high CPU usage, and it is something that should be handled on our end. We have tried to mitigate the impact of this storage deadlock, and although there are some further ways we could mitigate it more gracefully (for instance, "drain" the worker before killing it), the only way to truly eliminate this class of issue is to update our version of the storage SDK.

We have plans to do that in the very near future, but as you can guess, this is going to be a bit of an undertaking for a project of this scope, so it may take a few months to implement.

I hope that helps answer your questions, and sorry you hit this scenario. We know this causes a lot of pain, and it is a top priority for us to tackle.

olitomlinson commented 3 years ago

> So if the app is running on a non-production slot, it should require a taskhub to be specified, specifically to prevent issues with conflicting task hub names (the default task hub name is derived from the app name in Durable 2.0 and greater).
>
> I'm going to take a look at the logs in Azure to see if I can diagnose what's happening here.

@ConnorMcMahon

Given that there is no task hub specified in the provided host.json, does this indicate a regression?

ConnorMcMahon commented 3 years ago

Hmm, there does seem to be some regression here, because we see the staging slot pop up with a task hub that it shouldn't be able to use. I honestly wouldn't be surprised if that's contributing to the increased CPU usage, because both the slot and the app are trying to execute activity functions at the same time.

dwdarkstorm1308 commented 3 years ago

@olitomlinson @ConnorMcMahon

Thank you both for your time on this.

I have made some changes and want to report back my results, but I have been distracted by other things. I will update this post as soon as I get 5 minutes.

olitomlinson commented 3 years ago

@dwdarkstorm1308 you're welcome!

Did you have any success with your changes?

dwdarkstorm1308 commented 3 years ago

First of all, apologies for not getting back to you sooner. @olitomlinson @ConnorMcMahon

If anyone is interested in the solution but doesn't want to read through my waffle (TL;DR), please skip down to the solution.

So one of the first things I wanted to address was the long execution time of the activity that was taking 4-5 minutes to complete. We split this into smaller units of work, i.e. each activity was given a range of rows to process from the source document (this does have the side effect of each activity having to fetch the entire document and load it into memory just to process a subset of the data). This actually caused us an issue when it came to scaling out, which I have detailed below.
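Roughly, the fan-out ended up looking something like the following (a simplified sketch, not our exact code; the chunk size and the StartRow/EndRow properties are illustrative):

// Sketch: fan out one activity per row range instead of one activity per file.
// totalRows is assumed to be known to the orchestrator; chunkSize is illustrative.
var tasks = new List<Task<string>>();
const int chunkSize = 5000;
for (var startRow = 0; startRow < totalRows; startRow += chunkSize)
{
    tasks.Add(context.CallActivityAsync<string>("ExtractPriceFileDocumentData",
        new DocumentExtractionRequest
        {
            PriceFileComparisonRequest = requestMessage,
            ExtractCurrentFile = true,
            // hypothetical properties scoping the work for this activity
            StartRow = startRow,
            EndRow = Math.Min(startRow + chunkSize, totalRows) - 1
        }));
}
await Task.WhenAll(tasks);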

Whilst implementing the changes to ensure that each slot was using a separate task hub name, we came across an issue: we were injecting our own IConfiguration (for DI purposes), which was overwriting the configuration source that held the host.json settings. Fixing this ensured that the values we had set in the application configuration (via host.json) were being used by each application instance.
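For anyone who hits the same thing, the fix was along these lines (a sketch only; class and file names are illustrative): rather than registering a freshly built IConfiguration that replaces the one the Functions host supplies, we layered our own sources on top of the host-provided configuration.

// Sketch (usings and the [assembly: FunctionsStartup] attribute omitted for brevity).
// The key point is AddConfiguration(hostConfig), which keeps the host-built
// configuration (where our host.json-driven settings were surfaced) visible.
public class Startup : FunctionsStartup
{
    public override void Configure(IFunctionsHostBuilder builder)
    {
        var hostConfig = builder.Services
            .BuildServiceProvider()
            .GetRequiredService<IConfiguration>();

        var configuration = new ConfigurationBuilder()
            .AddConfiguration(hostConfig)                    // keep what the host already loaded
            .AddJsonFile("appsettings.json", optional: true) // illustrative extra source
            .AddEnvironmentVariables()
            .Build();

        builder.Services.AddSingleton<IConfiguration>(configuration);
    }
}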

We tested our app in development against a provisioned storage account (rather than the Azure Storage Emulator) and found that the application completed in a time comparable to what we saw with the emulated storage. We also provisioned a new storage account and linked it to our production application to test the results, and found that our issues persisted.

Our next step was to address the storage locking issue raised by @ConnorMcMahon. We spent some time implementing configuration-based knobs that we could twist without having to keep making deployments. This allowed us to adjust the number of concurrent activities being generated, the amount of work each activity was undertaking, and the frequency and duration of Thread.Sleep calls. We would have preferred to steer away from using Thread.Sleep; however, it was very effective in lowering CPU usage and therefore preventing the storage deadlocking.

Taking action on the task hub names, combined with the changes that gave the CPU some room to process storage requests, produced major improvements in timing, but we were still slow compared with development: we went from 1-2 hrs for our process to complete down to ~15-25 minutes, versus ~4-5 minutes in dev.

I then started to look at how the function app was scaling, using the configuration knobs we had implemented. We discovered that there are roughly 30-60s delays in standing up a new instance, so even with many activities scheduled, each processing a small subset of the data, there were delays while the scale controller stood up new instances and monitored the impact. We also observed that scaling down occurred while workloads were decreasing, meaning that some tasks had to wait before they were picked up. Once this was observed, I switched back to using a single activity to process 30,000 rows (but now with Thread.Sleep triggered at set numbers of iterations over those rows), and we now see times almost equivalent to those seen during development.

Thanks again for the pointers you gave above; they were instrumental in us being able to address the issues in the function app and get it working as we expected.

Solution: Our particular use case involved a few (~4) high-CPU activities, each lasting a few minutes (4-5) at a time.

The solution came in ensuring that we were using separate task hub names for each slot, and ensuring we were passing the existing configuration sources through when configuring our DI. These changes ensured each slot (application instance) was using separate storage/queues.

Adding a Thread.Sleep() (although this feels a bit hacky) after so many iterations of the computational loop allowed some breathing room, and the CPU was able to process requests to storage. This prevented the storage deadlocks and therefore our function app being killed off.
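For illustration, the shape of it was roughly as follows (a sketch; ProcessRow and the setting names are placeholders for our actual knobs, not real code from our app):

// Sketch: yield the CPU every N rows so storage requests on the same worker aren't starved.
// SleepEveryNRows / SleepDurationMs are illustrative configuration knobs.
var sleepEveryNRows = int.Parse(Configuration["SleepEveryNRows"] ?? "1000");
var sleepDurationMs = int.Parse(Configuration["SleepDurationMs"] ?? "50");

for (var row = 0; row < rows.Count; row++)
{
    ProcessRow(rows[row]); // the CPU-heavy per-row work (placeholder)

    if (sleepEveryNRows > 0 && row > 0 && row % sleepEveryNRows == 0)
    {
        Thread.Sleep(sleepDurationMs); // breathing room for the storage SDK
    }
}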

Attempting to break the high-CPU activities into smaller units of work and take advantage of scaling out didn't help in this case, due to the delay between an instance being stood up, the scale controller monitoring the situation, and it then deciding to stand up another instance (approx 30-60 seconds). I also observed that scaling down occurs before all of the activities have been processed. (Ideally for our use case it would have worked better if the app scaled down only when an instance had no work to do, rather than when there was a low amount of work to do.) I should note that had we been running high volumes of short-running activities (and not had the requirement to complete processing as quickly as possible), I suspect this would have been the way forward.