Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Purging functions history claims all resources #2211

Closed jsnet-ops closed 2 years ago

jsnet-ops commented 2 years ago

Description

We execute PurgeInstanceHistoryAsync daily to remove all completed, terminated, canceled, or failed instances from a single day (24 hours). The number of items to remove averages around 100,000 per day. After roughly 15 minutes of purging, the system slows down and other functions show response times of more than 30 seconds. Looking at memory and CPU usage, there should not be an issue. Because the number of items is so high, the purge fails after 30 minutes due to a timeout, but nothing is removed.

Expected behavior

We expect the purging to be done per item, so that if a timeout occurs, at least some items have already been removed. Also, purging should not claim all of the function app's resources.

Actual behavior

Too many resources are claimed for other functions to run properly, and no items are removed when a timeout occurs.

Relevant source code snippets

        [FunctionName(nameof(PurgeOrchestrationHistory))]
        public async Task Run(
            [DurableClient] IDurableOrchestrationClient client,
            [TimerTrigger(scheduleExpression: "%PurgeOrchestratorHistorySchedule%")] TimerInfo myTimer, ILogger logger)
        {
            try
            {
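                // daysToKeepOrchestrator is a class field not shown in this snippet (presumably read from configuration).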
                logger.LogInformation($"{nameof(PurgeOrchestrationHistory)} started. DaysToKeepOrchestrator:{daysToKeepOrchestrator}.");

                var daysToKeep = daysToKeepOrchestrator;

                // Purge one day at a time, to prevent timeouts
                var dayToRemoveStart = DateTime.UtcNow.Date.AddDays(-daysToKeep-1);
                var dayToRemoveEnd = DateTime.UtcNow.Date.AddDays(-daysToKeep);

                logger.LogInformation($"{nameof(PurgeOrchestrationHistory)} will purge all history between {dayToRemoveStart} and {dayToRemoveEnd}");

                var result = await client.PurgeInstanceHistoryAsync(
                    dayToRemoveStart,
                    dayToRemoveEnd,
                    new List<OrchestrationStatus>
                    {
                        OrchestrationStatus.Completed,
                        OrchestrationStatus.Canceled,
                        OrchestrationStatus.Terminated,
                        OrchestrationStatus.Failed,
                    });

                logger.LogInformation($"{nameof(PurgeOrchestrationHistory)} removed {result.InstancesDeleted} instances for date:{dayToRemoveStart:s} with types: \"Completed\", \"Canceled\", \"Failed\" or \"Terminated\".");

            }
            catch (Exception ex)
            {
                logger.LogError(ex, $"Exception occurred in {nameof(PurgeOrchestrationHistory)}: {ex.Message}");
            }
        }

Known workarounds

Use the API to purge instances one at a time (see the sketch below).
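A rough sketch of what per-instance purging could look like with the IDurableOrchestrationClient query APIs. The page size and the reuse of client, dayToRemoveStart, and dayToRemoveEnd from the snippet above are assumptions for illustration, not code from this issue:

    // Sketch only: list matching instances page by page, then purge each one
    // individually so progress is preserved even if the function times out.
    // Requires: using System.Threading; for CancellationToken.
    var condition = new OrchestrationStatusQueryCondition
    {
        CreatedTimeFrom = dayToRemoveStart,
        CreatedTimeTo = dayToRemoveEnd,
        RuntimeStatus = new[]
        {
            OrchestrationRuntimeStatus.Completed,
            OrchestrationRuntimeStatus.Canceled,
            OrchestrationRuntimeStatus.Terminated,
            OrchestrationRuntimeStatus.Failed,
        },
        PageSize = 1000, // illustrative page size, not a recommended value
    };

    do
    {
        // Fetch one page of matching instances.
        OrchestrationStatusQueryResult page =
            await client.ListInstancesAsync(condition, CancellationToken.None);

        // Delete each instance individually so already-purged instances stay gone
        // even if a later call fails or the timer function times out.
        foreach (DurableOrchestrationStatus status in page.DurableOrchestrationState)
        {
            await client.PurgeInstanceHistoryAsync(status.InstanceId);
        }

        condition.ContinuationToken = page.ContinuationToken;
    }
    while (condition.ContinuationToken != null);

Each delete is independent here, so a timeout part-way through still leaves the work done so far in place.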

App Details

<PackageReference Include="Microsoft.Azure.Functions.Extensions" Version="1.1.0" />
<PackageReference Include="Microsoft.Azure.WebJobs" Version="3.0.33" />
<PackageReference Include="Microsoft.Azure.WebJobs.Extensions" Version="4.0.1" />
<PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.7.1" />
<PackageReference Include="Microsoft.Azure.WebJobs.Extensions.Storage" Version="4.0.4" />
<PackageReference Include="Microsoft.NET.Sdk.Functions" Version="4.1.1" />

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

If you don't want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it's extremely difficult to look up information.

cgillum commented 2 years ago

Possible duplicate of https://github.com/Azure/azure-functions-durable-extension/issues/2129.

@sojwil the slowing down is expected because the purge operation can take a large toll on the I/O limits of the Azure Storage account. You won't see any impact on CPU or memory, just Azure Storage I/O, which affects the processing speed of other parts of your app.

We expect the purging to be done per item. So if a timeout occurs, it should have removed something.

I agree with this. Looking at the code for purging, it appears that the logic is implemented in such a way that it tries to query all instances in the provided range before deleting them individually. This is a problem since large scans can consume a lot of resources on the storage account (and potentially local memory), and, just like you said, a timeout means nothing gets removed. This is something we can look into.

As a workaround, check to see if you're able to change your purge logic to run more frequently and with smaller time windows. Purging smaller amounts of data at a time is more likely to be successful and have less impact on your app.
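A minimal sketch of that suggestion, reusing the variables from the snippet in the issue description and assuming a one-hour slice size (the slice length is an arbitrary choice, not something prescribed by the extension):

    // Split the 24-hour window into one-hour slices so each purge call scans
    // less data and a timeout loses at most one slice of work.
    for (var sliceStart = dayToRemoveStart; sliceStart < dayToRemoveEnd; sliceStart = sliceStart.AddHours(1))
    {
        var sliceEnd = sliceStart.AddHours(1);

        var sliceResult = await client.PurgeInstanceHistoryAsync(
            sliceStart,
            sliceEnd,
            new List<OrchestrationStatus>
            {
                OrchestrationStatus.Completed,
                OrchestrationStatus.Canceled,
                OrchestrationStatus.Terminated,
                OrchestrationStatus.Failed,
            });

        logger.LogInformation($"Purged {sliceResult.InstancesDeleted} instances created between {sliceStart:s} and {sliceEnd:s}.");
    }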

jsnet-ops commented 2 years ago

Thank you for this quick response. Looking forward to the added feature.

cgillum commented 2 years ago

I've made a fix in the DurableTask.AzureStorage dependency which should significantly improve the behavior. It will be included by default in the next extension release.

jsnet-ops commented 2 years ago

Excellent news. Thank you :)

SamVanhoutte commented 2 years ago

I've made a fix in the DurableTask.AzureStorage dependency which should significantly improve the behavior. It will be included by default in the next extension release.

Hey @cgillum, is there something we should do to benefit from your fix/change? I have the exact same issue: our application workflows are becoming terribly slow and every purge call is timing out, so it feels like we're in a state that only keeps getting worse.

cgillum commented 2 years ago

The fix for this has been included in the latest extension release: v2.8.0.
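To pick up the fix, the DurableTask package reference shown in the App Details above would need to be bumped to at least that version, for example:

    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.8.0" />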